Improving `get_metadata()`

IgnacioHeredia commented 2 months ago

@BorjaEst @vykozlov

Wouldn't it make sense to add the user metadata (ai4-metadata.yml) to the function get_metadata() in addition to the distribution metadata? To avoid breaking backward compatibility we can return them in a specific field (eg. user metadata?)

BorjaEst commented 1 month ago

Yes, it has for me. The only problem I have is about redundant information.

I would rather try to put all from metadata in pyproject.toml using different sections such deepaas, ai4os or tool, see: https://stackoverflow.com/questions/76924873/can-i-add-custom-data-to-a-pyproject-toml-file

However I suppose is a matter of taste? Specifically I am worried about how to handle cases where information does not match:

Authors are different
Project name are different
etc.

I know by the template it should match at the creation of the project, but if a user changes something, they should find out that they have to do it at both places.

If I understand correclty it should be enough to parse yaml -> dict and then return with get_metadata the parsed result as json instead of the package information right?

Two steps to keep in mind:

Add file to pyproject.toml so the file location of metadata is related to the installation path (although with -e is not need, is the normal thing to do)
Edit get_metadata to open file and return? We should still keep dynamic info like the models available? or should be something the user has to edit from the metadata? That would remove the flexibility to return the available models without restart the code.

IgnacioHeredia commented 1 month ago

Yes, it has for me. The only problem I have is about redundant information.

I would rather try to put all from metadata in pyproject.toml using different sections such deepaas, ai4os or tool, see: https://stackoverflow.com/questions/76924873/can-i-add-custom-data-to-a-pyproject-toml-file

However I suppose is a matter of taste? Specifically I am worried about how to handle cases where information does not match:

Two possibilities:

embrace redundancy, and have two fields in get_metadata: one to show package metadata and another with YAML metadata.
avoid redundancy and have single metadata. In this case, I think YAML values should have priority as they are more relevant.

If I understand correclty it should be enough to parse yaml -> dict and then return with get_metadata the parsed result as json instead of the package information right?

Yes.

Two steps to keep in mind:

Add file to pyproject.toml so the file location of metadata is related to the installation path (although with -e is not need, is the normal thing to do)

Edit get_metadata to open file and return? We should still keep dynamic info like the models available? or should be something the user has to edit from the metadata? That would remove the flexibility to return the available models without restart the code.

Not sure of what you mean by available models. Can you point me to an example in code?

BorjaEst commented 1 month ago

Sure, I will also add a bit more of context. In most of our projects, we need to parametrize the inference method call. In this case, with the name of the model that we want to use or the in some cases even a dataset or similar. However, there is no "pure" endpoint to tell to the user which models are available, therefore we use get_metadata to provide a list of available input for a specific argument (for example models).

See: https://github.com/ai4os-hub/ai4os-yolov8-torch/blob/4e1a043664917bfa2d978f1b118dc84aefae3352/api/__init__.py#L61-L62

Returns a field "models_remote" and "models_local" with the values that the inference can accept: https://github.com/ai4os-hub/ai4os-yolov8-torch/blob/4e1a043664917bfa2d978f1b118dc84aefae3352/api/schemas.py#L55-L69

IgnacioHeredia commented 1 month ago

I see.

Is this method capable of detecting a newly avaialable model, created after deepaas was launched? Because the standard procedure has always been to the restart deepaas in that case. Because the parameters are generated at launch time, the new model is not detected.

BorjaEst commented 1 month ago

Yes, that is the thing, it is possible to do it without restart. However, I suppose we can still do the same:

avoid redundancy using preference: 1st Code, 2nd Metadata YAML, 3rd pyproject.toml

For example the method get_methadata can override fields to update the pressent models.

BorjaEst commented 1 month ago

btw, I see we have now ai4-metadata.yml and metadata.json. Is this information redundant? Are we ready to rm metadata.json?

BorjaEst commented 1 month ago

I have been working on this, the main inconvenient I have found when using "ai4-metadata.yaml" is the installation and testing via tox: according to https://setuptools.pypa.io/en/latest/userguide/datafiles.html, package data should be located on the package folder (not the root folder). It is not an issue so far users install as "editable" environment (-e), but it reflects issues in tox (because tox tests always without "editable" and therefore "ai4-metadata.yaml" is not available).

Do you have any suggestion to bypass this? I only see 3 "correct" options:

move "ai4-meatadata.yml" inside package directory
use "pyproject.toml" to add additional metadata information
last if it cannot be helped: https://setuptools.pypa.io/en/latest/userguide/extension.html#defining-additional-metadata

The last one, defines a script which extends the installation egg_info metadata.

Also, this is again product of the issue of using half way for installation. Sometimes we treat the modules as installable, and other times not.

https://github.com/ai4os/DEEPaaS/issues/135

IgnacioHeredia commented 1 month ago

btw, I see we have now ai4-metadata.yml and metadata.json. Is this information redundant? Are we ready to rm metadata.json?

We are in the process of migrating from JSON to YAML. When Dashboard is ready to only show YAML, we can remove JSON from the template.

Do you have any suggestion to bypass this? I only see 3 "correct" options:

move "ai4-meatadata.yml" inside package directory

Not possible. PAPI needs to know where the metadata file is located. And PAPI only knows module repo url, not the package name. Technically, it might be possible if PAPI loaded the toml and looked for package name. But that would make the pipeline more complex and error-prone, as well as increase the latency (more calls to Github). So I would discard this (at least for the moment).

In addition, aesthetically, I don't find it very nice to have the metadata together with the code.

use "pyproject.toml" to add additional metadata information

But then this would mean having the metadata duplicated right? Having the same information both in the TOML and the YAML. This can lead to inconsistencies, because people update the YAML, not the TOML.

last if it cannot be helped: https://setuptools.pypa.io/en/latest/userguide/extension.html#defining-additional-metadata

The last one, defines a script which extends the installation egg_info metadata.

Well, this does not look too bad, it is our usecase right? The main issue I see is that the metadata would freeze to their values at the time of the installation right? Correct me if I'm wrong. That's why I was initially saying to dynamically load the YAML in get_metadata, to also reflect any posterior changes that might happen.

Also, this is again product of the issue of using half way for installation. Sometimes we treat the modules as installable, and other times not.

ai4os/DEEPaaS#135

Well the issue says it's on the roadmap. Given this and the fact the the issue is not super-urgent, maybe we can leave this on hold until DEEPaaS improves.

However, I would prefer to solve this issue now because I don't know when Álvaro will have time to keep working on DEEPaaS 3.0. So if we can solve it without relying on DEEPaaS, that's better.

BorjaEst commented 1 month ago

What about the folloing till v3.0?

Ignore the metadata comming from package (pyproject.tom) and create an environment variable that points to the ai4-metadata.yml. The trick might be to configure the environment variable correctly such for example "{cwd}/../{project_name}" so it points to the correct one in case that multiple projects are installed.

The reason whay I do not recommend to use the "installation path" is because we will have same problem with tox, the ai4-metadata.yml is not in the distribution package.

IgnacioHeredia commented 1 month ago

Sound fine to me!

BorjaEst commented 1 month ago

ok, I will prepare a PR for this

BorjaEst commented 1 month ago

Some fields are redundant:

{
  "id": "temp2",
  "name": "temp2",
  "links": {
    "ai4_template": "ai4-template/2.1.5",
    "source_code": "https://github.com/ai4os-hub/temp2",
    "docker_image": "ai4oshub/temp2"
  },
  "Metadata-Version": "2.1",
  "Name": "temp2",
  "Version": "0.0.1",
  "Summary": "Some temp project",
  "Author-email": "Borja <borja.est@example.com>",
  "License": "MIT",
  "Project-URL": "Homepage, https://github.com/ai4os-hub/temp2",
  "Classifier": "Intended Audience :: Information Technology",
  "Requires-Python": ">=3.8",
  "Description-Content-Type": "text/markdown",
  "License-File": "LICENSE",
  "Requires-Dist": "webargs ~=5.5.3",
  "Description": "# temp2\n[![Build Status](https://jenkins.services.ai4os.eu/buildStatus/icon?job=AI4OS-hub/temp2/main)](https://jenkins.services.ai4os.eu/job/AI4OS-hub/job/temp2/job/main/)\n\nSome temp project\n\nTo launch it, first install the package then run [deepaas](https://github.com/ai4os/DEEPaaS):\n```bash\ngit clone https://github.com/ai4os-hub/temp2\ncd temp2\npip install -e .\ndeepaas-run --listen-ip 0.0.0.0\n```\n\n## Project structure\n```\n│\n├── Dockerfile             <- Describes main steps on integration of DEEPaaS API and\n│                             temp2 application in one Docker image\n│\n├── Jenkinsfile            <- Describes basic Jenkins CI/CD pipeline (see .sqa/)\n│\n├── LICENSE                <- License file\n│\n├── README.md              <- The top-level README for developers using this project.\n│\n├── VERSION                <- temp2 version file\n│\n├── .sqa/                  <- CI/CD configuration files\n│\n├── temp2    <- Source code for use in this project.\n│   │\n│   ├── __init__.py        <- Makes temp2 a Python module\n│   │\n│   ├── api.py             <- Main script for the integration with DEEPaaS API\n│   |\n│   ├── config.py          <- Configuration file to define Constants used across temp2\n│   │\n│   └── misc.py            <- Misc functions that were helpful accross projects\n│\n├── data/                  <- Folder to store the data\n│\n├── models/                <- Folder to store models\n│   \n├── tests/                 <- Scripts to perfrom code testing\n|\n├── metadata.json          <- Metadata information propagated to the AI4OS Hub\n│\n├── pyproject.toml         <- a configuration file used by packaging tools, so temp2\n│                             can be imported or installed with  `pip install -e .`                             \n│\n├── requirements.txt       <- The requirements file for reproducing the analysis environment, i.e.\n│                             contains a list of packages needed to make temp2 work\n│\n├── requirements-test.txt  <- The requirements file for running code tests (see tests/ directory)\n│\n└── tox.ini                <- Configuration file for the tox tool used for testing (see .sqa/)\n```\n",
  "Author-emails": {
    "Borja": "borja.est@example.com"
  },
  "Authors": [
    "Borja"
  ],
  "metadata_version": "2.0.0",
  "title": "temp2",
  "summary": "Some temp project",
  "description": "temp2 is an application using the DEEPaaS API.\nSome temp project",
  "dates": {
    "created": "2024-10-01",
    "updated": "2024-10-01"
  }
}

BorjaEst commented 1 month ago

Here is the list of fields need to rework:

"id", "name", "Name", "title",
"links", "Project-URL",
"Metadata-Version", "metadata_version",
"Version": Software version
"Summary", "summary",
"Description", "description",
"Authors",
"Author-email", "Author-emails"
"License", "License-File", "LICENSE",

Metadata-Version is the version for importlib.metadata, I do not see there is a easy outcome from merging this 2 things together. I propose the following options:

Ignore importlib.metadata and use only metadata from ai4-metadata.yml
Make a subfield package-metadatata and put the importlib.metadata there.

I vote for 1

BorjaEst commented 1 month ago

metadata "date" is giving problems but I see you are fixing it in the other PR

BorjaEst commented 1 month ago

Left, metadata from "importlib", left metadata from "ai4-metadata"

vykozlov commented 1 month ago

@BorjaEst , @IgnacioHeredia I think, merging different metadata creates a bigger mess, sorry. pyproject.toml is the python standard and we insist in having it. Possibly: a) we return only ai4-metadata.yml info disadvantages:

does not have "author", "license", "version" etc. i.e. have to retrieve anyway
breaks backwards compatibility with old get_metadata(), also will not pass current CI/CD checks (i.e. have to be updated)

b) return exactly the same metadata as before, but add one more field: "ai4-metadata": github_url/ai4-metadata.yml i.e. not the full a4-metadata.yml is returned but the link to it advantages:

does not break the compatibility
ai4-metadata.yml is easily foundable

BorjaEst commented 1 month ago

ok, I can fix it that way.

BorjaEst commented 1 month ago

This would be the output, please confirm before I merge:

{
  "id": "temp1",
  "name": "temp1",
  "links": [
    {
      "rel": "self",
      "href": "/v2/models/temp1"
    }
  ],
  "Metadata-Version": "2.1",
  "Name": "temp1",
  "Version": "0.0.1",
  "Summary": "asdfsdf",
  "Author-email": "asdf <asdf@asdf.sdf>",
  "License": "MIT",
  "Project-URL": "Homepage, https://github.com/ai4os-hub/temp1",
  "Classifier": "Intended Audience :: Information Technology",
  "Requires-Python": ">=3.8",
  "Description-Content-Type": "text/markdown",
  "License-File": "LICENSE",
  "Requires-Dist": "webargs ~=5.5.3",
  "Description": "# temp1\n[![Build Status](https://jenkins.services.ai4os.eu/buildStatus/icon?job=AI4OS-hub/temp1/main)](https://jenkins.services.ai4os.eu/job/AI4OS-hub/job/temp1/job/main/)\n\nasdfsdf\n\nTo launch it, first install the package then run [deepaas](https://github.com/ai4os/DEEPaaS):\n```bash\ngit clone https://github.com/ai4os-hub/temp1\ncd temp1\npip install -e .\ndeepaas-run --listen-ip 0.0.0.0\n```\n\n## Project structure\n```\n│\n├── Dockerfile             <- Describes main steps on integration of DEEPaaS API and\n│                             temp1 application in one Docker image\n│\n├── Jenkinsfile            <- Describes basic Jenkins CI/CD pipeline (see .sqa/)\n│\n├── LICENSE                <- License file\n│\n├── README.md              <- The top-level README for developers using this project.\n│\n├── VERSION                <- temp1 version file\n│\n├── .sqa/                  <- CI/CD configuration files\n│\n├── temp1    <- Source code for use in this project.\n│   │\n│   ├── __init__.py        <- Makes temp1 a Python module\n│   │\n│   ├── api.py             <- Main script for the integration with DEEPaaS API\n│   |\n│   ├── config.py          <- Configuration file to define Constants used across temp1\n│   │\n│   └── misc.py            <- Misc functions that were helpful accross projects\n│\n├── data/                  <- Folder to store the data\n│\n├── models/                <- Folder to store models\n│   \n├── tests/                 <- Scripts to perfrom code testing\n|\n├── metadata.json          <- Metadata information propagated to the AI4OS Hub\n│\n├── pyproject.toml         <- a configuration file used by packaging tools, so temp1\n│                             can be imported or installed with  `pip install -e .`                             \n│\n├── requirements.txt       <- The requirements file for reproducing the analysis environment, i.e.\n│                             contains a list of packages needed to make temp1 work\n│\n├── requirements-test.txt  <- The requirements file for running code tests (see tests/ directory)\n│\n└── tox.ini                <- Configuration file for the tox tool used for testing (see .sqa/)\n```\n",
  "Author-emails": {
    "asdf": "asdf@asdf.sdf"
  },
  "Authors": [
    "asdf"
  ],
  "AI4-Metadata": {
    "metadata_version": "2.0.0",
    "title": "temp1",
    "summary": "asdfsdf",
    "description": "temp1 is an application using the DEEPaaS API.\nasdfsdf",
    "dates": {
      "created": "2024-10-09",
      "updated": "2024-10-09"
    },
    "links": {
      "ai4_template": "ai4-template/2.1.5",
      "source_code": "https://github.com/ai4os-hub/temp1",
      "docker_image": "ai4oshub/temp1"
    }
  }
}

Note the following:

  "id": "temp1",
  "name": "temp1",
  "links": [
    {
      "rel": "self",
      "href": "/v2/models/temp1"
    }
  ],

Are added by DEEPaaS, not by importlib, but I suppose it really does not matter as we are also modifying the authors and emails format.

Some of fields are capital and other are not.

IgnacioHeredia commented 1 month ago

@vykozlov I agree with your points:

we should maintain backwards compatibility if possible,
some TOML info is not in the YAML metadata (eg. author)

That's why initially we discussed including the AI4 metadata in a separate key, just as Borja shows in the previous comment. As a dict, not as a link to Github link. Because otherwise you need internet connection to read the metadata.

If the output is messy it's because our workflow can be messy and the same info can be repeated both in TOML and YAML, sometimes becoming unsynced if you change one but not the other. All in all, I agree with returning both TOML and YAML, only that YAML should be returned as dict, not link.

This would be the output, please confirm before I merge:

Confirmed, from my side.

vykozlov commented 1 month ago

:exclamation: I am moving discussion to our internal emails :exclamation:

maybe to be clear, I find the approach of including already existing toml|yml files messy. Think of:

ai4-metadata.yml "description:" field can be a very long text. This is full text prepared for the marketplace, potentially with HTML links to images, References, what's not.
our current CI/CD does not rebuild Docker images, if only ai4-metadata.yml was updated. Simple reason, testing full application and building Docker image takes time and resources, so no sense to rebuild if only some info in "ai4-metadata.yml" was updated. In the other words, get_metadata() may not return proper ai4-metadata.yml

The suggestion was to have "current metadata + ai4-metadata link" E.g.

{
  "id": "ai4os_demo_app",
  "name": "ai4os_demo_app",
  "links": [
    {
      "rel": "self",
      "href": "/v2/models/ai4os_demo_app"
    }
  ],
  "version": "0.0.1.dev29",
  "author": "Max Mustermann",
  "author-email": "mustermann@some-email.eu",
  "license": "MIT"
  "ai4-metadata": https://raw.githubusercontent.com/ai4os-hub/ai4os-demo-app/refs/heads/main/ai4-metadata.yml
}

but maybe even "ai4-metadata.yml" URL is a bit redundant, as just GitHub repo link should be enough

ai4os / ai4-template

Improving `get_metadata()` #9