IgnacioHeredia opened this issue 2 months ago
Yes, it has for me. The only problem I have is about redundant information. I would rather try to put everything from `metadata` into `pyproject.toml` using different sections such as `deepaas`, `ai4os` or `tool`, see: https://stackoverflow.com/questions/76924873/can-i-add-custom-data-to-a-pyproject-toml-file
However, I suppose it is a matter of taste? Specifically, I am worried about how to handle cases where the information does not match: I know that with the template it should match at project creation, but if a user changes something, they have to find out that they need to do it in both places.
> Two possibilities: one `get_metadata` to show the package metadata and another with the YAML metadata.

If I understand correctly, it should be enough to parse the YAML into a dict and then have `get_metadata` return the parsed result as JSON instead of the package information, right?
Yes.
Two steps to keep in mind:
- Add the file to `pyproject.toml` so that the location of the metadata file is tied to the installation path (although with `-e` this is not needed, it is the normal thing to do).
- Edit `get_metadata` to open the file and return it? Should we still keep dynamic info like the available models, or should that be something the user has to edit in the metadata? The latter would remove the flexibility of returning the available models without restarting the code (see the sketch below).
Not sure what you mean by available models. Can you point me to an example in the code?
Sure, I will also add a bit more context.
In most of our projects, we need to parametrize the inference method call, in this case with the name of the model that we want to use, or in some cases even a dataset or similar. However, there is no "pure" endpoint to tell the user which models are available, therefore we use `get_metadata` to provide the list of available inputs for a specific argument (for example, models).
It returns the fields "models_remote" and "models_local" with the values that the inference can accept: https://github.com/ai4os-hub/ai4os-yolov8-torch/blob/4e1a043664917bfa2d978f1b118dc84aefae3352/api/schemas.py#L55-L69
I see.
Is this method capable of detecting a newly available model, created after DEEPaaS was launched? The standard procedure has always been to restart DEEPaaS in that case: because the parameters are generated at launch time, the new model is not detected otherwise.
Yes, that is the thing: it is possible to do it without a restart. However, I suppose we can still do the same: for example, the method `get_metadata` can override fields to update the list of models that are currently present.
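As an illustration of that idea, a hedged sketch where the static YAML metadata is merged with a freshly computed list of models on every call (the `models/` directory and the `models_local` field are placeholders borrowed from the example above):

```python
# Sketch: re-read the YAML and refresh the dynamic fields on every call,
# so models added after deepaas was launched are still reported.
from pathlib import Path

import yaml

MODELS_DIR = Path("models")  # placeholder location for locally stored models


def get_metadata():
    with open("ai4-metadata.yml", "r", encoding="utf-8") as f:
        metadata = yaml.safe_load(f)
    # Override the field with whatever is present right now.
    if MODELS_DIR.is_dir():
        metadata["models_local"] = sorted(p.name for p in MODELS_DIR.iterdir())
    return metadata
```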
btw, I see we now have `ai4-metadata.yml` and `metadata.json`. Is this information redundant? Are we ready to rm `metadata.json`?
I have been working on this. The main inconvenience I have found when using "ai4-metadata.yml" is the installation and testing via `tox`: according to https://setuptools.pypa.io/en/latest/userguide/datafiles.html, package data should be located in the package folder (not the root folder). It is not an issue as long as users install in "editable" mode (`-e`), but it causes problems in `tox` (because `tox` always tests without an "editable" install, and therefore "ai4-metadata.yml" is not available).
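For reference, this is what accessing it as package data would look like if the file lived inside the package and were declared as package data (just a sketch; `temp2` is a placeholder package name taken from the example output below):

```python
# Sketch: package data can be read after any install (editable or not),
# but only if the file is shipped inside the package folder.
# "temp2" is a placeholder package name; the file would also need to be
# declared as package data in pyproject.toml.
from importlib.resources import files

import yaml

metadata_text = (files("temp2") / "ai4-metadata.yml").read_text(encoding="utf-8")
metadata = yaml.safe_load(metadata_text)
```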
Do you have any suggestion to bypass this? I only see 3 "correct" options:
- move "ai4-metadata.yml" inside the package directory
- use "pyproject.toml" to add the additional metadata information
- last, if it cannot be helped: https://setuptools.pypa.io/en/latest/userguide/extension.html#defining-additional-metadata

The last one defines a script which extends the installation egg_info metadata.
Also, this is again a product of the issue of going only halfway with installation: sometimes we treat the modules as installable packages, and other times not.
> btw, I see we now have `ai4-metadata.yml` and `metadata.json`. Is this information redundant? Are we ready to rm `metadata.json`?
We are in the process of migrating from JSON to YAML. When Dashboard is ready to only show YAML, we can remove JSON from the template.
> Do you have any suggestion to bypass this? I only see 3 "correct" options:
> - move "ai4-metadata.yml" inside the package directory
Not possible. PAPI needs to know where the metadata file is located, and PAPI only knows the module repo URL, not the package name. Technically, it might be possible if PAPI loaded the TOML and looked for the package name, but that would make the pipeline more complex and error-prone, as well as increase the latency (more calls to GitHub). So I would discard this (at least for the moment).
In addition, aesthetically, I don't find it very nice to have the metadata together with the code.
- use "pyproject.toml" to add additional metadata information
But then this would mean having the metadata duplicated, right? Having the same information both in the TOML and the YAML can lead to inconsistencies, because people update the YAML, not the TOML.
> - last, if it cannot be helped: https://setuptools.pypa.io/en/latest/userguide/extension.html#defining-additional-metadata
>
> The last one defines a script which extends the installation egg_info metadata.
Well, this does not look too bad; it is our use case, right? The main issue I see is that the metadata would be frozen at whatever values they had at installation time, right? Correct me if I'm wrong. That's why I was initially suggesting to load the YAML dynamically in `get_metadata`, so that it also reflects any later changes.
> Also, this is again a product of the issue of going only halfway with installation: sometimes we treat the modules as installable packages, and other times not.
Well, the issue says it's on the roadmap. Given this, and the fact that this issue is not super urgent, maybe we can leave it on hold until DEEPaaS improves.
However, I would prefer to solve this issue now because I don't know when Álvaro will have time to keep working on DEEPaaS 3.0. So if we can solve it without relying on DEEPaaS, that's better.
What about the following until v3.0?
Ignore the metadata coming from the package (`pyproject.toml`) and create an environment variable that points to the `ai4-metadata.yml`. The trick might be to configure the environment variable correctly, for example as "{cwd}/../{project_name}", so that it points to the correct file in case multiple projects are installed.
The reason why I do not recommend using the "installation path" is that we would have the same problem with `tox`: the `ai4-metadata.yml` is not in the distribution package.
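Roughly, a sketch of that workaround; the variable name `AI4_METADATA` and the fallback default are my own placeholders, not an agreed interface:

```python
# Sketch of the proposed interim solution: an environment variable points to
# ai4-metadata.yml, so neither pyproject.toml nor the installation path is used.
import os
from pathlib import Path

import yaml

# AI4_METADATA is a hypothetical variable name; it could be set to something
# like "{cwd}/../{project_name}/ai4-metadata.yml" as suggested above.
METADATA_PATH = Path(os.environ.get("AI4_METADATA", "ai4-metadata.yml"))


def get_metadata():
    with open(METADATA_PATH, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)
```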
Sounds fine to me!
ok, I will prepare a PR for this
Some fields are redundant:
{
"id": "temp2",
"name": "temp2",
"links": {
"ai4_template": "ai4-template/2.1.5",
"source_code": "https://github.com/ai4os-hub/temp2",
"docker_image": "ai4oshub/temp2"
},
"Metadata-Version": "2.1",
"Name": "temp2",
"Version": "0.0.1",
"Summary": "Some temp project",
"Author-email": "Borja <borja.est@example.com>",
"License": "MIT",
"Project-URL": "Homepage, https://github.com/ai4os-hub/temp2",
"Classifier": "Intended Audience :: Information Technology",
"Requires-Python": ">=3.8",
"Description-Content-Type": "text/markdown",
"License-File": "LICENSE",
"Requires-Dist": "webargs ~=5.5.3",
"Description": "# temp2\n[![Build Status](https://jenkins.services.ai4os.eu/buildStatus/icon?job=AI4OS-hub/temp2/main)](https://jenkins.services.ai4os.eu/job/AI4OS-hub/job/temp2/job/main/)\n\nSome temp project\n\nTo launch it, first install the package then run [deepaas](https://github.com/ai4os/DEEPaaS):\n```bash\ngit clone https://github.com/ai4os-hub/temp2\ncd temp2\npip install -e .\ndeepaas-run --listen-ip 0.0.0.0\n```\n\n## Project structure\n```\n│\n├── Dockerfile <- Describes main steps on integration of DEEPaaS API and\n│ temp2 application in one Docker image\n│\n├── Jenkinsfile <- Describes basic Jenkins CI/CD pipeline (see .sqa/)\n│\n├── LICENSE <- License file\n│\n├── README.md <- The top-level README for developers using this project.\n│\n├── VERSION <- temp2 version file\n│\n├── .sqa/ <- CI/CD configuration files\n│\n├── temp2 <- Source code for use in this project.\n│ │\n│ ├── __init__.py <- Makes temp2 a Python module\n│ │\n│ ├── api.py <- Main script for the integration with DEEPaaS API\n│ |\n│ ├── config.py <- Configuration file to define Constants used across temp2\n│ │\n│ └── misc.py <- Misc functions that were helpful accross projects\n│\n├── data/ <- Folder to store the data\n│\n├── models/ <- Folder to store models\n│ \n├── tests/ <- Scripts to perfrom code testing\n|\n├── metadata.json <- Metadata information propagated to the AI4OS Hub\n│\n├── pyproject.toml <- a configuration file used by packaging tools, so temp2\n│ can be imported or installed with `pip install -e .` \n│\n├── requirements.txt <- The requirements file for reproducing the analysis environment, i.e.\n│ contains a list of packages needed to make temp2 work\n│\n├── requirements-test.txt <- The requirements file for running code tests (see tests/ directory)\n│\n└── tox.ini <- Configuration file for the tox tool used for testing (see .sqa/)\n```\n",
"Author-emails": {
"Borja": "borja.est@example.com"
},
"Authors": [
"Borja"
],
"metadata_version": "2.0.0",
"title": "temp2",
"summary": "Some temp project",
"description": "temp2 is an application using the DEEPaaS API.\nSome temp project",
"dates": {
"created": "2024-10-01",
"updated": "2024-10-01"
}
}
Here is the list of fields that need rework:
- `Metadata-Version` is the version for `importlib.metadata`; I do not see an easy outcome from merging these 2 things together. I propose the following options:
  1. Drop `importlib.metadata` and use only the metadata from `ai4-metadata.yml`.
  2. Add a `package-metadata` field and put the `importlib.metadata` information there.

  I vote for 1.
- The metadata "date" is giving problems, but I see you are fixing it in the other PR.
(Screenshots: metadata from "importlib" and metadata from "ai4-metadata".)
@BorjaEst, @IgnacioHeredia I think merging the different metadata creates a bigger mess, sorry. pyproject.toml is the Python standard and we insist on having it. Possibly:
a) we return only the ai4-metadata.yml info
disadvantages:
b) we return exactly the same metadata as before, but add one more field: "ai4-metadata": github_url/ai4-metadata.yml, i.e. not the full ai4-metadata.yml is returned but a link to it
advantages:
ok, I can fix it that way.
This would be the output, please confirm before I merge:
{
"id": "temp1",
"name": "temp1",
"links": [
{
"rel": "self",
"href": "/v2/models/temp1"
}
],
"Metadata-Version": "2.1",
"Name": "temp1",
"Version": "0.0.1",
"Summary": "asdfsdf",
"Author-email": "asdf <asdf@asdf.sdf>",
"License": "MIT",
"Project-URL": "Homepage, https://github.com/ai4os-hub/temp1",
"Classifier": "Intended Audience :: Information Technology",
"Requires-Python": ">=3.8",
"Description-Content-Type": "text/markdown",
"License-File": "LICENSE",
"Requires-Dist": "webargs ~=5.5.3",
"Description": "# temp1\n[![Build Status](https://jenkins.services.ai4os.eu/buildStatus/icon?job=AI4OS-hub/temp1/main)](https://jenkins.services.ai4os.eu/job/AI4OS-hub/job/temp1/job/main/)\n\nasdfsdf\n\nTo launch it, first install the package then run [deepaas](https://github.com/ai4os/DEEPaaS):\n```bash\ngit clone https://github.com/ai4os-hub/temp1\ncd temp1\npip install -e .\ndeepaas-run --listen-ip 0.0.0.0\n```\n\n## Project structure\n```\n│\n├── Dockerfile <- Describes main steps on integration of DEEPaaS API and\n│ temp1 application in one Docker image\n│\n├── Jenkinsfile <- Describes basic Jenkins CI/CD pipeline (see .sqa/)\n│\n├── LICENSE <- License file\n│\n├── README.md <- The top-level README for developers using this project.\n│\n├── VERSION <- temp1 version file\n│\n├── .sqa/ <- CI/CD configuration files\n│\n├── temp1 <- Source code for use in this project.\n│ │\n│ ├── __init__.py <- Makes temp1 a Python module\n│ │\n│ ├── api.py <- Main script for the integration with DEEPaaS API\n│ |\n│ ├── config.py <- Configuration file to define Constants used across temp1\n│ │\n│ └── misc.py <- Misc functions that were helpful accross projects\n│\n├── data/ <- Folder to store the data\n│\n├── models/ <- Folder to store models\n│ \n├── tests/ <- Scripts to perfrom code testing\n|\n├── metadata.json <- Metadata information propagated to the AI4OS Hub\n│\n├── pyproject.toml <- a configuration file used by packaging tools, so temp1\n│ can be imported or installed with `pip install -e .` \n│\n├── requirements.txt <- The requirements file for reproducing the analysis environment, i.e.\n│ contains a list of packages needed to make temp1 work\n│\n├── requirements-test.txt <- The requirements file for running code tests (see tests/ directory)\n│\n└── tox.ini <- Configuration file for the tox tool used for testing (see .sqa/)\n```\n",
"Author-emails": {
"asdf": "asdf@asdf.sdf"
},
"Authors": [
"asdf"
],
"AI4-Metadata": {
"metadata_version": "2.0.0",
"title": "temp1",
"summary": "asdfsdf",
"description": "temp1 is an application using the DEEPaaS API.\nasdfsdf",
"dates": {
"created": "2024-10-09",
"updated": "2024-10-09"
},
"links": {
"ai4_template": "ai4-template/2.1.5",
"source_code": "https://github.com/ai4os-hub/temp1",
"docker_image": "ai4oshub/temp1"
}
}
}
Note the following:
"id": "temp1",
"name": "temp1",
"links": [
{
"rel": "self",
"href": "/v2/models/temp1"
}
],
These are added by DEEPaaS, not by importlib, but I suppose it really does not matter, as we are also modifying the authors and emails format.
Some of the fields are capitalized and others are not.
@vykozlov I agree with your points:
That's why, initially, we discussed including the AI4 metadata in a separate key, just as Borja shows in the previous comment. As a dict, not as a link to GitHub, because otherwise you need an internet connection to read the metadata.
If the output is messy, it's because our workflow can be messy: the same info can be repeated both in the TOML and the YAML, sometimes becoming unsynced if you change one but not the other. All in all, I agree with returning both TOML and YAML, only that the YAML should be returned as a dict, not as a link.
> This would be the output, please confirm before I merge:
Confirmed, from my side.
:exclamation: I am moving the discussion to our internal emails :exclamation:
Maybe to be clear: I find the approach of including the already existing toml|yml files messy. Think of:
The suggestion was to have "current metadata + ai4-metadata link", e.g.:
{
"id": "ai4os_demo_app",
"name": "ai4os_demo_app",
"links": [
{
"rel": "self",
"href": "/v2/models/ai4os_demo_app"
}
],
"version": "0.0.1.dev29",
"author": "Max Mustermann",
"author-email": "mustermann@some-email.eu",
"license": "MIT"
"ai4-metadata": https://raw.githubusercontent.com/ai4os-hub/ai4os-demo-app/refs/heads/main/ai4-metadata.yml
}
But maybe even the "ai4-metadata.yml" URL is a bit redundant, as just the GitHub repo link should be enough.
@BorjaEst @vykozlov
Wouldn't it make sense to add the user metadata (`ai4-metadata.yml`) to the function `get_metadata()` in addition to the distribution metadata? To avoid breaking backward compatibility, we can return it in a specific field (e.g. `user metadata`?)
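A minimal sketch of that idea: keep the distribution metadata as it is returned today and nest the YAML content under a dedicated key (the key name "user-metadata" and the function signature are illustrative only):

```python
# Sketch: return the distribution metadata as before, plus the content of
# ai4-metadata.yml under a separate key, so existing consumers keep working.
from importlib.metadata import metadata as dist_metadata

import yaml


def get_metadata(package_name: str, metadata_file: str = "ai4-metadata.yml") -> dict:
    meta = dict(dist_metadata(package_name))  # current behaviour: package metadata
    with open(metadata_file, "r", encoding="utf-8") as f:
        # "user-metadata" is a placeholder for whatever field name is agreed on.
        meta["user-metadata"] = yaml.safe_load(f)
    return meta
```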