FAIRiCUBE / resource-metadata

Manage information for processing/analysis resources, specifically: an issue form to collect metadata requirements and an issue template to manage codelists
https://fairicube.github.io/resource-metadata/

Searchable metadata for ML models in rasdaman #9

Open pebau opened 1 year ago

pebau commented 1 year ago

linking in @KathiSchleidt , @sMorrone , @robknapen , @ocampos16

robknapen commented 1 year ago

For the catalogue there is already an initial form to provide ML model information; it would be good to try that and see if it needs changes. I would also think that a user should be able to query rasdaman to get a list of stored models, which might return less extensive metadata but should include appropriate information on how to invoke the model. Both sides should then be linked up somehow.

cozzolinoac11 commented 1 year ago

Regarding the use of the form, in case of ML and DL resources:

cozzolinoac11 commented 1 year ago

Regarding 'How would these be provided / harvested into the catalog?': when the form is filled in, the YAML files are created automatically and stored in the 'yaml-file' folder of the 'resource-metadata' repository. A Python procedure can then publish these files automatically in the a/p resources catalog. So once EOX (@Schpidi) creates the catalog, the files will be published in it automatically.
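
To make the publishing step a bit more concrete, here is a minimal sketch of how such a procedure might walk the 'yaml-file' folder. This is not the actual FAIRiCUBE script: the function names and the `publish` callback are hypothetical, and the real procedure would presumably also parse and validate the YAML before sending it to the catalog.

```python
from pathlib import Path
from typing import Callable


def find_metadata_files(folder: Path) -> list[Path]:
    """Return the YAML files in `folder`, sorted by path for a stable order."""
    return sorted(p for p in folder.iterdir() if p.suffix in {".yaml", ".yml"})


def publish_all(folder: Path, publish: Callable[[str, str], None]) -> int:
    """Hand each file's name and raw text to `publish`; return the file count.

    `publish` is a hypothetical stand-in for whatever call the catalog
    eventually offers (HTTP upload, git commit, ...).
    """
    files = find_metadata_files(folder)
    for path in files:
        publish(path.name, path.read_text(encoding="utf-8"))
    return len(files)
```

Keeping the catalog call behind a callback means the folder scan can be tried out now, before the catalog side is settled.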

cozzolinoac11 commented 1 year ago

Concerning 'How can users search for models?' this is something that should also be discussed with EOX (which is responsible for the catalogs). Regarding the search on GitHub, using the label 'a/p resource metadata' from the issue list, it is possible to obtain all the issues concerning a/p resource metadata. However, regarding the catalogue search, looking at the catalog that EOX has made for the data, at the moment one can search by tag and keyword (only one label at a time), but (our) idea is to extend this possibility through the use of queries.
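
As an illustration of what "extending the search through queries" could mean, here is a small sketch that combines several tags with a keyword in one search. The `Record` class and its fields are assumptions made for this example only and do not reflect the actual catalog schema.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical catalog entry, reduced to the fields used for searching."""
    title: str
    tags: set[str] = field(default_factory=set)
    description: str = ""


def search(records, tags=(), keyword=""):
    """Return the records that carry ALL given tags and contain the keyword.

    Matching is case-insensitive; an empty keyword matches every record.
    """
    wanted = {t.lower() for t in tags}
    needle = keyword.lower()
    hits = []
    for rec in records:
        rec_tags = {t.lower() for t in rec.tags}
        text = f"{rec.title} {rec.description}".lower()
        if wanted <= rec_tags and needle in text:
            hits.append(rec)
    return hits
```

For example, `search(records, tags=["ML", "PyTorch"], keyword="crop")` would return only the entries that have both tags and mention "crop" in the title or description.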

robknapen commented 1 year ago
  • the 'Output data obtained' field is intended to provide information about where the model is stored i.e., it is a pointer to the model via a link

@pebau Would such a (deep) link (URI?) to a model stored in rasdaman be possible? What would be a proper reference to use in the form?

KathiSchleidt commented 1 year ago

On searchability, we definitely need to update the FAIRiCUBE Catalog to enable more complex search. Details under the catalog repo, issue 5.

On links to the model, good question of how to provide this link. Is it safe to say for the present that the links refer to UDFs? Probably needs to be rethought a bit due to adding JupyterHub to rasdaman.

@Schpidi : How would this be done with EOX?

pebau commented 1 year ago

@KathiSchleidt

On links to the model, good question of how to provide this link.

That depends on what the answer should be: the pure identifier (possibly embedded in the corresponding query and URL) would give back the model, i.e. a byte string.

What would you want to get back when clicking such a URL?

pebau commented 1 year ago
  • the 'Output data obtained' field is intended to provide information about where the model is stored i.e., it is a pointer to the model via a link

@pebau Would such a (deep) link (URI?) to a model stored in rasdaman be possible? What would be a proper reference to use in the form?

Yes, but is that what you want? You would get back the byte string comprising the model.

robknapen commented 1 year ago

That's a good question. With rasdaman specifically, we would currently have the option to (A) return the TorchScript data of the model (which someone could load directly into PyTorch), (B) refer the user to a template WCPS request with the referenced model already filled in (not sure if that is doable), or (C) provide a webpage describing the model and its intended usage (but that would be kind of a duplicate of the catalogue entry, I guess). Maybe there are more alternatives?
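
To sketch what option (B) might look like: a helper that fills a model identifier into a query template and wraps it into a request URL. Everything specific here is an assumption for illustration: the `__MODEL__` placeholder, the endpoint, the key-value parameters (modelled on a WCS `ProcessCoverages` request), and above all the query text passed in, which is left to the caller because the actual syntax for invoking a stored model is not settled in this thread.

```python
from urllib.parse import urlencode

# Hypothetical marker that the catalog entry's query template would contain.
PLACEHOLDER = "__MODEL__"


def build_request_url(endpoint: str, query_template: str, model_id: str) -> str:
    """Fill `model_id` into the template and return a URL-encoded request.

    The parameter names are an assumption modelled on a WCS ProcessCoverages
    key-value request; adjust them to whatever the service really expects.
    """
    if PLACEHOLDER not in query_template:
        raise ValueError(f"query template lacks the {PLACEHOLDER} placeholder")
    query = query_template.replace(PLACEHOLDER, model_id)
    params = {
        "service": "WCS",
        "version": "2.0.1",
        "request": "ProcessCoverages",
        "query": query,
    }
    return f"{endpoint}?{urlencode(params)}"
```

The catalog entry could then store the template once per model, and the link shown to the user would be the URL built from it.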

Schpidi commented 1 year ago

Concerning 'How can users search for models?' this is something that should also be discussed with EOX (which is responsible for the catalogs). Regarding the search on GitHub, using the label 'a/p resource metadata' from the issue list, it is possible to obtain all the issues concerning a/p resource metadata. However, regarding the catalogue search, looking at the catalog that EOX has made for the data, at the moment one can search by tag and keyword (only one label at a time), but (our) idea is to extend this possibility through the use of queries.

Agreed, the current catalog is fairly limited, as it is a static catalog and the search is purely client-side. Our idea is to harvest the static catalog into a tool providing API support, e.g. PyCSW or STAC-FastAPI, that can be used in clients.

Schpidi commented 1 year ago

On links to the model, good question of how to provide this link. Is it safe to say for the present that the links refer to UDFs? Probably needs to be rethought a bit due to adding JupyterHub to rasdaman.

@Schpidi : How would this be done with EOX?

If I understand the question correctly, we'd suggest using the model registry provided by MLflow (https://mlflow.org/docs/latest/model-registry.html).

pebau commented 1 year ago

linking in @ocampos16

KathiSchleidt commented 1 year ago

@Schpidi how does this new proposal for mlflow fit with the original proposal from EOX to use STAC for metadata?

Also, is there a better overview of the MLflow concepts? Before we change anything, we should ensure that it aligns with the requirements in D4.3 processing resource metadata.

Schpidi commented 1 year ago

Sorry @KathiSchleidt, MLflow is just some tooling that is offered. There is no change in using STAC.

KathiSchleidt commented 1 year ago

@Schpidi speaking of tooling, what's the outlook on a human-readable catalog?

Schpidi commented 1 year ago

🤔 What do you mean by a human-readable catalog?

KathiSchleidt commented 1 year ago

@Schpidi Related to Catalog Search Options FAIRiCUBE/FAIRiCUBE-Hub-issue-tracker#10, in addition to exposing the STAC metadata via an API, we need a catalog enabling human users to search the catalog.

As STAC is for Spatio-Temporal Assets, and spatio-temporal queries tend to require intervals as input, we need more than the current mini-catalog with its single search field.
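
To illustrate the interval point: a STAC API item search takes a `bbox` and a `datetime` interval of the form `start/end`, so a human-facing search form would need inputs for at least those. Below is a minimal sketch that builds such a search body; the function names are made up for this example, and the collection id in the usage note is hypothetical.

```python
from datetime import datetime, timezone


def to_rfc3339(dt: datetime) -> str:
    """Format a timezone-aware datetime as an RFC 3339 UTC timestamp."""
    if dt.tzinfo is None:
        raise ValueError("datetime must be timezone-aware")
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_search_body(bbox, start, end, collections=(), limit=10):
    """Build the JSON body for a STAC API item search.

    `bbox` is (west, south, east, north) in degrees. west > east is allowed,
    since a box may cross the antimeridian, so only south/north is checked.
    """
    west, south, east, north = bbox
    if south > north:
        raise ValueError("bbox south must not exceed north")
    if start > end:
        raise ValueError("interval start must not be after its end")
    body = {
        "bbox": [west, south, east, north],
        "datetime": f"{to_rfc3339(start)}/{to_rfc3339(end)}",
        "limit": limit,
    }
    if collections:
        body["collections"] = list(collections)
    return body
```

A form with four coordinate fields and two date pickers would map directly onto this, e.g. `build_search_body((9.0, 46.0, 17.5, 49.5), start, end, collections=["some-collection"])`.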