earthpulse / eotdl

Earth Observation Training Datasets
https://eotdl.com
MIT License

Data provenance #121

Open juansensio opened 8 months ago

juansensio commented 8 months ago

Posted by @dmoglioni

USER STORY - notebook/ingestion/timeseries/versioning/data provenance

A user writes a script or notebook that generates a dataset as output (e.g. a time series) and then wants to ingest both the data and the versioned source code into EOTDL.

juansensio commented 8 months ago

For Q0 datasets this can already be done.

For Q2+ datasets, we should add a new item type to the specification that allows linking the source code as assets. @fmariv can take a look at this.

Versioning is automatically supported.
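
As a rough sketch of the Q2+ idea above, a STAC Item could carry the generating script as an asset. This uses plain Python dicts rather than any EOTDL or STAC library, and the asset key `source-code`, the role name, the item id, and the file name `generate_timeseries.py` are all hypothetical placeholders, not part of the current EOTDL specification.

```python
import json


def make_source_code_item(item_id, code_href, media_type="text/x-python"):
    """Build a minimal STAC Item (plain dict) whose only asset is a source file.

    Assumption: the asset key and role below are illustrative names only.
    """
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": item_id,
        # A STAC Item may have a null geometry; bbox is then not required.
        "geometry": None,
        "properties": {"datetime": "2024-01-01T00:00:00Z"},
        "links": [],
        "assets": {
            "source-code": {  # hypothetical asset key
                "href": code_href,  # local path or URL to the script/notebook
                "type": media_type,
                "roles": ["source-code"],  # hypothetical role name
                "title": "Script that generated the dataset",
            }
        },
    }


# Hypothetical item id and script path, for illustration only.
item = make_source_code_item("timeseries-generator", "./generate_timeseries.py")
print(json.dumps(item, indent=2))
```

For a notebook, the same helper could be called with a notebook path and a media type such as `application/x-ipynb+json`.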

fmariv commented 8 months ago

This sounds interesting. I think we should consider adding this feature to any STAC object (Catalog, Collection, or Item), depending on the case. A new ml-dataset:provenance field should be added, composed of STAC Link objects that point to the script or notebook, either a local path or a URL. We should also consider adding a new media type for STAC links, such as code or script or something similar. Use cases: