I think it makes sense to have it for STAC Items as well as Collections.
The use case for STAC Items is that, given an output image with classification results, that image's metadata could be uploaded to a STAC API along with a definition of the STAC ML Model that produced it, for data lineage. This somewhat overlaps with what the processing extension does, but fills in many missing details (its expression object is not sufficient to represent a whole model inference pipeline). An alternative could also be to add an entry to `links` with `derived_from` pointing at a shared STAC ML-model definition, but this makes it slightly more complicated/ambiguous to interpret when `links` already includes `derived_from` to point at the original input image.
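To illustrate the ambiguity, here is a hypothetical `links` fragment (hrefs and titles are made up for this sketch) where both the input image and the model definition use `derived_from`:

```json
{
  "links": [
    {
      "rel": "derived_from",
      "href": "https://example.com/collections/sentinel-2/items/source-scene",
      "type": "application/geo+json",
      "title": "Original input image"
    },
    {
      "rel": "derived_from",
      "href": "https://example.com/models/resnet18.json",
      "type": "application/geo+json",
      "title": "ML model definition"
    }
  ]
}
```

Without the (optional) `title` fields, a client has no reliable way to tell which `derived_from` entry is the model.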
A Collection normally does a `summaries` aggregation of all the variations of the Items it contains, but each Item should carry the definition on its own. The only extra fields I can see ML-model adding beyond summaries/aggregates are details about the set of train/valid/test STAC collections used to generate the model, which does not make much sense for individual STAC Items.
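For reference, a hypothetical Collection fragment showing the kind of `summaries` aggregation meant here (the `mlm:`-prefixed field names are illustrative, not settled):

```json
{
  "type": "Collection",
  "id": "landcover-predictions",
  "summaries": {
    "mlm:name": ["resnet18-landcover"],
    "mlm:tasks": ["classification", "segmentation"]
  }
}
```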
I don't think processing should be removed. There are cases where only a simple arithmetic expression is applied to combine bands or perform simple pixel-wise manipulations. Using ML-model for such cases would be overkill and too verbose compared to the simple format processing offers.
I agree that adding more examples would be the most useful. Otherwise, everything remains very convoluted given the number of extensions combined for this kind of use with ML Model.
> An alternative could also be to add an entry to `links` with `derived_from` pointing at a shared STAC ML-model definition, but this makes it slightly more complicated/ambiguous to interpret when `links` already includes `derived_from` to point at the original input image.
I like this idea! The scenario I'm thinking of is when a user has not one STAC Item but 100,000s of STAC Items. I think it's preferable not to duplicate this relatively complex metadata across many items. Doing so may unnecessarily raise storage and search costs for large collections of items and make viewing an individual STAC Item more cluttered.
Maybe we could standardize that the model metadata `item.json` should be listed as a `derived_from` link and be given a name like `resnet18.json`? That way there is no ambiguity about which link is the ML Model item JSON. This wouldn't guarantee the name to be unique, but I think it is not common that a STAC Item will be associated with multiple models, and if so, these item JSON files can be named differently by the publisher.
> The only extra fields I can see ML-model adding beyond summaries/aggregates are details about the set of train/valid/test STAC collections used to generate the model, which does not make much sense for individual STAC Items.
Since it pertains to model training and validation, do you think the current plan should be to describe the splits in the ML AOI extension? And maybe eventually this is suggested for the ML-model extension to include at the Collection level?
> I don't think processing should be removed. There are cases where only a simple arithmetic expression is applied to combine bands or perform simple pixel-wise manipulations. Using ML-model for such cases would be overkill and too verbose compared to the simple format processing offers.
Agreed that this isn't needed, or at least doesn't need to be recommended, for simple pixel manipulations. My concern is that recommending it at all might be confusing for folks looking to describe the minimal amount of metadata needed to discover and run an ML model.
> I agree that adding more examples would be the most useful. Otherwise, everything remains very convoluted given the number of extensions combined for this kind of use with ML Model.
This was my concern with including the processing extension as the top field in the object table for the Data Object. But I'm fine with recommending it. I'd like to call out in each Object table which fields are required vs. recommended vs. optional/situational, and reorder them so required fields are at the top; does that sound helpful?
The way the metadata is stored can be optimized to avoid duplication between items shared by a collection. This is an implementation detail in my opinion. Each STAC Item should report the information from the API individually because doing a STAC search might yield only partial or overlapping results. The STAC Items returned might not all be from the same collection, or might be accessed without going through the collection first.
I think it is a good idea to suggest providing the derived link as a best practice. I'm not sure if it should be enforced though, since there is no clear way to distinguish the model's link from any other `derived_from` link that could already be there. I don't think there would be multiple models either, but one could provide the original image as another `derived_from` link. Therefore, there must be a clearer field (`roles: ["model"]`?) to indicate it specifically if we want to make it a requirement.
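For example, a hypothetical link carrying such a role (note that `roles` is only defined on Asset Objects in core STAC, so allowing it on links would itself be something this extension defines):

```json
{
  "rel": "derived_from",
  "href": "https://example.com/models/resnet18.json",
  "roles": ["model"],
  "title": "ML model that produced this Item"
}
```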
> Since it pertains to model training and validation, do you think the current plan should be to describe the splits in the ML AOI extension? And maybe eventually this is suggested for the ML-model extension to include at the Collection level?
Yes, I had something like that in mind. I'm already using `ml-aoi:split` for my 3 train/validate/test collections. A STAC Item describing a model with `mlm` would have to indicate the `ml-aoi:split` collection relevant for its training. However, I wouldn't make this a requirement, since some models could have been trained on external data not represented in STAC.
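As a sketch of that setup (the collection ids are made up), three collections each declaring their split via `ml-aoi:split` summaries:

```json
[
  {"type": "Collection", "id": "landcover-train", "summaries": {"ml-aoi:split": ["train"]}},
  {"type": "Collection", "id": "landcover-validate", "summaries": {"ml-aoi:split": ["validate"]}},
  {"type": "Collection", "id": "landcover-test", "summaries": {"ml-aoi:split": ["test"]}}
]
```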
> My concern is that recommending it at all might be confusing for folks looking to describe the minimal amount of metadata needed to discover and run an ML model.
The only fields I thought were interesting were `processing:lineage` and `processing:level = L4`, which are described as:
| Field Name | Type | Description |
| --- | --- | --- |
| `processing:lineage` | string | Lineage Information provided as free text information about how observations were processed or models that were used to create the resource being described (NASA ISO). For example, GRD Post Processing for the "GRD" product of Sentinel-1 satellites. CommonMark 0.29 syntax MAY be used for rich text representation. |

| Level | Description |
| --- | --- |
| L4 | Model output or results from analyses of lower level data (i.e., variables that are not directly measured by the instruments, but are derived from these measurements) |
The others can be ignored since they are optional.
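So a model-output Item might only carry these two fields from the processing extension (a hypothetical fragment; the lineage text is made up):

```json
{
  "properties": {
    "processing:level": "L4",
    "processing:lineage": "Land-cover classification produced by inference with the resnet18-landcover model."
  }
}
```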
> I think it is a good idea to suggest providing the derived link as a best practice. I'm not sure if it should be enforced though, since there is no clear way to distinguish the model's link from any other `derived_from` link that could already be there. I don't think there would be multiple models either, but one could provide the original image as another `derived_from` link. Therefore, there must be a clearer field (`roles: ["model"]`?) to indicate it specifically if we want to make it a requirement.
Good point on the role suggestion. I think introducing something like a new role makes sense here and metadata providers would appreciate a clear requirement. In this PR I introduced a new role for referencing geoparquet and I think we could do something similar here to introduce an ml-model role. https://github.com/stac-utils/stac-geoparquet/blob/7cac0b08c06bff8773a49f7d4dd420ea777d965a/spec/stac-geoparquet-spec.md#referencing-a-stac-geoparquet-collections-in-a-stac-collection-json
However, I don't think this extension should be defining an Asset Object, which has a `roles` field, since Asset Object fields are not searchable.
Instead, if I'm reading this right, we would use a link object and define a new media type? I'm uncertain if that's the right way to reference this though.
I'll make a separate issue for the processing discussion, thanks for your comments!
I think the Asset Object could be sufficient even if it is not searchable. I think the main purpose is to provide data lineage such that one can understand where model predictions and derived data come from. By accessing these references, it is then possible to reconstruct a pipeline of derived products.
I don't think it would be a common use case of someone trying to use this link to search for all derived data from a given model. However, I'm not against adding more metadata either. A standardized relation type would need to be defined to avoid conflicts with other extensions.
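As a sketch of the Asset Object approach (the asset key, role name, and href are made up), the model reference would live alongside the Item's data assets:

```json
{
  "assets": {
    "ml-model-metadata": {
      "href": "https://example.com/models/resnet18.json",
      "type": "application/json",
      "roles": ["ml-model"],
      "title": "Definition of the model that produced these predictions"
    }
  }
}
```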
> I think the Asset Object could be sufficient even if it is not searchable. I think the main purpose is to provide data lineage such that one can understand where model predictions and derived data come from.
I agree that's the main purpose, but there should probably be some flat fields that folks will use to search. I could see folks wanting to search on the enums for tasks and accelerators defined in #2
> I don't think it would be a common use case of someone trying to use this link to search for all derived data from a given model.
I agree here it wouldn't be common but I could see it. I was thinking it would be more common to search for the source data given someone has the STAC model extension metadata. Or to find an ML model given some source data, which would be common in scenarios where a STAC dataset is published specifically as an ML training dataset.
Should we have two relation types? One could be for referring to source data from the model json and the other could be for referring to the model from the source dataset json.
For the model referring to the source, we could use the existing `via` rel type from [IANA](https://www.iana.org/assignments/link-relations/link-relations.xhtml), or invent another, such as `source-dataset`. I'm not sure what the implications are of adding relation types that are not IANA approved.
> Identifies a resource that is the source of the information in the link's context.
For the source referring to the model (in the case of STAC training datasets), we could invent `ml-model`.
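Putting those two directions together as a sketch (all rel names other than `via` are invented here, and the hrefs are made up): the model Item would point at its training data,

```json
{
  "links": [
    { "rel": "via", "href": "https://example.com/collections/landcover-train", "title": "Source training dataset" }
  ]
}
```

and the training dataset's collection JSON would point back at the model:

```json
{
  "links": [
    { "rel": "ml-model", "href": "https://example.com/models/resnet18.json", "title": "Model trained on this dataset" }
  ]
}
```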
> Should we have two relation types? One could be for referring to source data from the model json and the other could be for referring to the model from the source dataset json.
Yes, those definitely need to be distinguished. I believe `derived_from` is ideal for the original image:

> URL to a STAC Item that was used as input data in the creation of this Item.

https://github.com/radiantearth/stac-spec/blob/master/item-spec/item-spec.md#relation-types
For the model, I think simply `ml-model` is more appropriate. If it is listed in the best practices, this is even better. It does not need to be an official IANA link relation. If we want to make it more standard, it could also be something like `rel: "https://stac-extensions.github.io/ml-model/v0.2.0alpha/schema.json#defs/ml-model"` or some other similar naming-authority reference, but I think `ml-model` by itself is more in line with what other STAC extensions use.
If anyone has time and thoughts on how to catalog ML models, I have a WIP rework of the DLM extension.
The PR: https://github.com/crim-ca/dlm-extension/pull/2
The new README detailing the schema: https://hackmd.io/@cHP95b4sTDWQdP7uy1Vv7A/rkneCaru6
My main questions right now are: Should this extension only exist for the collection level?
My take is yes since the ML AOI extension could handle specifying the specific train/val/test splits used to create a model. This inference focused extension could generally refer to the dataset/collection used with the model once, and reduce the redundancy of duplicating ML model information for each item representing a scene in a STAC Collection.
Is it ok to remove the processing extension? My thinking here is yes, this is something that can be included in the collection json, but it doesn't need to be a requirement for the ML Model spec. Maybe we could offer examples of pairing essential ML Model extension fields for search and inference with fields from other extensions like processing that are more specific to the dataset qualities and might be useful for an ML practitioner to know.