Open barbuz opened 1 week ago
For the initial PoC we can implement a Python function that, given an S3 path, returns the id of the collection the file is part of. We will implement this with simple pattern matching for the few collections that we want to include in the initial index. Building the index will involve listing an S3 prefix and using this function to determine, for each object, which collection it is part of.
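A minimal sketch of what this could look like, to make the idea concrete. The bucket name, prefixes, and collection ids in `COLLECTION_PATTERNS` are placeholders, and the function names (`collection_for_path`, `build_index`, `iter_s3_paths`) are hypothetical. The listing helper assumes `boto3` is available and uses its `list_objects_v2` paginator.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Optional

# Placeholder patterns: the real prefixes and collection ids are still to be
# decided. The first matching pattern wins.
COLLECTION_PATTERNS = [
    (re.compile(r"^s3://example-bucket/sentinel-2/l2a/.+\.tif$"), "sentinel-2-l2a"),
    (re.compile(r"^s3://example-bucket/era5/.+\.nc$"), "era5"),
]


def collection_for_path(s3_path: str) -> Optional[str]:
    """Return the collection id for an S3 path, or None if no pattern matches."""
    for pattern, collection_id in COLLECTION_PATTERNS:
        if pattern.match(s3_path):
            return collection_id
    return None


def build_index(s3_paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group paths by collection id, skipping paths that match no collection."""
    index: Dict[str, List[str]] = defaultdict(list)
    for path in s3_paths:
        collection_id = collection_for_path(path)
        if collection_id is not None:
            index[collection_id].append(path)
    return dict(index)


def iter_s3_paths(bucket: str, prefix: str) -> Iterator[str]:
    """Yield s3:// paths for every object under a prefix (requires boto3)."""
    import boto3  # third-party; imported lazily so the matching logic has no dependency

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield f"s3://{bucket}/{obj['Key']}"
```

With this shape, building the index for one prefix would be `build_index(iter_s3_paths("example-bucket", "era5/"))`. Keeping the matching function independent of the listing step means it can be tested without S3 access.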
Once the data index is integrated with the ingestion pipelines, each pipeline should know which collection the data it is ingesting belongs to.
For the initial indexing, however, we need to rely on some other source of information about which files belong to which collections.
It would be useful to find the right queries to:
Queries 1 and 2 can then be used to build a first data index MVP over a small set of collections. Adding Queries 3/4 will allow us to handle more complex situations where a data file is part of multiple collections. In the STAC catalog, this will be managed by linking the item describing the data to a single main collection (as the `collection` field only allows a single value), while possibly linking to the same item from multiple collections.
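As a rough illustration of the multi-collection case, here is a sketch using plain dictionaries. The item and collection ids, the hrefs, and the helpers `make_item` and `link_item` are all hypothetical, and the dictionaries are abbreviated (a valid STAC item also needs fields such as `stac_version`, `geometry`, and `properties`). The only parts taken from STAC itself are the single-valued `collection` field on the item and the `collection` / `item` link relations.

```python
from typing import Any, Dict


def make_item(item_id: str, main_collection: str, data_href: str) -> Dict[str, Any]:
    """Build an abbreviated item that belongs to exactly one main collection."""
    return {
        "type": "Feature",
        "id": item_id,
        # The collection field holds a single id, so only the main collection goes here.
        "collection": main_collection,
        "links": [
            {"rel": "collection", "href": f"../{main_collection}/collection.json"},
        ],
        "assets": {"data": {"href": data_href}},
    }


def link_item(collection: Dict[str, Any], item_href: str) -> None:
    """Add an 'item' link from a collection to an item, avoiding duplicates."""
    links = collection.setdefault("links", [])
    link = {"rel": "item", "href": item_href}
    if link not in links:
        links.append(link)


# Hypothetical example: one file that is part of two collections.
main = {"type": "Collection", "id": "main-collection"}
secondary = {"type": "Collection", "id": "secondary-collection"}

item = make_item("file-001", "main-collection", "s3://example-bucket/path/file-001.nc")
item_href = "../main-collection/file-001.json"

# Both collections link to the same item; the item itself names only the main one.
link_item(main, item_href)
link_item(secondary, item_href)
```

The item is stored once, under its main collection, and any additional collection simply carries an extra `item` link to the same href.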