aodn / data-index

GNU General Public License v3.0

Find out how to query for links between collections and files #1

Open barbuz opened 1 week ago

barbuz commented 1 week ago

Once the data index is integrated with the ingestion pipelines, each pipeline should know which collection the data it ingests belongs to.

For the initial indexing, however, we need some other source of information about which files belong to which collections.

It would be useful to find the right queries to:

  1. List all files for a specific collection
  2. Get title and description of a collection
  3. Get the child collections of a specific collection
  4. (If possible) get the main collection a file is a part of

Queries 1 and 2 can then be used to build a first data index MVP over a small set of collections. Adding queries 3 and 4 will let us handle more complex situations where a data file is part of multiple collections. In the STAC catalog this will be handled by linking the item describing the data to a single main collection (the collection field only allows a single value), while possibly linking to the same item from multiple collections.
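To make the linking scheme concrete, here is a rough sketch using plain Python dicts. The ids and hrefs are hypothetical, and the item is trimmed to the fields that matter here (a valid STAC item needs more, such as geometry, properties and assets). The `collections_linking_to` helper is only an illustration, not part of any STAC library.

```python
# One item with a single main collection, referenced from several collections.
# All ids and hrefs below are made up for illustration.
ITEM_HREF = "items/example-file.json"

item = {
    "type": "Feature",
    "id": "example-file",
    "collection": "main-collection",  # the field holds exactly one collection id
    "links": [
        {"rel": "collection", "href": "collections/main-collection.json"},
    ],
}

collections = [
    {"id": "main-collection", "links": [{"rel": "item", "href": ITEM_HREF}]},
    {"id": "secondary-collection", "links": [{"rel": "item", "href": ITEM_HREF}]},
    {"id": "unrelated-collection", "links": []},
]


def collections_linking_to(item_href, collections):
    """Return the ids of all collections that have an 'item' link to item_href."""
    return [
        collection["id"]
        for collection in collections
        if any(
            link["rel"] == "item" and link["href"] == item_href
            for link in collection["links"]
        )
    ]


linked = collections_linking_to(ITEM_HREF, collections)
```

Here the item names one main collection, while both `main-collection` and `secondary-collection` carry a link to it.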

barbuz commented 3 days ago

For the initial PoC we can implement a Python function that, given an S3 path, returns the id of the collection the file is part of. We will implement this with simple pattern matching for the few collections we want to include in the initial index. Building the index will then involve listing an S3 prefix and calling this function on each object to find out which collection it belongs to.
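A minimal sketch of what this could look like. The patterns and collection ids are placeholders, not the real prefixes or ids, and the function names are only suggestions. The grouping helper takes any iterable of keys, so the S3 listing itself (for example via a boto3 `list_objects_v2` paginator) is left out.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Optional

# Hypothetical (pattern, collection id) pairs; the real ones would be written
# for the few collections chosen for the initial index.
COLLECTION_PATTERNS = [
    (re.compile(r"^example-project/mooring/.+\.nc$"), "example-mooring"),
    (re.compile(r"^example-project/glider/.+\.nc$"), "example-glider"),
]


def collection_for_path(s3_path: str) -> Optional[str]:
    """Return the id of the collection an S3 object belongs to, or None."""
    # Accept both "s3://bucket/key" URIs and bare object keys.
    key = re.sub(r"^s3://[^/]+/", "", s3_path)
    for pattern, collection_id in COLLECTION_PATTERNS:
        if pattern.match(key):
            return collection_id
    return None


def group_by_collection(keys: Iterable[str]) -> Dict[str, List[str]]:
    """Group object keys by collection id, skipping keys that match no pattern."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for key in keys:
        collection_id = collection_for_path(key)
        if collection_id is not None:
            grouped[collection_id].append(key)
    return dict(grouped)


example_keys = [
    "example-project/mooring/2020/file_a.nc",
    "example-project/glider/file_b.nc",
    "example-project/other/readme.txt",
]
grouped = group_by_collection(example_keys)
```

The first matching pattern wins, so if two patterns could overlap, the order of the list decides which collection is treated as the main one.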