How to manage "all the other data" in STAC?

agstephens commented 2 years ago

What is "all the other data"?

There will always be the "long tail" of unstructured/uncatalogued data that we need to put into a big bag of "all the other data" in our STAC catalogue.

From now on, we'll call "all the other data" ATOD.

How to manage ATOD?

We need a method of managing "all the other data" that is useful and consistent with our main approach. This prompts the questions:

What is the "collection" for ATOD?
What is an "item" for ATOD?

NOTE: we can assume all "assets" are files in line with the known/managed collections.

Options for ATOD

There are two main options that I can see for managing ATOD:

OPTION 1 - All data is in a "ceda-general" collection:
- each Item is at the level of /badc/<item-id> or /neodc/<item-id>
OPTION 2 - Create a collection for each top-level directory of the form /badc/<collection-id> or /neodc/<collection-id> . Then refine these over time to move up the directory tree. There is 1 item per collection.

After some discussion: we prefer OPTION 1 - it is simple.
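
To make the difference between the two options concrete, here is a minimal sketch of how a file path might map to collection and item IDs under each. The function name `assign_ids` and the constant `ARCHIVE_ROOTS` are hypothetical, and reusing the collection ID as the single item ID under OPTION 2 is an assumption (the text above only says there is 1 item per collection).

```python
from pathlib import PurePosixPath

# Top-level archive roots named in the options above.
ARCHIVE_ROOTS = {"badc", "neodc"}


def assign_ids(path: str, option: int) -> dict:
    """Map an archive path to hypothetical collection/item IDs for OPTION 1 or 2."""
    parts = PurePosixPath(path).parts  # e.g. ('/', 'badc', 'deposited2021', ...)
    if len(parts) < 3 or parts[0] != "/" or parts[1] not in ARCHIVE_ROOTS:
        raise ValueError(f"not under /badc or /neodc: {path}")
    top_level = parts[2]
    if option == 1:
        # OPTION 1: one shared collection, one item per top-level directory.
        return {"collection": "ceda-general", "item": top_level}
    if option == 2:
        # OPTION 2: one collection per top-level directory.
        # Assumption: the single item simply reuses the collection ID.
        return {"collection": top_level, "item": top_level}
    raise ValueError("option must be 1 or 2")
```

Under OPTION 1 only the item ID varies between paths, which is part of why it is the simpler choice.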

How would OPTION 1 work?

For all of ATOD, we do not know the facets. Hence we actually have a single collection description file.

Solution 1: Regex

The description includes a regex that captures every possible directory level so that each one is indexed. However, we do not know the facet names for each directory, so we record them as _dir1, _dir2, ..., _dir<n>. We index them so that free-text search can match them, even though there will be no specific faceted search for the ATOD collection.

The regular expression would look something like:

^((?P<_dir1>(\w+))/)((?P<_dir2>(\w+))/)?((?P<_dir3>(\w+))/)?((?P<_dir4>(\w+))/)?((?P<_dir5>(\w+))/)?((?P<_dir6>(\w+))/)?((?P<_dir7>(\w+))/)?((?P<_dir8>(\w+))/)?((?P<_dir9>(\w+))/)?((?P<_dir10>(\w+))/)?((?P<_dir11>(\w+))/)?((?P<_dir12>(\w+))/)?((?P<_dir13>(\w+))/)?((?P<_dir14>(\w+))/)?(?P<filename>([\w.]+))$

This will capture a set of groups, many of which will be None and can be ignored (not indexed) by the scanning code. The pattern assumes that no file in the archive sits more than 14 directories deep; this is just a guess.
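
A minimal sketch of how the scanning code might apply such a pattern and drop the empty groups. It makes several assumptions: the pattern is built programmatically rather than written out by hand; each optional directory group uses `?` so that it captures at most one directory; the filename group allows dots so that extensions such as `.nc` match; and the input is the path relative to the collection directory. The names `MAX_DIRS`, `PATTERN` and `extract_with_regex` are hypothetical.

```python
import re

MAX_DIRS = 14  # the guessed maximum directory depth from the text

# _dir1 is required; _dir2.._dir14 are optional ("?"), so each named group
# captures at most one directory level.
_optional = "".join(rf"(?:(?P<_dir{i}>\w+)/)?" for i in range(2, MAX_DIRS + 1))
PATTERN = re.compile(rf"^(?P<_dir1>\w+)/{_optional}(?P<filename>[\w.]+)$")


def extract_with_regex(relative_path: str) -> dict:
    """Return the named groups that matched, skipping those left as None."""
    match = PATTERN.match(relative_path)
    if match is None:
        return {}
    return {name: value for name, value in match.groupdict().items() if value is not None}
```

Note that with this sketch a path deeper than `MAX_DIRS` directories, or one containing characters outside `\w` in a directory name (a hyphen, for example), does not match at all and yields an empty dictionary.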

Solution 2: extraction method

We could, alternatively, just use a simple extraction method that chopped up the path and created an equivalent dictionary of properties.

We prefer this option.

An example

Let's take an example file:

/badc/deposited2021/adverse_met_scenarios_electricity/data/short_duration/wind_ramping/offshore_south_gb/1_hour_window/1_in_100_years/event2/wind_ramping_offshore_south_gb_1_hour_window_1_in_100_years_event2_windspeed.nc

In this case, we would start with:

collection: deposited2021

After scanning, the asset record would contain:

_dir1: adverse_met_scenarios_electricity
_dir2: data
_dir3: short_duration
_dir4: wind_ramping
_dir5: offshore_south_gb
_dir6: 1_hour_window
_dir7: 1_in_100_years
_dir8: event2
filename: wind_ramping_offshore_south_gb_1_hour_window_1_in_100_years_event2_windspeed.nc

And any free-text search would be able to match contents like "wind_ramping" or "1_hour_window".
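
The extraction method (Solution 2) could be sketched as below, using `pathlib` to split the path. This is only an illustration: the function name `extract_path_properties` and its `collection_root` argument are hypothetical, and it assumes the scanner already knows which directory the collection starts at.

```python
from pathlib import PurePosixPath


def extract_path_properties(path: str, collection_root: str) -> dict:
    """Split the path below collection_root into _dir<n> properties plus filename."""
    relative = PurePosixPath(path).relative_to(collection_root)
    *directories, filename = relative.parts
    properties = {f"_dir{n}": name for n, name in enumerate(directories, start=1)}
    properties["filename"] = filename
    return properties


example = (
    "/badc/deposited2021/adverse_met_scenarios_electricity/data/short_duration/"
    "wind_ramping/offshore_south_gb/1_hour_window/1_in_100_years/event2/"
    "wind_ramping_offshore_south_gb_1_hour_window_1_in_100_years_event2_windspeed.nc"
)
record = extract_path_properties(example, "/badc/deposited2021")
```

Here `record` holds `_dir1` to `_dir8` plus `filename`, with no need to guess a maximum depth and no empty fields to filter out.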

The real ambition

The ambition is to gradually reduce the amount of data within ATOD so that it is only really sweeping up the long tail of small datasets. Following OPTION 2 allows us to move towards this goal over time.

agstephens commented 2 years ago

@rhysrevans3 and @Mahir-Sparkess: Please review the above and let me know if it makes sense.

gap736uk commented 2 years ago

@agstephens - could we use MOLES to give you the jumping-off point into the archive for each collection? This could provide a more appropriate split point, and could also provide some tags to use as search facets, given that we have hand-curated content at that level that you would have no way of gleaning from any other approach.

e.g. I've done a view in the past to give a rough split of datasets by data type (aircraft, model output, ground-based obs, etc.). We also have a list of known paths in the archive that are not catalogued at all, and thus would be a good round-up of items to be swept into the remaining 'ATOD' bucket(s).

agstephens commented 2 years ago

Thanks @gap736uk, I suggest that we do it this way:

1. Scan with existing collection descriptions and a giant ATOD - this is really a scalability/performance test and a proof-of-concept that we can scan everything.
2. Use MOLES, sci-support-team and other bits of domain knowledge to do the next level of creating useful collections.

So, short answer: YES, but after scale-test has worked.

rhysrevans3 commented 2 years ago

I don't think I see a big difference between the two options. Option 2 may be preferable if the collections are going to retain these names?

Also, you could make a specific extraction method to extract the terms (possibly using the parts attribute from pathlib), as this would mean you don't have to guess the number of parts or carry multiple None fields.
