Open rsmith013 opened 2 years ago
@rhysrevans3 will put together a diagram to explain all the interactions with Controlled Vocabs - so that we have all the functional interfaces defined and we can agree on how the system should behave in each case.
What are our search Use Cases?
search("ukesm") OR search("UKESM") OR search("ukes*")
search("model=ukesm") OR search("source_id=ukesm")
search("cmip6:source_id=ukesm") OR search("general:model=ukesm")
search("cmip6:source_id=ukesm AND general:variable=air_temperature")
search("cmip6:source_id=ukesm AND variable=air_temperature")
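To make the use cases above concrete, here is a minimal sketch of how such query strings could be split into namespace, facet and value. The grammar is an assumption (clauses joined by " AND ", an optional `namespace:` prefix, `facet=value`); `Term` and `parse_query` are hypothetical names, not part of any existing search API, and OR and wildcard handling are left out.

```python
from typing import NamedTuple, Optional


class Term(NamedTuple):
    namespace: Optional[str]  # e.g. "cmip6"; None when no namespace is given
    facet: Optional[str]      # e.g. "source_id"; None for a free-text term
    value: str


def parse_query(query: str) -> list[Term]:
    """Parse clauses joined by ' AND ' (assumed grammar, see lead-in)."""
    terms = []
    for clause in query.split(" AND "):
        clause = clause.strip()
        if "=" not in clause:
            # Free-text search such as "ukesm": no facet, no namespace.
            terms.append(Term(None, None, clause))
            continue
        key, value = clause.split("=", 1)
        # "cmip6:source_id" -> ("cmip6", ":", "source_id")
        # "source_id"       -> ("", "", "source_id")
        namespace, _, facet = key.rpartition(":")
        terms.append(Term(namespace or None, facet, value))
    return terms
```

A term with `namespace=None` would then be the "search across all vocabs" case, and a term with `facet=None` the plain free-text case.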
One caveat: if any part of a vocab changes...
From the STAC point of view, we like this:
{
    "properties": {
        "source_id": "ACCESS-ESM1-5"
    }
}
...but we accept that this might be the reality...
{
    "properties": {
        "source_id": ["ACCESS-ESM1-5"]
    }
}
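If both shapes can turn up, one option is to normalise on ingest so that everything downstream only sees lists. A minimal sketch, assuming properties arrive as a plain dict; `as_list` and `normalise_properties` are hypothetical helper names:

```python
def as_list(value):
    """Return a facet value as a list, wrapping a scalar if needed."""
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def normalise_properties(properties: dict) -> dict:
    """Make every facet value a list so both shapes are handled the same way."""
    return {facet: as_list(value) for facet, value in properties.items()}
```

Going the other way (list to scalar) would lose information whenever a facet really has more than one value, which is why this sketch normalises towards lists.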
In terms of the Elasticsearch content, here are some candidate representations:
1. Dictionary per facet that includes namespace
{
    "properties": {
        "model": {
            "values": ["ACCESS-ESM1-5"],
            "namespace": "general"
        },
        "source_id": {
            "values": ["ACCESS-ESM1-5"],
            "namespace": "cmip6"
        }
    }
}
2. Dictionary per facet that includes namespace and the equivalent general term
{
    "properties": {
        "source_id": {
            "values": ["ACCESS-ESM1-5"],
            "namespace": "cmip6",
            "general_term": "model"
        }
    }
}
3. Two separate dictionaries of properties - (1) general, (2) specific to namespace
{
    "properties": {
        "source_id": "ACCESS-ESM1-5",
        "namespace": "cmip6"
    },
    "general_properties": {
        "model": "ACCESS-ESM1-5"
    }
}
@rhysrevans3, have a play with the above, maybe (1) and (3) look like the best candidates. Might (3) be fastest?
{
    "properties": {
        "cmip6": {
            "source_id": "ACCESS-ESM1-5"
        },
        "general": {
            "model": "ACCESS-ESM1-5"
        }
    }
}
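With properties nested under a namespace key like this, a namespaced AND search maps fairly directly onto dotted field paths in the Elasticsearch query DSL. A rough sketch, assuming the facet fields are mapped as `keyword` (so that `term` does an exact match) and that the index stores the document as shown; `build_query` is a hypothetical helper, not something that exists in the codebase:

```python
def build_query(terms):
    """Build a bool/filter query from (namespace, facet, value) tuples.

    All terms are ANDed together. Field paths follow the nested layout
    above, e.g. "properties.cmip6.source_id".
    """
    filters = [
        {"term": {f"properties.{namespace}.{facet}": value}}
        for namespace, facet, value in terms
    ]
    return {"query": {"bool": {"filter": filters}}}


body = build_query([
    ("cmip6", "source_id", "ACCESS-ESM1-5"),
    ("general", "model", "ACCESS-ESM1-5"),
])
```

`body` can then be passed as the request body of a search. An OR search would use `should` clauses in place of `filter`.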
Might we need multiple vocabs for a single item? How do we deal with CEDA-specific facets?
There might also be content indexed/queried that is NOT in any vocabulary (yes, we allow that):
E.g.:
{
    "properties": {
        "nueron_profile": {
            "values": ["salient"]
        }
    }
}
From my query testing:
AND/OR
more vocab search.
no_vocab for content not in vocabs:
{
    "properties": {
        "cmip6": {
            "source_id": "ACCESS-ESM1-5"
        },
        "general": {
            "model": "ACCESS-ESM1-5"
        },
        "no_vocab": {
            "nueron_profile": "salient"
        }
    }
}
vocab-unspecified
Example:
{
    "properties": {
        "cmip6": {
            "source_id": "ACCESS-ESM1-5",
            "source_type": "AOGCM"
        },
        "general": {
            "model": "ACCESS-ESM1-5",
            "model_type": "AOGCM"
        },
        "vocab-unspecified": {
            "nueron_profile": "salient",
            "ice_cream_flavour": "rocky road"
        }
    }
}
Here's a whole big example:
{
    "properties": {
        "cmip6": {
            "mip_era": "CMIP6",
            "activity_id": "HighResMIP",
            "institution_id": "MOHC",
            "source_id": "HadGEM3-GC31-HH",
            "experiment_id": "hist-1950",
            "member_id": "r1i1p1f1",
            "table_id": "Amon",
            "variable_id": "tas",
            "grid_label": "gn",
            "version": "v20180418"
        },
        "general": {
            "institute": "MOHC",
            "model": "HadGEM3-GC31-HH",
            "experiment": "hist-1950",
            "ensemble_member": "r1i1p1f1",
            "frequency": "Amon"
        },
        "vocab-unspecified": {
            "nominal_resolution": "10 km",
            "parent_activity_id": "HighResMIP",
            "parent_experiment_id": "spinup-1950"
        }
    }
}
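For illustration, a structure like the one above could be produced from a flat set of extracted facets roughly as follows. This is only a sketch: `CMIP6_FACETS` and `CMIP6_TO_GENERAL` are hard-coded stand-ins (read off the example) for whatever the vocab server would return, and `namespace_properties` is a hypothetical function name.

```python
# Hypothetical stand-ins for a vocab lookup, copied from the example above.
CMIP6_FACETS = {
    "mip_era", "activity_id", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "variable_id", "grid_label", "version",
}
CMIP6_TO_GENERAL = {
    "institution_id": "institute",
    "source_id": "model",
    "experiment_id": "experiment",
    "member_id": "ensemble_member",
    "table_id": "frequency",
}


def namespace_properties(flat: dict) -> dict:
    """Sort flat facets into cmip6 / general / vocab-unspecified buckets."""
    props = {"cmip6": {}, "general": {}, "vocab-unspecified": {}}
    for facet, value in flat.items():
        if facet in CMIP6_FACETS:
            props["cmip6"][facet] = value
            general = CMIP6_TO_GENERAL.get(facet)
            if general is not None:
                # Duplicate the value under its general term as well.
                props["general"][general] = value
        else:
            # Anything the vocab does not know about is still indexed.
            props["vocab-unspecified"][facet] = value
    return {"properties": props}
```

Facets with no general equivalent in the table (e.g. `grid_label`) only appear under `cmip6`, as in the example.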
Reopening as reference only.
Some useful feedback from our friends at the Met Office:
@rhys has added a post-processor. The next step is to work out how those prefixes propagate through the system so that they get aggregated up.
It presents some interesting challenges. Namespacing using <namespace>:facet_name means that you will need to apply that namespace to the aggregation and search facets as well in order for them to be aggregated. The other source of namespaces is the vocab. E.g. cmip6:source_id is also general:model. Applying this to an asset, either you have to query the vocab server again or you need another way to know that cmip6:source_id is also general:model.
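One "other way" might be to carry a small alias table alongside the prefixed facets, so that aggregation can expand them without a second vocab query. A sketch under that assumption; `ALIASES` is a hypothetical table (for example cached once from the vocab server), and `expand_aliases` and `aggregate` are illustrative names only:

```python
from collections import Counter, defaultdict

# Hypothetical alias table: prefixed facet -> equivalent prefixed facet.
ALIASES = {"cmip6:source_id": "general:model"}


def expand_aliases(properties: dict) -> dict:
    """Add the aliased facet for every prefixed facet that has one."""
    expanded = dict(properties)
    for key, value in properties.items():
        alias = ALIASES.get(key)
        if alias is not None:
            # Do not overwrite a value that was set explicitly.
            expanded.setdefault(alias, value)
    return expanded


def aggregate(assets: list[dict]) -> dict:
    """Count values per prefixed facet across assets, aliases included."""
    counts = defaultdict(Counter)
    for asset in assets:
        for key, value in expand_aliases(asset).items():
            counts[key][value] += 1
    return {key: dict(counter) for key, counter in counts.items()}
```

The trade-off is that the alias table has to be kept in step with the vocab, which ties back to the earlier caveat about vocab changes.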
One suggestion would be creating the namespaces in this way (this is how it is done in STAC JSON e.g. eo) but then, in the elasticsearch output plugin, you break this down into:
Maybe this would be easier? It would also allow no-namespace searching.