Fully describe the plan for searching and indexing with name-spaced facets (vocabs)

rsmith013 commented 2 years ago

@rhys has added a post-processor. The next step is to work out how those prefixes propagate through the system, to make sure they get aggregated up.

It presents some interesting challenges. Namespacing with <namespace>:facet_name means that the namespace must also be applied to the aggregation and search facets for them to be aggregated. The other source of namespaces is the vocab, e.g. cmip6:source_id is also general:model. To apply this to an asset, you either have to query the vocab server again or need another way to know that cmip6:source_id is also general:model.

One suggestion would be to create the namespaces in this way (this is how it is done in STAC JSON, e.g. eo) and then, in the Elasticsearch output plugin, break this down into:

{
  "properties": {
    "model": {
      "values": ["ACCESS-ESM1-5"],
      "namespace": "general"
    },
    "source_id": {
      "values": ["ACCESS-ESM1-5"],
      "namespace": "cmip6"
    }
  }
}

Maybe this would be easier? It would also allow no-namespace searching.
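
As a rough illustration of the breakdown suggested above, here is a minimal Python sketch. The function name `split_namespaced_facets` and the flat input dictionary are assumptions made for illustration; this is not the actual output plugin code.

```python
from typing import Any, Dict


def split_namespaced_facets(facets: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Turn {"cmip6:source_id": "X"} into {"source_id": {"values": ["X"], "namespace": "cmip6"}}."""
    properties: Dict[str, Dict[str, Any]] = {}
    for name, value in facets.items():
        namespace, sep, facet = name.partition(":")
        if not sep:
            # Without a ":" the whole name is the facet and there is no namespace.
            namespace, facet = "", name
        entry: Dict[str, Any] = {"values": value if isinstance(value, list) else [value]}
        if namespace:
            entry["namespace"] = namespace
        properties[facet] = entry
    return properties


properties = split_namespaced_facets(
    {"cmip6:source_id": "ACCESS-ESM1-5", "general:model": "ACCESS-ESM1-5"}
)
```

Because each facet is keyed by its bare name, a search for `source_id` with no namespace can still find it, and the `namespace` field is there when a namespace-specific search is wanted.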

agstephens commented 2 years ago

@rhysrevans3 will put together a diagram to explain all the interactions with Controlled Vocabs - so that we have all the functional interfaces defined and we can agree on how the system should behave in each case.

rhysrevans3 commented 2 years ago

Vocab look up at Generator stage:

diagram

agstephens commented 2 years ago

What are our search Use Cases?

search("ukesm") OR search("UKESM") OR search("ukes*"):
1. free-text search will be slower because it will check all fields/facets
2. but it will work
3. NOTE: search is case-insensitive
4. NOTE: wild-card "*" should work (or we restrict/ignore it)
search("model=ukesm") OR search("source_id=ukesm"):
1. search ALL namespaces for:
  1. key: "model"
  2. value: "ukesm"
2. sort results (if possible):
  1. with "general" namespace at the top (ideally)
search("cmip6:source_id=ukesm") OR search("general:model=ukesm"):
1. search only the specified namespace, e.g. for "cmip6:source_id=ukesm" search namespace "cmip6" for:
  - key: "source_id"
  - value: "ukesm"
search("cmip6:source_id=ukesm AND general:variable=air_temperature"):
- Build a combined search expression where these conditions are BOTH TRUE:
  - search specific namespace "cmip6" for:
    - key: "source_id"
    - value: "ukesm"
  - search specific namespace "general" for:
    - key: "variable"
    - value: "air_temperature"
search("cmip6:source_id=ukesm AND variable=air_temperature"):
- Build a combined search expression where these conditions are BOTH TRUE:
  - search specific namespace "cmip6" for:
    - key: "source_id"
    - value: "ukesm"
  - search ALL namespaces for:
    - key: "variable"
    - value: "air_temperature"
    - sort results (if possible):
      - with "general" namespace at the top (ideally)
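
The use cases above could be turned into structured conditions before any query is built. The following is a sketch only: the `Condition` type and `parse_search` function are hypothetical names, the expression syntax is assumed to be exactly the `namespace:key=value` form joined by ` AND ` shown above, and case-insensitivity and wildcard handling are left to the search backend.

```python
from typing import List, NamedTuple, Optional


class Condition(NamedTuple):
    namespace: Optional[str]  # None means "search ALL namespaces"
    key: Optional[str]        # None means free-text search over all facets
    value: str


def parse_search(expression: str) -> List[Condition]:
    """Split 'ns:key=value AND key=value' into conditions that must all be true."""
    conditions = []
    for part in expression.split(" AND "):
        part = part.strip()
        if "=" not in part:
            # Free-text search: no key and no namespace.
            conditions.append(Condition(None, None, part))
            continue
        facet, _, value = part.partition("=")
        namespace, sep, key = facet.partition(":")
        if not sep:
            # No "namespace:" prefix, so the key is searched in all namespaces.
            conditions.append(Condition(None, facet, value))
        else:
            conditions.append(Condition(namespace, key, value))
    return conditions


conditions = parse_search("cmip6:source_id=ukesm AND variable=air_temperature")
```

Here `conditions` holds one namespace-specific condition (`cmip6`, `source_id`) and one all-namespace condition (`variable`). Sorting results so that "general" comes first is not covered by this sketch.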

agstephens commented 2 years ago

One caveat: if any part of a vocab changes...

we need to:
1. Re-index all those records, OR,...
2. Create an update tool

agstephens commented 2 years ago

From the STAC point of view, we like this:

{
"properties": {
    "source_id":  "ACCESS-ESM1-5"
  }
}

...but we accept that this might be the reality...

{
"properties": {
    "source_id":  ["ACCESS-ESM1-5"]
  }
}
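
If both shapes can occur, any code that reads a facet would need to accept a single string or a list. A minimal sketch, assuming hypothetical helper names `as_list` and `facet_matches`:

```python
from typing import Any, Dict, List


def as_list(value: Any) -> List[Any]:
    """Return value unchanged if it is already a list, otherwise wrap it in one."""
    return value if isinstance(value, list) else [value]


def facet_matches(properties: Dict[str, Any], key: str, wanted: str) -> bool:
    """True if `wanted` is the facet value, or one of the facet values."""
    if key not in properties:
        return False
    return wanted in as_list(properties[key])


single = {"source_id": "ACCESS-ESM1-5"}
multiple = {"source_id": ["ACCESS-ESM1-5"]}
same_answer = facet_matches(single, "source_id", "ACCESS-ESM1-5") == facet_matches(
    multiple, "source_id", "ACCESS-ESM1-5"
)
```

Wrapping the scalar before testing membership avoids an accidental substring match on the string form.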

agstephens commented 2 years ago

In terms of the Elasticsearch content, here are some candidate representations:

1. Dictionary per facet that includes namespace:

- looks sensible
- value is duplicated
- would allow simple search using mapped or namespace-specific search terms

{
  "properties": {
    "model": {
      "values": ["ACCESS-ESM1-5"],
      "namespace": "general"
    },
    "source_id": {
      "values": ["ACCESS-ESM1-5"],
      "namespace": "cmip6"
    }
  }
}

2. Single dictionary per facet that includes both the namespace and the general term:

- more compact
- would require a more complex ES query to separate out the namespace

{
  "properties": {
    "source_id": {
      "values": ["ACCESS-ESM1-5"],
      "namespace": "cmip6",
      "general_term": "model"
    }
  }
}

3. Two separate dictionaries of properties - (1) general, (2) specific to namespace

STAC would only be fed the namespace-specific properties

{
"properties": {
    "source_id": "ACCESS-ESM1-5",
    "namespace": "cmip6"
  },
"general_properties": {
    "model": "ACCESS-ESM1-5"
  }
}

agstephens commented 2 years ago

@rhysrevans3, have a play with the above, maybe (1) and (3) look like the best candidates. Might (3) be fastest?

rhysrevans3 commented 2 years ago

Separate dictionaries nested within properties:
- benefit of two separate dictionaries
- allows for more than one vocab namespace to be used
- still have value duplication

{
  "properties": {
    "cmip6": {
      "source_id": "ACCESS-ESM1-5"
    },
    "general": {
      "model": "ACCESS-ESM1-5"
    }
  }
}
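
To show how this nested layout could be produced from namespaced facets, here is a small sketch. `GENERAL_TERMS` is a hypothetical hard-coded mapping standing in for whatever the vocab server would return, `nest_by_namespace` is an illustrative name, and every input key is assumed to carry a `namespace:` prefix.

```python
from typing import Dict, Tuple

# Hypothetical mapping; in practice this would come from the vocab lookup.
GENERAL_TERMS: Dict[Tuple[str, str], str] = {("cmip6", "source_id"): "model"}


def nest_by_namespace(facets: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Group 'namespace:key' facets under their namespace and add the general equivalents."""
    properties: Dict[str, Dict[str, str]] = {}
    for name, value in facets.items():
        namespace, _, key = name.partition(":")
        properties.setdefault(namespace, {})[key] = value
        general_key = GENERAL_TERMS.get((namespace, key))
        if general_key is not None:
            # This is the value duplication noted above.
            properties.setdefault("general", {})[general_key] = value
    return properties


nested = nest_by_namespace({"cmip6:source_id": "ACCESS-ESM1-5"})
```

Adding a second vocab only adds another top-level key under `properties`, which is why this layout copes with more than one namespace.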

rhysrevans3 commented 2 years ago

Could we need multiple vocabs for a single item? How do we deal with CEDA-specific facets?

agstephens commented 2 years ago

There might also be content indexed/queried that is NOT in any vocabulary (yes, we allow that):

E.g.:

{
  "properties": {
    "nueron_profile": {
      "values": ["salient"]
    }
  }
}

rhysrevans3 commented 2 years ago

From my query testing:

1. Maybe the easiest to index? Can complete all search use cases but requires AND/OR more vocab search.
2. Becomes very complex for cross-vocab facet search.
3. Becomes complex when content is not in any vocab or multiple vocabs are needed.
4. Easiest to search. Would require an extra dummy vocab no_vocab for content not in vocabs:

{
  "properties": {
    "cmip6": {
      "source_id": "ACCESS-ESM1-5"
    },
    "general": {
      "model": "ACCESS-ESM1-5"
    },
    "no_vocab": {
      "nueron_profile": "salient"
    }
  }
}

agstephens commented 2 years ago

vocab-unspecified

rhysrevans3 commented 2 years ago

Example:

{
  "properties": {
    "cmip6": {
      "source_id": "ACCESS-ESM1-5",
      "source_type": "AOGCM"
    },
    "general": {
      "model": "ACCESS-ESM1-5",
      "model_type": "AOGCM"
    },
    "vocab-unspecified": {
      "nueron_profile": "salient",
      "ice_cream_flavour": "rocky road"
    }
  }
}
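
A sketch of how facets might be routed into this structure, with anything not found in a vocab falling through to "vocab-unspecified". The `vocabs` argument (vocab name mapped to the set of facet keys it defines) and the function name `group_by_vocab` are assumptions for illustration, not an existing interface.

```python
from typing import Dict, Set

FALLBACK = "vocab-unspecified"


def group_by_vocab(
    facets: Dict[str, str], vocabs: Dict[str, Set[str]]
) -> Dict[str, Dict[str, str]]:
    """Place each facet under every vocab that defines its key, or under the fallback."""
    properties: Dict[str, Dict[str, str]] = {}
    for key, value in facets.items():
        # A key may be defined by several vocabs; if none define it, use the fallback.
        owners = [name for name, keys in vocabs.items() if key in keys] or [FALLBACK]
        for owner in owners:
            properties.setdefault(owner, {})[key] = value
    return properties


vocabs = {"cmip6": {"source_id", "source_type"}, "general": {"model", "model_type"}}
grouped = group_by_vocab(
    {"source_id": "ACCESS-ESM1-5", "model": "ACCESS-ESM1-5", "ice_cream_flavour": "rocky road"},
    vocabs,
)
```

With the fallback in place, every facet sits at the same depth (`properties.<namespace>.<key>`), so searches do not need a special case for content outside any vocab.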

agstephens commented 2 years ago

Here's a whole big example:

{
    "properties": {
        "cmip6": {
            "mip_era": "CMIP6",
            "activity_id": "HighResMIP",
            "institution_id": "MOHC",
            "source_id": "HadGEM3-GC31-HH",
            "experiment_id": "hist-1950",
            "member_id": "r1i1p1f1",
            "table_id": "Amon",
            "variable_id": "tas",
            "grid_label": "gn",
            "version": "v20180418"
        },
        "general": {
            "institute": "MOHC",
            "model": "HadGEM3-GC31-HH",
            "experiment": "hist-1950",
            "ensemble_member": "r1i1p1f1",
            "frequency": "Amon"
        },
        "vocab-unspecified": {
            "nominal_resolution": "10 km",
            "parent_activity_id": "HighResMIP",
            "parent_experiment_id": "spinup-1950"
        }
    }
}
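
Against a document shaped like this, the earlier search use cases could map onto Elasticsearch query DSL roughly as follows. This is a sketch that only builds the request body as a plain dictionary. The helper names are hypothetical, the field paths assume the nested layout is indexed as ordinary object fields, and ranking "general" matches first is not attempted.

```python
from typing import Any, Dict, Optional


def facet_clause(key: str, value: str, namespace: Optional[str] = None) -> Dict[str, Any]:
    """Build one query clause for key=value, optionally restricted to a namespace."""
    if namespace is not None:
        # Namespace-specific search: one concrete field path.
        return {"match": {f"properties.{namespace}.{key}": value}}
    # No namespace given: use a field wildcard to cover every namespace.
    return {"multi_match": {"query": value, "fields": [f"properties.*.{key}"]}}


def build_query(*clauses: Dict[str, Any]) -> Dict[str, Any]:
    """Combine clauses so that ALL of them must match."""
    return {"query": {"bool": {"must": list(clauses)}}}


# search("cmip6:source_id=HadGEM3-GC31-HH AND experiment=hist-1950")
body = build_query(
    facet_clause("source_id", "HadGEM3-GC31-HH", namespace="cmip6"),
    facet_clause("experiment", "hist-1950"),
)
```

Whether `match` behaves case-insensitively depends on how the fields are mapped and analysed, so that part of the use cases would need to be settled in the index mapping rather than in this query builder.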

agstephens commented 1 year ago

Reopening as reference only.

agstephens commented 4 months ago

Some useful feedback from our friends at the Met Office:

The terms in a CV (controlled vocabulary) should also have a unique identifier that is stored somewhere in the STAC record:
- to act as a lookup for deeper metadata
- to ensure long-term consistency - even if the actual text term changes

cedadev / search-futures

Fully describe the plan for searching and indexing with name-spaced facets (vocabs) #107
