Restructure plan for item-descriptions

agstephens commented 2 years ago

General concept for restructuring item descriptions

What should we call them?

Better name: Collection Descriptions

Structure of each document

Representing facets in the document hierarchy

We are proposing a new approach:

Asset facets will appear in Assets:
- facets captured from data (typically files) [0..*]
  - count: largest set of facets
- facets calculated from other sources [0..*]
- default facets: set in the configuration [0..*]
Item facets will appear in Items:
- facets aggregated from Assets [0..*]:
  - default: all Asset facets
  - count: intermediate set of facets
  - excludes: can specify a list of Asset facets to ignore at the Item level
- facets calculated from other sources [0..*]
- default facets: set in the configuration [0..*]
Collection facets (properties) will appear in Collection records, and include:
- facets aggregated from Items [0..*]:
  - count: smallest set of facets
  - excludes: can specify a list of Item facets to ignore at the Collection level
- facets calculated from other sources [0..*]
- default facets: set in the configuration [0..*]
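The aggregation described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `aggregate_facets`, the choice to collect unique values into lists, and the rule that aggregated values override defaults are all hypothetical, not part of the proposal.

```python
from collections import defaultdict


def aggregate_facets(children, excludes=(), defaults=None):
    """Merge child facet dicts into one parent dict.

    Assumptions (not from the proposal): values are collected as lists of
    unique values, and aggregated facets override configured defaults.
    """
    merged = defaultdict(list)
    for child in children:
        for facet, value in child.items():
            if facet in excludes:
                continue  # facet is ignored at the parent level
            values = value if isinstance(value, list) else [value]
            for v in values:
                if v not in merged[facet]:
                    merged[facet].append(v)
    result = dict(defaults or {})
    result.update(merged)
    return result


# Two Assets with the facet names used in this issue (values are made up).
assets = [
    {"experiment": "exp1", "aircraft_name": "plane_a", "height": 100, "flight_path": "north"},
    {"experiment": "exp1", "aircraft_name": "plane_b", "height": 200, "flight_path": "south"},
]

# Item level: drop height and flight_path, add a configured default.
item = aggregate_facets(assets, excludes=["height", "flight_path"], defaults={"grouping": "clan"})

# Collection level: aggregate Items and drop a further facet.
collection = aggregate_facets([item], excludes=["aircraft_name"])
```

Each level applies the same operation to the level below, so the facet set can only shrink or stay the same as it moves from Assets to Items to Collections.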

Proposed representation in YAML

paths: /badc/cmip6
collections: cmip6

id_facets:
   - grid_type
   - experiment
   - aircraft_name 

asset:
    defaults: 
        model: nice_model
    facets:
        - experiment
        - aircraft_name
        - height
        - flight_path
    extraction_methods: ...
        function:
            source: extractors.flight_tools.get_flight_path
        params:
             resolution: medium
        extracted_facets:
             - aircraft_name
             - flight_path

item:
    defaults: 
        grouping: clan
    facets:
        grid_type
        includes: ....???
        excludes: 
            - height
            - flight_path
    extraction_methods: ...
        function:
            source: extractors.item_tools.get_grid_type
        extracted_facets:
             - grid_type

collection:
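
A `function: source:` entry such as `extractors.flight_tools.get_flight_path` reads like a dotted Python import path. A minimal sketch of how such a string might be resolved is shown here; `load_function` is a hypothetical helper, and because the `extractors` package is not available in this sketch, a standard-library function stands in for it.

```python
from importlib import import_module


def load_function(dotted_path):
    """Import and return the callable named by a 'package.module.func' string."""
    module_name, _, func_name = dotted_path.rpartition(".")
    return getattr(import_module(module_name), func_name)


# Stand-in for a configured source such as extractors.flight_tools.get_flight_path.
basename = load_function("posixpath.basename")
name = basename("/badc/cmip6/example_file.nc")
```

Any `params` from the configuration could then be passed to the loaded function as keyword arguments.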

rhysrevans3 commented 2 years ago

Could ids, defaults, aggregations etc. become explicit extraction_methods?
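
One way to picture this: if every kind of facet creation shares a call signature, defaults become just another method. The sketch below is an assumption-laden illustration; the `METHODS` registry, the `name`/`inputs` keys, and the `(path, inputs)` signature are hypothetical.

```python
import re


def defaults_method(path, inputs):
    """Return the configured values unchanged, regardless of the path."""
    return dict(inputs)


def regex_method(path, inputs):
    """Return the named groups of the configured pattern, or nothing if no match."""
    match = re.search(inputs["regex"], path)
    return match.groupdict() if match else {}


# Hypothetical registry keyed by the method name used in the configuration.
METHODS = {"defaults": defaults_method, "regex": regex_method}


def run_methods(path, methods):
    """Apply each configured method in order and merge the resulting facets."""
    facets = {}
    for method in methods:
        func = METHODS[method["name"]]
        facets.update(func(path, method.get("inputs", {})))
    return facets


config = [
    {"name": "defaults", "inputs": {"collection_id": "cmip6"}},
    {"name": "regex", "inputs": {"regex": r"/(?P<mip_era>\w+)/data/"}},
]
facets = run_methods("/badc/cmip6/data/", config)
```

With this shape, ids and aggregations would only need their own entries in the registry.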

rhysrevans3 commented 2 years ago

Example of moving all facet creation into extraction_methods:

paths:
  - /badc/cmip6/data/

asset:

  extraction_methods:

    - name: auto
      extracted_facets:
        - id

    - name: defaults
      inputs:
        license: CC-BY-SA-4.0
        general_data_type: climate models
        permitted_use:
          - academic
          - educational
          - commercial
          - policy
          - personal
      extracted_facets:
        - license
        - general_data_type
        - permitted_use

    - name: regex
      inputs:
        regex: '^(?P<var_id>[^_]+)_(?P<table_id>[^_]+)_(?P<source_id>[^_]+)_(?P<experiment_id>[^_]+)_r(?P<realization_index>\d*)i(?P<initialisation_index>\d*)p(?P<physics_index>\d*)f(?P<forcing_index>\d*)_(?P<grid_label>[^_.]+)'
      pre_processors:
        - name: filename_reducer
      extracted_facets:
        - var_id
        - table_id
        - source_id
        - experiment_id
        - realization_index
        - initialisation_index
        - physics_index
        - forcing_index
        - grid_label

    - name: regex
      inputs:
        regex: '_(?P<start_datetime>\d*)-(?P<end_datetime>\d*)\.'
      pre_processors:
        - name: filename_reducer
      post_processors:
        - name: isodate_processor
          inputs:
            format: '%Y%m'
            date_keys:
              - start_datetime
              - end_datetime
      extracted_facets:
        - start_datetime
        - end_datetime

    - name: regex
      inputs:
        regex: '^\/(?:[^/]*/){3}(?P<mip_era>\w*)\/(?P<activity_id>\w*)\/(?P<institution_id>[\w-]*)\/(?:[^/]*/)'
      extracted_facets:
        - mip_era
        - activity_id
        - institution_id

    - name: regex
      inputs:
        regex: '^\/(?:[^/]*/){12}v(?P<version>\w*)'
      extracted_facets:
        - version

  post_extraction_methods:

    - name: hash
      inputs:
        terms:
          - mip_era
          - activity_id
          - institution_id
          - table_id
          - source_id
          - var_id
          - version
      extracted_facets: item_id

    - name: vocab
      inputs:
        vocab: cmip6
        strict: False
        terms:
          - var_id
          - table_id
          - source_id
          - experiment_id
          - realization_index
          - initialisation_index
          - physics_index
          - forcing_index
          - grid_label
          - general_data_type
          - permitted_use
      extracted_facets: null

item:

  extraction_methods:

    - name: defaults
      inputs:
        collection_id: cmip6

    - name: elasticsearch_aggregator
      inputs:
        url: elasticsearch.com
        index: assets
        exclude:
          - var_id
      extracted_facets:
        - table_id
        - source_id
        - experiment_id
        - realization_index
        - initialisation_index
        - physics_index
        - forcing_index
        - grid_label
        - general_data_type
        - permitted_use
        - license
        - start_datetime
        - end_datetime

  post_extraction_methods:
    - name: hash
      inputs:
        terms:
          - mip_era
          - activity_id
          - institution_id
          - table_id
          - source_id
          - var_id
          - version
      extracted_facets: id

collection:

  extraction_methods:

    - name: defaults
      inputs:
        id: cmip6

    - name: elasticsearch_aggregator
      inputs:
        url: elasticsearch.com
        index: item
        exclude:
          - table_id
      extracted_facets:
        - source_id
        - experiment_id
        - realization_index
        - initialisation_index
        - physics_index
        - forcing_index
        - grid_label
        - general_data_type
        - permitted_use
        - license
        - start_datetime
        - end_datetime
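
To make the pre_processor / regex / post_processor chain concrete, here is a rough Python sketch of the date extraction step. The function bodies are guesses at what `filename_reducer` and `isodate_processor` might do (reduce a path to its file name; reformat the listed keys as ISO 8601), and the file path is a made-up example, so treat everything here as an assumption.

```python
import re
from datetime import datetime
from pathlib import PurePosixPath

# The date regex from the configuration above.
DATE_REGEX = r"_(?P<start_datetime>\d*)-(?P<end_datetime>\d*)\."


def filename_reducer(path):
    """Assumed behaviour: reduce a full path to its file name."""
    return PurePosixPath(path).name


def isodate_processor(facets, fmt, date_keys):
    """Assumed behaviour: parse the listed keys with fmt and emit ISO 8601 strings."""
    out = dict(facets)
    for key in date_keys:
        if out.get(key):
            out[key] = datetime.strptime(out[key], fmt).isoformat()
    return out


# Hypothetical file path, used only for illustration.
path = "/some/dir/tas_Amon_UKESM1-0-LL_historical_r1i1p1f2_gn_185001-194912.nc"

match = re.search(DATE_REGEX, filename_reducer(path))
raw = match.groupdict() if match else {}
facets = isodate_processor(raw, "%Y%m", ["start_datetime", "end_datetime"])
```

Running the pre-processor first matters here: the regex is only meant to see the file name, not the directory components.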

Questions:

Should id for item and collection also be extracted in the elasticsearch_aggregator?
Should id be kept separate from extraction_methods?
The extracted_facets for the elasticsearch_aggregator is effectively an include. Does this negate the use of exclude?
Is extraction_methods still the right name?
Can ids be inferred across asset/item/collection to stop duplication? Or is it better to be explicit?

rhysrevans3 commented 2 years ago

Example with ids extracted:

paths:
  - /badc/cmip6/data/

asset:

  id: 
    method: auto

  extraction_methods:

    - name: defaults
      inputs:
        license: CC-BY-SA-4.0
        general_data_type: climate models
        permitted_use:
          - academic
          - educational
          - commercial
          - policy
          - personal
      extracted_facets:
        - license
        - general_data_type
        - permitted_use

    - name: regex
      inputs:
        regex: '^(?P<var_id>[^_]+)_(?P<table_id>[^_]+)_(?P<source_id>[^_]+)_(?P<experiment_id>[^_]+)_r(?P<realization_index>\d*)i(?P<initialisation_index>\d*)p(?P<physics_index>\d*)f(?P<forcing_index>\d*)_(?P<grid_label>[^_.]+)'
      pre_processors:
        - name: filename_reducer
      extracted_facets:
        - var_id
        - table_id
        - source_id
        - experiment_id
        - realization_index
        - initialisation_index
        - physics_index
        - forcing_index
        - grid_label

    - name: regex
      inputs:
        regex: '_(?P<start_datetime>\d*)-(?P<end_datetime>\d*)\.'
      pre_processors:
        - name: filename_reducer
      post_processors:
        - name: isodate_processor
          inputs:
            format: '%Y%m'
            date_keys:
              - start_datetime
              - end_datetime
      extracted_facets:
        - start_datetime
        - end_datetime

    - name: regex
      inputs:
        regex: '^\/(?:[^/]*/){3}(?P<mip_era>\w*)\/(?P<activity_id>\w*)\/(?P<institution_id>[\w-]*)\/(?:[^/]*/)'
      extracted_facets:
        - mip_era
        - activity_id
        - institution_id

    - name: regex
      inputs:
        regex: '^\/(?:[^/]*/){12}v(?P<version>\w*)'
      extracted_facets:
        - version

  post_extraction_methods:

    - name: hash
      inputs:
        terms:
          - mip_era
          - activity_id
          - institution_id
          - table_id
          - source_id
          - var_id
          - version
      extracted_facets: item_id

    - name: vocab
      inputs:
        vocab: cmip6
        strict: False
        terms:
          - var_id
          - table_id
          - source_id
          - experiment_id
          - realization_index
          - initialisation_index
          - physics_index
          - forcing_index
          - grid_label
          - general_data_type
          - permitted_use
      extracted_facets: null

item:

  id: 
    method: hash
    inputs:
      terms:
        - mip_era
        - activity_id
        - institution_id
        - table_id
        - source_id
        - var_id
        - version

  extraction_methods:

    - name: defaults
      inputs:
        collection_id: cmip6

    - name: elasticsearch_aggregator
      inputs:
        url: elasticsearch.com
        index: assets
        exclude:
          - var_id
      extracted_facets:
        - table_id
        - source_id
        - experiment_id
        - realization_index
        - initialisation_index
        - physics_index
        - forcing_index
        - grid_label
        - general_data_type
        - permitted_use
        - license
        - start_datetime
        - end_datetime

collection:

  id:
    method: defaults
    inputs: cmip6

  extraction_methods:

    - name: elasticsearch_aggregator
      inputs:
        url: elasticsearch.com
        index: item
        exclude:
          - table_id
      extracted_facets:
        - source_id
        - experiment_id
        - realization_index
        - initialisation_index
        - physics_index
        - forcing_index
        - grid_label
        - general_data_type
        - permitted_use
        - license
        - start_datetime
        - end_datetime
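
For the `hash` id method, a minimal sketch might look like the following. The separator, the use of SHA-256, and the handling of missing terms are all assumptions; the issue does not specify how the hash is built.

```python
import hashlib


def hash_id(facets, terms):
    """Join the listed facet values in order and return a hex digest.

    Assumptions: values are joined with '.', missing terms become empty
    strings, and SHA-256 is used. None of this is specified in the issue.
    """
    joined = ".".join(str(facets.get(term, "")) for term in terms)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


terms = ["mip_era", "activity_id", "institution_id", "table_id", "source_id"]

# Two made-up Assets that differ only in a facet outside the hashed terms.
asset_a = {"mip_era": "CMIP6", "activity_id": "CMIP", "institution_id": "MOHC",
           "table_id": "Amon", "source_id": "UKESM1-0-LL", "grid_label": "gn"}
asset_b = dict(asset_a, grid_label="gr")

id_a = hash_id(asset_a, terms)
id_b = hash_id(asset_b, terms)
```

Assets that agree on every hashed term get the same id, which is what would let them be grouped under one Item.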

rhysrevans3 commented 2 years ago

More notes:

How to handle defaults:
- Specify at each level.
- Include in the elasticsearch aggregator.
- Have a separate method (elasticsearch collector?).
Do we need extracted_facets?
- Could each extraction method have a function that returns the list of facets it will extract?
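
For the regex method at least, the facet list can be derived from the pattern itself, since Python's compiled patterns expose their named groups through `groupindex`. A small sketch, where the `RegexMethod` class and its `facets()` method are hypothetical names:

```python
import re


class RegexMethod:
    """Hypothetical extraction method that can report its own facets."""

    def __init__(self, regex):
        self.pattern = re.compile(regex)

    def facets(self):
        """Return the names of the facets this method can extract."""
        return list(self.pattern.groupindex)

    def run(self, text):
        """Return the extracted facets, or an empty dict if nothing matches."""
        match = self.pattern.search(text)
        return match.groupdict() if match else {}


method = RegexMethod(r"^(?P<var_id>[^_]+)_(?P<table_id>[^_]+)_(?P<source_id>[^_]+)")
declared = method.facets()
extracted = method.run("tas_Amon_UKESM1-0-LL_historical")
```

Other methods, such as defaults, could report the keys of their inputs in the same way, which would make a separate extracted_facets list redundant for them.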

rhysrevans3 commented 2 years ago

Simple example:

paths:
  - /badc/cmip6/data/

asset:

  id: 
    method: auto

  extraction_methods:

    - name: defaults
      inputs:
        license: CC-BY-SA-4.0
        permitted_use:
          - academic
          - educational
      extracted_facets:
        - license
        - permitted_use

    - name: regex
      inputs:
        regex: '^(?P<var_id>[^_]+)_(?P<table_id>[^_]+)_(?P<source_id>[^_]+)_(?P<experiment_id>[^_]+)'
      pre_processors:
        - name: filename_reducer
      extracted_facets:
        - var_id
        - table_id
        - source_id
        - experiment_id

  post_extraction_methods:

    - name: vocab
      inputs:
        vocab: cmip6
        strict: False
        terms:
          - var_id
          - table_id
          - source_id
          - experiment_id
          - permitted_use
      extracted_facets: null

item:

  id: 
    method: hash
    inputs:
      terms:
        - table_id
        - source_id
        - experiment_id

  extraction_methods:

    - name: elasticsearch_aggregator
      inputs:
        url: elasticsearch.com
        index: assets
        exclude:
          - var_id
      extracted_facets:
        - table_id
        - source_id
        - experiment_id
        - permitted_use
        - license

collection:

  id:
    method: defaults
    inputs: cmip6

  extraction_methods:

    - name: elasticsearch_aggregator
      inputs:
        url: elasticsearch.com
        index: item
      extracted_facets:
        - table_id
        - source_id
        - experiment_id
        - permitted_use
        - license

agstephens commented 2 years ago

@rhysrevans3: that example looks quite nicely structured. We can run it past the team later.

cedadev / search-futures