cedadev / search-futures

Future Search Architecture
BSD 2-Clause "Simplified" License

Restructure plan for item-descriptions #138

Open agstephens opened 2 years ago

agstephens commented 2 years ago

General concept for restructuring item descriptions

What should we call them?

Better name: Collection Descriptions

Structure of each document

Representing facets in the document hierarchy

We are proposing a new approach:

  1. Asset facets will appear in Assets:

    • facets captured from data (typically files) [0..*]
        • count: largest set of facets
    • facets calculated from other sources [0..*]
    • default facets: set in the configuration [0..*]
  2. Item facets will appear in Items:

    • facets aggregated from Assets [0..*]:
        • default: all Asset facets
        • count: intermediate set of facets
        • excludes: can specify a list of Asset facets to ignore at the Item level
    • facets calculated from other sources [0..*]
    • default facets: set in the configuration [0..*]
  3. Collection facets (properties) will appear in Collection records, and include:

    • facets aggregated from Items [0..*]:
        • count: smallest set of facets
        • excludes: can specify a list of Item facets to ignore at the Collection level
    • facets calculated from other sources [0..*]
    • default facets: set in the configuration [0..*]
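
The aggregation described above can be sketched roughly as follows. This is only an illustration of the idea, not the project's implementation: `aggregate_facets` and its arguments are hypothetical names, and the sample facet values are made up.

```python
from collections import defaultdict


def aggregate_facets(children, excludes=(), defaults=None):
    """Merge child facet dicts into one parent facet dict (hypothetical helper).

    Each facet maps to the sorted list of distinct values seen in the children.
    Facets named in `excludes` are dropped, and `defaults` only fills in
    facets that the children did not provide.
    """
    merged = defaultdict(set)
    for child in children:
        for name, value in child.items():
            if name in excludes:
                continue
            # Accept both single values and already-aggregated lists.
            values = value if isinstance(value, (list, tuple, set)) else [value]
            merged[name].update(values)
    parent = {name: sorted(values) for name, values in merged.items()}
    for name, value in (defaults or {}).items():
        parent.setdefault(name, value)
    return parent


# Made-up Asset facets, for illustration only.
assets = [
    {"experiment": "historical", "height": 100, "flight_path": "a"},
    {"experiment": "historical", "height": 200, "flight_path": "b"},
]

# Item level: aggregate all Asset facets except the excluded ones.
item = aggregate_facets(
    assets, excludes={"height", "flight_path"}, defaults={"grouping": "clan"}
)

# Collection level: aggregate Item facets, again with its own excludes.
collection = aggregate_facets([item], excludes={"grouping"})
```

Each level ends up with a smaller facet set than the one below it, which matches the "largest / intermediate / smallest" counts in the list.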

Proposed representation in YAML

paths: /badc/cmip6
collections: cmip6

id_facets:
   - grid_type
   - experiment
   - aircraft_name 

asset:
    defaults:
        model: nice_model
    facets:
        - experiment
        - aircraft_name
        - height
        - flight_path
    extraction_methods:
        # ...
        - function:
              source: extractors.flight_tools.get_flight_path
          params:
              resolution: medium
          extracted_facets:
              - aircraft_name
              - flight_path

item:
    defaults:
        grouping: clan
    facets:
        - grid_type
    includes: ....???
    excludes:
        - height
        - flight_path
    extraction_methods:
        # ...
        - function:
              source: extractors.item_tools.get_grid_type
          extracted_facets:
              - grid_type

collection:

rhysrevans3 commented 2 years ago

Could ids, defaults, aggregations etc. become explicit extraction_methods?
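
One way to picture this: every source of facets, including defaults, is a named method behind a common interface, and the config is just an ordered list of method calls. A minimal sketch, assuming a simple in-process registry; `EXTRACTION_METHODS`, `extraction_method` and `run_extraction` are hypothetical names, not existing code.

```python
import re

# Hypothetical registry mapping a method name to a callable.
EXTRACTION_METHODS = {}


def extraction_method(name):
    """Decorator that registers a function under a method name."""
    def register(func):
        EXTRACTION_METHODS[name] = func
        return func
    return register


@extraction_method("defaults")
def defaults(path, inputs):
    # Defaults ignore the path and simply return the configured values.
    return dict(inputs)


@extraction_method("regex")
def regex(path, inputs):
    # Named groups in the pattern become facet names.
    match = re.search(inputs["regex"], path)
    return match.groupdict() if match else {}


def run_extraction(path, method_configs):
    """Apply each configured method in order; later methods override earlier ones."""
    facets = {}
    for config in method_configs:
        method = EXTRACTION_METHODS[config["name"]]
        facets.update(method(path, config.get("inputs", {})))
    return facets


configs = [
    {"name": "defaults", "inputs": {"license": "CC-BY-SA-4.0"}},
    {"name": "regex", "inputs": {"regex": r"^(?P<var_id>[^_]+)_(?P<table_id>[^_]+)"}},
]
facets = run_extraction("tas_Amon_UKESM1-0-LL_historical.nc", configs)
```

With this shape, ids and aggregations would just be further registered methods rather than special top-level keys.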

rhysrevans3 commented 2 years ago

Example of moving all facet creation into extraction_methods:

paths:
  - /badc/cmip6/data/

asset:

  extraction_methods:

    - name: auto
      extracted_facets:
        - id

    - name: defaults
      inputs:
        license: CC-BY-SA-4.0
        general_data_type: climate models
        permitted_use:
          - academic
          - educational
          - commercial
          - policy
          - personal
      extracted_facets:
        - license
        - general_data_type
        - permitted_use

    - name: regex
      inputs:
        regex: '^(?P<var_id>[^_]+)_(?P<table_id>[^_]+)_(?P<source_id>[^_]+)_(?P<experiment_id>[^_]+)_r(?P<realization_index>\d*)i(?P<initialisation_index>\d*)p(?P<physics_index>\d*)f(?P<forcing_index>\d*)_(?P<grid_label>[^_.]+)'
      pre_processors:
        - name: filename_reducer
      extracted_facets:
        - var_id
        - table_id
        - source_id
        - experiment_id
        - realization_index
        - initialisation_index
        - physics_index
        - forcing_index
        - grid_label

    - name: regex
      inputs:
        regex: '_(?P<start_datetime>\d*)-(?P<end_datetime>\d*)\.'
      pre_processors:
        - name: filename_reducer
      post_processors:
        - name: isodate_processor
          inputs:
            format: '%Y%m'
            date_keys:
              - start_datetime
              - end_datetime
      extracted_facets:
        - start_datetime
        - end_datetime

    - name: regex
      inputs:
        regex: '^\/(?:[^/]*/){3}(?P<mip_era>\w*)\/(?P<activity_id>\w*)\/(?P<institution_id>[\w-]*)\/(?:[^/]*/)'
      extracted_facets:
        - mip_era
        - activity_id
        - institution_id

    - name: regex
      inputs:
        regex: '^\/(?:[^/]*/){12}v(?P<version>\w*)'
      extracted_facets:
        - version

  post_extraction_methods:

    - name: hash
      inputs:
        terms:
          - mip_era
          - activity_id
          - institution_id
          - table_id
          - source_id
          - var_id
          - version
      extracted_facets: item_id

    - name: vocab
      inputs:
        vocab: cmip6
        strict: False
        terms:
          - var_id
          - table_id
          - source_id
          - experiment_id
          - realization_index
          - initialisation_index
          - physics_index
          - forcing_index
          - grid_label
          - general_data_type
          - permitted_use
      extracted_facets: null

item:

  extraction_methods:

    - name: defaults
      inputs:
        collection_id: cmip6

    - name: elasticsearch_aggregator
      inputs:
        url: elasticsearch.com
        index: assets
        exclude:
          - var_id
      extracted_facets:
        - table_id
        - source_id
        - experiment_id
        - realization_index
        - initialisation_index
        - physics_index
        - forcing_index
        - grid_label
        - general_data_type
        - permitted_use
        - license
        - start_datetime
        - end_datetime

  post_extraction_methods:
    - name: hash
      inputs:
        terms:
          - mip_era
          - activity_id
          - institution_id
          - table_id
          - source_id
          - var_id
          - version
      extracted_facets: id

collection:

  extraction_methods:

    - name: defaults
      inputs:
        id: cmip6

    - name: elasticsearch_aggregator
      inputs:
        url: elasticsearch.com
        index: item
        exclude:
          - table_id
      extracted_facets:
        - source_id
        - experiment_id
        - realization_index
        - initialisation_index
        - physics_index
        - forcing_index
        - grid_label
        - general_data_type
        - permitted_use
        - license
        - start_datetime
        - end_datetime
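
To check the filename regexes in the asset section, here is a rough Python sketch of the regex steps with their pre- and post-processors. The behaviour assumed for `filename_reducer` (strip the directory) and `isodate_processor` (parse with `format`, emit ISO 8601) is a guess from the names, and the sample path is hypothetical.

```python
import re
from datetime import datetime
from pathlib import PurePosixPath

FACET_REGEX = re.compile(
    r"^(?P<var_id>[^_]+)_(?P<table_id>[^_]+)_(?P<source_id>[^_]+)_(?P<experiment_id>[^_]+)"
    r"_r(?P<realization_index>\d*)i(?P<initialisation_index>\d*)"
    r"p(?P<physics_index>\d*)f(?P<forcing_index>\d*)_(?P<grid_label>[^_.]+)"
)
DATE_REGEX = re.compile(r"_(?P<start_datetime>\d*)-(?P<end_datetime>\d*)\.")


def filename_reducer(path):
    """Assumed pre-processor: keep only the final path component."""
    return PurePosixPath(path).name


def isodate_processor(facets, date_format, date_keys):
    """Assumed post-processor: reformat the listed keys as ISO 8601 strings."""
    out = dict(facets)
    for key in date_keys:
        if out.get(key):
            out[key] = datetime.strptime(out[key], date_format).isoformat()
    return out


def extract(path):
    """Run both filename regexes and merge their named groups."""
    filename = filename_reducer(path)
    facets = {}
    match = FACET_REGEX.search(filename)
    if match:
        facets.update(match.groupdict())
    match = DATE_REGEX.search(filename)
    if match:
        facets.update(
            isodate_processor(
                match.groupdict(), "%Y%m", ["start_datetime", "end_datetime"]
            )
        )
    return facets


# Hypothetical path, for illustration only.
sample = (
    "/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r1i1p1f2/"
    "Amon/tas/gn/v20190406/tas_Amon_UKESM1-0-LL_historical_r1i1p1f2_gn_185001-194912.nc"
)
facets = extract(sample)
```

Note that all captured values, including the index facets, come out as strings; any type conversion would need another post-processor.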

Questions:

rhysrevans3 commented 2 years ago

Example with ids extracted:

paths:
  - /badc/cmip6/data/

asset:

  id: 
    method: auto

  extraction_methods:

    - name: defaults
      inputs:
        license: CC-BY-SA-4.0
        general_data_type: climate models
        permitted_use:
          - academic
          - educational
          - commercial
          - policy
          - personal
      extracted_facets:
        - license
        - general_data_type
        - permitted_use

    - name: regex
      inputs:
        regex: '^(?P<var_id>[^_]+)_(?P<table_id>[^_]+)_(?P<source_id>[^_]+)_(?P<experiment_id>[^_]+)_r(?P<realization_index>\d*)i(?P<initialisation_index>\d*)p(?P<physics_index>\d*)f(?P<forcing_index>\d*)_(?P<grid_label>[^_.]+)'
      pre_processors:
        - name: filename_reducer
      extracted_facets:
        - var_id
        - table_id
        - source_id
        - experiment_id
        - realization_index
        - initialisation_index
        - physics_index
        - forcing_index
        - grid_label

    - name: regex
      inputs:
        regex: '_(?P<start_datetime>\d*)-(?P<end_datetime>\d*)\.'
      pre_processors:
        - name: filename_reducer
      post_processors:
        - name: isodate_processor
          inputs:
            format: '%Y%m'
            date_keys:
              - start_datetime
              - end_datetime
      extracted_facets:
        - start_datetime
        - end_datetime

    - name: regex
      inputs:
        regex: '^\/(?:[^/]*/){3}(?P<mip_era>\w*)\/(?P<activity_id>\w*)\/(?P<institution_id>[\w-]*)\/(?:[^/]*/)'
      extracted_facets:
        - mip_era
        - activity_id
        - institution_id

    - name: regex
      inputs:
        regex: '^\/(?:[^/]*/){12}v(?P<version>\w*)'
      extracted_facets:
        - version

  post_extraction_methods:

    - name: hash
      inputs:
        terms:
          - mip_era
          - activity_id
          - institution_id
          - table_id
          - source_id
          - var_id
          - version
      extracted_facets: item_id

    - name: vocab
      inputs:
        vocab: cmip6
        strict: False
        terms:
          - var_id
          - table_id
          - source_id
          - experiment_id
          - realization_index
          - initialisation_index
          - physics_index
          - forcing_index
          - grid_label
          - general_data_type
          - permitted_use
      extracted_facets: null

item:

  id: 
    method: hash
    inputs:
      terms:
        - mip_era
        - activity_id
        - institution_id
        - table_id
        - source_id
        - var_id
        - version

  extraction_methods:

    - name: defaults
      inputs:
        collection_id: cmip6

    - name: elasticsearch_aggregator
      inputs:
        url: elasticsearch.com
        index: assets
        exclude:
          - var_id
      extracted_facets:
        - table_id
        - source_id
        - experiment_id
        - realization_index
        - initialisation_index
        - physics_index
        - forcing_index
        - grid_label
        - general_data_type
        - permitted_use
        - license
        - start_datetime
        - end_datetime

collection:

  id:
    method: defaults
    inputs: cmip6

  extraction_methods:

    - name: elasticsearch_aggregator
      inputs:
        url: elasticsearch.com
        index: item
        exclude:
          - table_id
      extracted_facets:
        - source_id
        - experiment_id
        - realization_index
        - initialisation_index
        - physics_index
        - forcing_index
        - grid_label
        - general_data_type
        - permitted_use
        - license
        - start_datetime
        - end_datetime
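
For the `hash` id method, something along these lines might be enough. This is a sketch under assumptions: the choice of MD5, the `.` separator and the name `hash_id` are mine, not anything decided in this thread.

```python
import hashlib


def hash_id(facets, terms):
    """Build a deterministic id from the selected facet values (hypothetical helper).

    Values are joined in the order given by `terms`, so the same terms in the
    same order always give the same id. Missing terms contribute an empty string.
    """
    values = [str(facets.get(term, "")) for term in terms]
    return hashlib.md5(".".join(values).encode("utf-8")).hexdigest()


terms = ["table_id", "source_id", "experiment_id"]

# Two made-up Assets that differ only in a facet that is not an id term.
asset_a = {"table_id": "Amon", "source_id": "UKESM1-0-LL",
           "experiment_id": "historical", "var_id": "tas"}
asset_b = dict(asset_a, var_id="pr")

# A third Asset from a different experiment.
asset_c = dict(asset_a, experiment_id="ssp585")

same_item = hash_id(asset_a, terms) == hash_id(asset_b, terms)
different_item = hash_id(asset_a, terms) != hash_id(asset_c, terms)
```

Assets that agree on every id term land in the same Item, which is the grouping the `terms` list is meant to express.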

rhysrevans3 commented 2 years ago

More notes:

rhysrevans3 commented 2 years ago

Simple example:

paths:
  - /badc/cmip6/data/

asset:

  id: 
    method: auto

  extraction_methods:

    - name: defaults
      inputs:
        license: CC-BY-SA-4.0
        permitted_use:
          - academic
          - educational
      extracted_facets:
        - license
        - permitted_use

    - name: regex
      inputs:
        regex: '^(?P<var_id>[^_]+)_(?P<table_id>[^_]+)_(?P<source_id>[^_]+)_(?P<experiment_id>[^_]+)'
      pre_processors:
        - name: filename_reducer
      extracted_facets:
        - var_id
        - table_id
        - source_id
        - experiment_id

  post_extraction_methods:

    - name: vocab
      inputs:
        vocab: cmip6
        strict: False
        terms:
          - var_id
          - table_id
          - source_id
          - experiment_id
          - permitted_use
      extracted_facets: null

item:

  id: 
    method: hash
    inputs:
      terms:
        - table_id
        - source_id
        - experiment_id

  extraction_methods:

    - name: elasticsearch_aggregator
      inputs:
        url: elasticsearch.com
        index: assets
        exclude:
          - var_id
      extracted_facets:
        - table_id
        - source_id
        - experiment_id
        - permitted_use
        - license

collection:

  id:
    method: defaults
    inputs: cmip6

  extraction_methods:

    - name: elasticsearch_aggregator
      inputs:
        url: elasticsearch.com
        index: item
      extracted_facets:
        - table_id
        - source_id
        - experiment_id
        - permitted_use
        - license
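
As a rough idea of what `elasticsearch_aggregator` could send to Elasticsearch, the sketch below builds a terms-aggregation request body as a plain dict. The function name, the `item_id` filter field and the assumption that each facet is indexed as a keyword field under its own name are all hypothetical.

```python
def build_aggregation_query(parent_id_field, parent_id, facets, exclude=(), size=100):
    """Build a request body that collects distinct facet values for one parent.

    `size: 0` suppresses the hits themselves; only the aggregations are returned.
    One terms aggregation is created per facet that is not excluded.
    """
    return {
        "size": 0,
        "query": {"term": {parent_id_field: parent_id}},
        "aggs": {
            facet: {"terms": {"field": facet, "size": size}}
            for facet in facets
            if facet not in exclude
        },
    }


body = build_aggregation_query(
    parent_id_field="item_id",  # assumed field linking Assets to their Item
    parent_id="abc123",         # made-up Item id
    facets=["var_id", "table_id", "source_id", "experiment_id"],
    exclude=["var_id"],
)
```

The bucket keys returned under each aggregation would then become the Item's (or Collection's) values for that facet.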

agstephens commented 2 years ago

@rhysrevans3: that example looks quite nicely structured. We can run it past the team later.