Open agstephens opened 2 years ago
Could ids, defaults, aggregations etc. become explicit extraction_methods?
Example of moving all facet creation into extraction_methods:
paths:
  - /badc/cmip6/data/
asset:
  extraction_methods:
    - name: auto
      extracted_facets:
        - id
    - name: defaults
      inputs:
        license: CC-BY-SA-4.0
        general_data_type: climate models
        permitted_use:
          - academic
          - educational
          - commercial
          - policy
          - personal
      extracted_facets:
        - license
        - general_data_type
        - permitted_use
    - name: regex
      inputs:
        regex: '^(?P<var_id>[^_]+)_(?P<table_id>[^_]+)_(?P<source_id>[^_]+)_(?P<experiment_id>[^_]+)_r(?P<realization_index>\d*)i(?P<initialisation_index>\d*)p(?P<physics_index>\d*)f(?P<forcing_index>\d*)_(?P<grid_label>[^_.]+)'
      pre_processors:
        - name: filename_reducer
      extracted_facets:
        - var_id
        - table_id
        - source_id
        - experiment_id
        - realization_index
        - initialisation_index
        - physics_index
        - forcing_index
        - grid_label
    - name: regex
      inputs:
        regex: '_(?P<start_datetime>\d*)-(?P<end_datetime>\d*)\.'
      pre_processors:
        - name: filename_reducer
      post_processors:
        - name: isodate_processor
          inputs:
            format: '%Y%m'
            date_keys:
              - start_datetime
              - end_datetime
      extracted_facets:
        - start_datetime
        - end_datetime
    - name: regex
      inputs:
        regex: '^\/(?:[^/]*/){3}(?P<mip_era>\w*)\/(?P<activity_id>\w*)\/(?P<institution_id>[\w-]*)\/(?:[^/]*/)'
      extracted_facets:
        - mip_era
        - activity_id
        - institution_id
    - name: regex
      inputs:
        regex: '^\/(?:[^/]*/){12}v(?P<version>\w*)'
      extracted_facets:
        - version
  post_extraction_methods:
    - name: hash
      inputs:
        terms:
          - mip_era
          - activity_id
          - institution_id
          - table_id
          - source_id
          - var_id
          - version
      extracted_facets: item_id
    - name: vocab
      inputs:
        vocab: cmip6
        strict: False
        terms:
          - var_id
          - table_id
          - source_id
          - experiment_id
          - realization_index
          - initialisation_index
          - physics_index
          - forcing_index
          - grid_label
          - general_data_type
          - permitted_use
      extracted_facets: null
item:
  extraction_methods:
    - name: defaults
      inputs:
        collection_id: cmip6
    - name: elasticsearch_aggregator
      inputs:
        url: elasticsearch.com
        index: assets
        exclude:
          - var_id
      extracted_facets:
        - table_id
        - source_id
        - experiment_id
        - realization_index
        - initialisation_index
        - physics_index
        - forcing_index
        - grid_label
        - general_data_type
        - permitted_use
        - license
        - start_datetime
        - end_datetime
  post_extraction_methods:
    - name: hash
      inputs:
        terms:
          - mip_era
          - activity_id
          - institution_id
          - table_id
          - source_id
          - var_id
          - version
      extracted_facets: id
collection:
  extraction_methods:
    - name: defaults
      inputs:
        id: cmip6
    - name: elasticsearch_aggregator
      inputs:
        url: elasticsearch.com
        index: item
        exclude:
          - table_id
      extracted_facets:
        - source_id
        - experiment_id
        - realization_index
        - initialisation_index
        - physics_index
        - forcing_index
        - grid_label
        - general_data_type
        - permitted_use
        - license
        - start_datetime
        - end_datetime
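To make the asset-level regex methods concrete, here is a minimal Python sketch of what they might produce for a single file. The path is a hypothetical example chosen to fit the regexes, and the behaviour of filename_reducer (reduce the path to its filename) and isodate_processor (parse with the given format and emit ISO 8601) is assumed rather than taken from the implementation.

```python
import re
from datetime import datetime
from pathlib import PurePosixPath

# Hypothetical asset path, chosen only because it fits the example regexes.
PATH = ("/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r1i1p1f2/"
        "Amon/tas/gn/v20190406/"
        "tas_Amon_UKESM1-0-LL_historical_r1i1p1f2_gn_185001-194912.nc")

# The four regexes from the example config, unchanged.
FILENAME_REGEX = (r"^(?P<var_id>[^_]+)_(?P<table_id>[^_]+)_(?P<source_id>[^_]+)"
                  r"_(?P<experiment_id>[^_]+)_r(?P<realization_index>\d*)"
                  r"i(?P<initialisation_index>\d*)p(?P<physics_index>\d*)"
                  r"f(?P<forcing_index>\d*)_(?P<grid_label>[^_.]+)")
DATE_REGEX = r"_(?P<start_datetime>\d*)-(?P<end_datetime>\d*)\."
PATH_REGEX = (r"^\/(?:[^/]*/){3}(?P<mip_era>\w*)\/(?P<activity_id>\w*)"
              r"\/(?P<institution_id>[\w-]*)\/(?:[^/]*/)")
VERSION_REGEX = r"^\/(?:[^/]*/){12}v(?P<version>\w*)"


def extract(path: str) -> dict:
    """Run each regex method in order and merge the named groups."""
    facets = {}
    # Assumption: filename_reducer passes only the filename to the regex.
    filename = PurePosixPath(path).name
    steps = (
        (FILENAME_REGEX, filename),
        (DATE_REGEX, filename),
        (PATH_REGEX, path),
        (VERSION_REGEX, path),
    )
    for regex, target in steps:
        match = re.search(regex, target)
        if match:
            facets.update(match.groupdict())
    # Assumption: isodate_processor parses with '%Y%m' and emits ISO 8601.
    for key in ("start_datetime", "end_datetime"):
        if key in facets:
            facets[key] = datetime.strptime(facets[key], "%Y%m").isoformat()
    return facets


facets = extract(PATH)
print(facets)
```

In this sketch each method only contributes the named groups of its own regex, which is why the extracted_facets of the two path regexes should list mip_era, activity_id, institution_id and version rather than the date keys.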
Questions:
Should id for item and collection also be extracted in the elasticsearch_aggregator?
Should id be kept separate from extraction_methods?
extracted_facets for the elasticsearch_aggregator is effectively an include; does this negate the use of exclude?
Is extraction_methods still the right name?
Could ids be inferred across asset/item/collection to stop duplication? Or is it better to be explicit?
Example with ids extracted:
paths:
  - /badc/cmip6/data/
asset:
  id:
    method: auto
  extraction_methods:
    - name: defaults
      inputs:
        license: CC-BY-SA-4.0
        general_data_type: climate models
        permitted_use:
          - academic
          - educational
          - commercial
          - policy
          - personal
      extracted_facets:
        - license
        - general_data_type
        - permitted_use
    - name: regex
      inputs:
        regex: '^(?P<var_id>[^_]+)_(?P<table_id>[^_]+)_(?P<source_id>[^_]+)_(?P<experiment_id>[^_]+)_r(?P<realization_index>\d*)i(?P<initialisation_index>\d*)p(?P<physics_index>\d*)f(?P<forcing_index>\d*)_(?P<grid_label>[^_.]+)'
      pre_processors:
        - name: filename_reducer
      extracted_facets:
        - var_id
        - table_id
        - source_id
        - experiment_id
        - realization_index
        - initialisation_index
        - physics_index
        - forcing_index
        - grid_label
    - name: regex
      inputs:
        regex: '_(?P<start_datetime>\d*)-(?P<end_datetime>\d*)\.'
      pre_processors:
        - name: filename_reducer
      post_processors:
        - name: isodate_processor
          inputs:
            format: '%Y%m'
            date_keys:
              - start_datetime
              - end_datetime
      extracted_facets:
        - start_datetime
        - end_datetime
    - name: regex
      inputs:
        regex: '^\/(?:[^/]*/){3}(?P<mip_era>\w*)\/(?P<activity_id>\w*)\/(?P<institution_id>[\w-]*)\/(?:[^/]*/)'
      extracted_facets:
        - mip_era
        - activity_id
        - institution_id
    - name: regex
      inputs:
        regex: '^\/(?:[^/]*/){12}v(?P<version>\w*)'
      extracted_facets:
        - version
  post_extraction_methods:
    - name: hash
      inputs:
        terms:
          - mip_era
          - activity_id
          - institution_id
          - table_id
          - source_id
          - var_id
          - version
      extracted_facets: item_id
    - name: vocab
      inputs:
        vocab: cmip6
        strict: False
        terms:
          - var_id
          - table_id
          - source_id
          - experiment_id
          - realization_index
          - initialisation_index
          - physics_index
          - forcing_index
          - grid_label
          - general_data_type
          - permitted_use
      extracted_facets: null
item:
  id:
    method: hash
    inputs:
      terms:
        - mip_era
        - activity_id
        - institution_id
        - table_id
        - source_id
        - var_id
        - version
  extraction_methods:
    - name: defaults
      inputs:
        collection_id: cmip6
    - name: elasticsearch_aggregator
      inputs:
        url: elasticsearch.com
        index: assets
        exclude:
          - var_id
      extracted_facets:
        - table_id
        - source_id
        - experiment_id
        - realization_index
        - initialisation_index
        - physics_index
        - forcing_index
        - grid_label
        - general_data_type
        - permitted_use
        - license
        - start_datetime
        - end_datetime
collection:
  id:
    name: defaults
    inputs: cmip6
  extraction_methods:
    - name: elasticsearch_aggregator
      inputs:
        url: elasticsearch.com
        index: item
        exclude:
          - table_id
      extracted_facets:
        - source_id
        - experiment_id
        - realization_index
        - initialisation_index
        - physics_index
        - forcing_index
        - grid_label
        - general_data_type
        - permitted_use
        - license
        - start_datetime
        - end_datetime
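As a rough illustration of the hash id method used for the item above, the following Python sketch derives an id from the configured terms. The function name hash_id, the "." separator and the choice of MD5 are all assumptions made for the example; the real method may join and hash the terms differently.

```python
import hashlib


def hash_id(facets: dict, terms: list[str]) -> str:
    """Build a deterministic id from the values of the given terms.

    Assumptions (not confirmed by the proposal): values are joined with
    "." in the configured order, missing terms count as empty strings,
    and the digest is MD5.
    """
    joined = ".".join(str(facets.get(term, "")) for term in terms)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


TERMS = ["mip_era", "activity_id", "institution_id",
         "table_id", "source_id", "var_id", "version"]

# Two hypothetical assets that differ only in their time range.
asset_a = {"mip_era": "CMIP6", "activity_id": "CMIP", "institution_id": "MOHC",
           "table_id": "Amon", "source_id": "UKESM1-0-LL", "var_id": "tas",
           "version": "20190406", "start_datetime": "1850-01-01T00:00:00"}
asset_b = dict(asset_a, start_datetime="1950-01-01T00:00:00")

# Facets outside the term list do not affect the id, so both assets
# would be grouped under the same item.
print(hash_id(asset_a, TERMS) == hash_id(asset_b, TERMS))
```

Because the id depends only on the listed terms, the asset-level item_id and the item-level id agree only if both use the same term list, which is relevant to the question of whether ids should be inferred across levels or stated explicitly.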
More notes:
How to handle defaults:
Do we need extracted_facets?
Simple example:
paths:
  - /badc/cmip6/data/
asset:
  id:
    method: auto
  extraction_methods:
    - name: defaults
      inputs:
        license: CC-BY-SA-4.0
        permitted_use:
          - academic
          - educational
      extracted_facets:
        - license
        - permitted_use
    - name: regex
      inputs:
        regex: '^(?P<var_id>[^_]+)_(?P<table_id>[^_]+)_(?P<source_id>[^_]+)_(?P<experiment_id>[^_]+)'
      pre_processors:
        - name: filename_reducer
      extracted_facets:
        - var_id
        - table_id
        - source_id
        - experiment_id
  post_extraction_methods:
    - name: vocab
      inputs:
        vocab: cmip6
        strict: False
        terms:
          - var_id
          - table_id
          - source_id
          - experiment_id
          - permitted_use
      extracted_facets: null
item:
  id:
    method: hash
    inputs:
      terms:
        - table_id
        - source_id
        - experiment_id
  extraction_methods:
    - name: elasticsearch_aggregator
      inputs:
        url: elasticsearch.com
        index: assets
        exclude:
          - var_id
      extracted_facets:
        - table_id
        - source_id
        - experiment_id
        - permitted_use
        - license
collection:
  id:
    name: defaults
    inputs: cmip6
  extraction_methods:
    - name: elasticsearch_aggregator
      inputs:
        url: elasticsearch.com
        index: item
      extracted_facets:
        - table_id
        - source_id
        - experiment_id
        - permitted_use
        - license
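The item-level aggregation in the simple example can be emulated in plain Python to see how extracted_facets and exclude interact. This is only a sketch: the aggregate function is hypothetical, it works on in-memory dicts instead of querying Elasticsearch, and collapsing a single unique value to a scalar is an assumed design choice.

```python
def aggregate(assets, extracted_facets, exclude=()):
    """Summarise the unique values of each requested facet across assets.

    Hypothetical stand-in for elasticsearch_aggregator: a facet is kept
    only if it is listed in extracted_facets and not listed in exclude.
    """
    summary = {}
    for facet in extracted_facets:
        if facet in exclude:
            continue
        values = []
        for asset in assets:
            value = asset.get(facet)
            # Treat scalar and list-valued facets uniformly.
            candidates = value if isinstance(value, list) else [value]
            for candidate in candidates:
                if candidate is not None and candidate not in values:
                    values.append(candidate)
        if values:
            # Assumption: a single unique value is stored as a scalar.
            summary[facet] = values[0] if len(values) == 1 else values
    return summary


# Two hypothetical assets belonging to the same item.
shared = {"table_id": "Amon", "source_id": "UKESM1-0-LL",
          "experiment_id": "historical", "license": "CC-BY-SA-4.0",
          "permitted_use": ["academic", "educational"]}
assets = [dict(shared, var_id="tas"), dict(shared, var_id="pr")]

item_facets = aggregate(
    assets,
    extracted_facets=["table_id", "source_id", "experiment_id",
                      "permitted_use", "license"],
    exclude=["var_id"],
)
print(item_facets)
```

In this sketch var_id is already absent because it is not in extracted_facets, so exclude changes nothing; it would only matter if the aggregator defaulted to collecting every facet.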
@rhysrevans3: that example looks quite nicely structured. We can run it past the team later.
General concept for restructuring item descriptions
What should we call them?
Better name: Collection Descriptions
Structure of each document
Representing facets in the document hierarchy
We are proposing a new approach:
Asset facets will appear in Assets:
Item facets will appear in Items:
Collection facets (properties) will appear in Collection records, and include:
Proposed representation in YAML