generate list of carbon-cycle data sets

DataONEorg / arctic-semantics

Semantic annotation work for the Arctic Data Center member repository

Apache License 2.0


generate list of carbon-cycle data sets #2

Open mbjones opened 6 years ago

mbjones commented 6 years ago

Need to generate the list of carbon-cycle data sets to be annotated. Start with one or more SOLR queries from https://arcticdata.io/catalog, and compile these into a parseable data table with appropriate attributes.

mobb commented 6 years ago

From notes, only ~1500 datasets currently have attribute descriptions. Jesse’s team is working through the other 3500 to add attribute-level metadata. Any dataset in the ADC is a candidate for annotation, so all datasets will need to be examined and understood at some level. We could ask the data team to add a keyword for us to query, but it would be safer to examine everything. We would rerun the query periodically.

So we need a query that returns query_date, pkgid, entity-name, and attributename.
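A minimal sketch of how query results could be compiled into that table, one row per package and attribute. It assumes the query is run with JSON output so that the result has the standard Solr `response.docs` layout, and that `attribute` comes back as a list of names. The function names and the `pkg-1` sample record are hypothetical, and the entity-name column is left empty because no index field is assumed for it here.

```python
import csv
import io
from datetime import date

FIELDS = ["query_date", "pkgid", "entity_name", "attribute_name"]


def docs_to_rows(solr_json, query_date=None):
    """Flatten a parsed Solr JSON response into one row per (package, attribute)."""
    query_date = query_date or date.today().isoformat()
    rows = []
    for doc in solr_json["response"]["docs"]:
        # Assumption: `attribute` is multi-valued; tolerate a bare string too.
        attributes = doc.get("attribute", [])
        if isinstance(attributes, str):
            attributes = [attributes]
        for name in attributes:
            rows.append({
                "query_date": query_date,
                "pkgid": doc["identifier"],
                "entity_name": "",  # no index field assumed for this column
                "attribute_name": name,
            })
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical sample record, shaped like a Solr JSON result.
sample = {"response": {"numFound": 1, "start": 0, "docs": [
    {"identifier": "pkg-1", "title": "Example", "attribute": ["temp", "depth"]},
]}}
table = rows_to_csv(docs_to_rows(sample, query_date="2018-01-01"))
```

Passing `query_date` explicitly keeps reruns comparable; by default the sketch stamps each row with the day it ran.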

Initial query might resemble: https://cn.dataone.org/cn/v1/query/solr/?fl=identifier,title,attribute&q=formatType:METADATA+AND+(datasource:*ARCTIC)+AND+-obsoletedBy:*+AND+(attribute:*)&rows=100&start=0
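Since this query will be rerun and adjusted, it may help to assemble the URL from its parts rather than edit the string by hand. A minimal sketch using only the Python standard library; `build_query_url` is a hypothetical helper name, and the terms and fields are the ones from the URL above.

```python
from urllib.parse import urlencode

CN_SOLR = "https://cn.dataone.org/cn/v1/query/solr/"


def build_query_url(base, terms, fields, rows=100, start=0):
    """Join Solr terms with AND and encode them as a query URL."""
    params = {
        "fl": ",".join(fields),
        "q": " AND ".join(terms),
        "rows": rows,
        "start": start,
    }
    # Leave Solr punctuation unescaped so the URL stays readable;
    # spaces are encoded as "+".
    return base + "?" + urlencode(params, safe=":*(),")


url = build_query_url(
    CN_SOLR,
    terms=[
        "formatType:METADATA",
        "(datasource:*ARCTIC)",
        "-obsoletedBy:*",
        "(attribute:*)",
    ],
    fields=["identifier", "title", "attribute"],
)
print(url)
```

Changing `start` in steps of `rows` pages through the full result set.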

Issues:

This query returned only 688 results, not 1500.
Still need to know what the Solr index calls the entity field. Not finding it here: http://indexer-documentation.readthedocs.io/en/latest/generated/solr_schema.html

Later queries will need to add a since-date filter, probably on dateModified or dateUploaded.
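The since-date filter could be expressed as a standard Solr range clause appended to the existing query terms. A minimal sketch; `since_clause` is a hypothetical helper, and whether dateUploaded or dateModified is the right field is still the open question noted above.

```python
from datetime import datetime, timezone


def since_clause(since, field="dateUploaded"):
    """Build a Solr range clause matching records dated at or after `since`.

    `since` must be a timezone-aware datetime; it is converted to UTC.
    """
    if since.tzinfo is None:
        raise ValueError("since must be timezone-aware")
    stamp = since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{field}:[{stamp} TO NOW]"


# Hypothetical cutoff: the date of the previous run.
last_run = datetime(2018, 1, 1, tzinfo=timezone.utc)
q = " AND ".join([
    "formatType:METADATA",
    "-obsoletedBy:*",
    "(attribute:*)",
    since_clause(last_run),
])
```

The brackets and spaces in the clause need URL encoding before the query is sent.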

mobb commented 6 years ago

Jesse says there may be only 600. 1500 is the number they have processed since the ACADIS migration (in April 2016). They did not define attributes at first; that began later (maybe December 2016).

mobb commented 6 years ago

Bryce says that the d1 EML path dataset/dataTable/entityName is not indexed. For the list of indexed fields, see: https://repository.dataone.org/software/cicore/trunk/cn/d1_cn_index_processor/src/main/resources/application-context-eml-base.xml

mobb commented 6 years ago

A similar query to the ADC returns a different number of datasets. ADC coders to investigate: https://gist.github.com/amoeba/2546994813f58edb8bc93ff6510767ef

So query the ADC MN. Start with this; note that the MN name is still there, but explicit (no wildcards): https://arcticdata.io/metacat/d1/mn/v2/query/solr/?fl=identifier,title,attribute&q=formatType:METADATA+AND+datasource:%22urn:node:ARCTIC%22+AND+-obsoletedBy:*+AND+(attribute:*)&rows=100&start=0
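To see which datasets account for the difference in counts between the two endpoints, one option is to page through each result set and compare the identifier sets. A minimal sketch, assuming JSON responses with the standard Solr `response.numFound` and `response.docs` layout. The `fetch_page` callable is hypothetical and is injected so the paging logic can be tried without network access.

```python
def fetch_all_ids(fetch_page, rows=100):
    """Collect every identifier by paging until numFound is reached.

    `fetch_page(start, rows)` must return a parsed Solr JSON response.
    """
    ids, start = set(), 0
    while True:
        response = fetch_page(start, rows)["response"]
        docs = response["docs"]
        ids.update(doc["identifier"] for doc in docs)
        start += rows
        if not docs or start >= response["numFound"]:
            return ids


def compare_ids(cn_ids, mn_ids):
    """Report identifiers present on only one of the two nodes."""
    return {
        "only_cn": sorted(cn_ids - mn_ids),
        "only_mn": sorted(mn_ids - cn_ids),
    }


def fake_endpoint(all_ids):
    """Stand-in for a real endpoint, serving pages from a fixed list."""
    def fetch_page(start, rows):
        page = all_ids[start:start + rows]
        return {"response": {"numFound": len(all_ids),
                             "docs": [{"identifier": i} for i in page]}}
    return fetch_page


cn = fetch_all_ids(fake_endpoint(["a", "b", "c"]), rows=2)
mn = fetch_all_ids(fake_endpoint(["b", "c", "d"]), rows=2)
print(compare_ids(cn, mn))  # {'only_cn': ['a'], 'only_mn': ['d']}
```

In real use, `fetch_page` would request the CN or MN URL with the given `start` and `rows` and parse the JSON body.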

mobb commented 6 years ago

We will want to recommend to the ADC data interns which datasets they should focus on as they enhance the metadata (add attribute descriptions). So we will want to review all the datasets. It would help to do this systematically. A creator tends to put in the same type of datasets, so examining chunks by creator could be a workable strategy. Working on that query.
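One way the creator-based chunks might be formed once the records are in hand. This is a sketch under assumptions, not the query itself: it assumes creators are returned in a multi-valued field, here called `origin`, which should be checked against the list of indexed fields. `group_by_creator` and the sample records are hypothetical.

```python
from collections import defaultdict


def group_by_creator(docs, creator_field="origin"):
    """Group dataset identifiers by first listed creator, largest group first."""
    groups = defaultdict(list)
    for doc in docs:
        creators = doc.get(creator_field) or ["(unknown)"]
        if isinstance(creators, str):
            creators = [creators]
        # File each dataset under its first creator so it lands in one chunk only.
        groups[creators[0]].append(doc["identifier"])
    return dict(sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True))


# Hypothetical records shaped like Solr result docs.
docs = [
    {"identifier": "pkg-1", "origin": ["Creator A", "Creator B"]},
    {"identifier": "pkg-2", "origin": ["Creator B"]},
    {"identifier": "pkg-3", "origin": ["Creator A"]},
    {"identifier": "pkg-4"},
]
chunks = group_by_creator(docs)
```

Sorting by group size puts the most prolific creators first, which is where reviewing one representative dataset would cover the most ground.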