Open gothub opened 6 years ago
The Worker class now has the method indexReport(), which uses a refactored version of the metacat-index component (refactored/extended from the DataONE d1_cn_index_processor component). This refactored component operates in the same manner as the original, using Spring application config files to easily add or modify the conversion of input files (XML, JSON) into Solr documents.
Here is an example Solr doc from a local Solr 7.3 server that is being used for testing:
"metadataId":"9CD87ED9-A531-419B-B24E-30CAE834EF72",
"formatId":"https://nceas.ucsb.edu/mdqe/v1",
"runId":"7a542172-b24b-4efc-8cc9-e77912e8e259",
"suiteId":"arctic.data.center.suite.1",
"timestamp":"2018-04-19T00:58:17.729Z",
"metadata_formatId":"eml://ecoinformatics.org/eml-2.1.1",
"metadata_datasource":"urn:node:mnTestKNB",
"metadata_funder":["NSF Award 1635550"],
"metadata_rightsHolder":"http://orcid.org/0000-0002-2192-403X",
"metadata_group":["CN=knb-data-admins,DC=dataone,DC=org",
"CN=arctic-data-admins,DC=dataone,DC=org",
"CN=SASAP,DC=dataone,DC=org"],
"score_green":18,
"score_orange":2,
"score_red":2,
"score_blue":10,
"score_total":32,
"score_identification":0.8823529411764706,
"score_interpretation":0.7272727272727273,
"score_discovery":0.75,
"score_composite":0.8181818181818182,
"_version_":1598134174481383424,
"score_other":"NaN"}]
It would be very useful to discuss what fields are needed from the metadata sysmeta and any other fields that are needed.
OK, yeah, let's discuss. I think we should discuss this with @rushirajnenuji and @csjx because what we need here is remarkably similar to what we are doing in MDC with stats on dataset usage and citation. Especially regarding aggregating statistics.
A current quality report has this structure:
Solr doesn't handle hierarchical data like this (repeating "" elements) indexed in a single document, so I'm proposing that the Solr index be composed of two collections: "reports" and "checks".
The reports collection would contain these fields (note that new fields have been added that are not present in the current quality report). Some of the fields would be values calculated from the quality report:
The section ("id", "name", "status", ...) plus:
The checks collection would contain all the current fields from the
which I believe are the fields that uniquely identify a quality report from each of the collections.
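To make the two-collection proposal concrete, here is a minimal sketch of splitting one nested quality report into a single "reports" document plus one "checks" document per check. The report is assumed to be already parsed into a dict. The "checks" key, the "check_" field prefix, and the choice of metadataId and suiteId as the linking fields are all assumptions for illustration, not the actual quality report schema.

```python
from typing import Any

# Assumed report-level fields, borrowed from the example Solr doc above.
REPORT_KEYS = ("metadataId", "runId", "suiteId", "timestamp")
# Assumed fields that tie each check document back to its report.
LINK_KEYS = ("metadataId", "suiteId")


def split_report(report: dict[str, Any]) -> tuple[dict[str, Any], list[dict[str, Any]]]:
    """Flatten one nested report into a 'reports' doc and a list of 'checks' docs."""
    report_doc = {key: report[key] for key in REPORT_KEYS if key in report}
    link = {key: report[key] for key in LINK_KEYS if key in report}

    check_docs = []
    # "checks" is a hypothetical key holding the repeating check elements.
    for check in report.get("checks", []):
        doc = dict(link)
        # Prefix check fields so they cannot clash with report-level names.
        doc.update({f"check_{name}": value for name, value in check.items()})
        check_docs.append(doc)
    return report_doc, check_docs
```

Each returned dict is flat, so it could be sent to its collection as an ordinary Solr document. Calculated fields such as the score totals would be added to the reports document in the same step.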
Solr supports joins (they work like an inner join), but only matching documents from one collection are returned; the result set is not a merge of the two collections. Because of this, if a client wanted to retrieve all info for a quality report, they would have to query the reports collection (1 doc returned), then the checks collection (multiple docs returned).
Here is a current quality report for reference:
We also have to consider the update strategy for quality reports. When a report is inserted into the index for a metadata pid and suite id for which there is already an entry, do we just replace it?
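If we go with "just replace", one option is to lean on Solr's behavior of overwriting an existing document when a new one arrives with the same uniqueKey value. The sketch below derives a stable id from the metadata pid and suite id. It assumes the reports collection uses that value as its uniqueKey, and the function name is hypothetical.

```python
import hashlib


def report_doc_id(metadata_id: str, suite_id: str) -> str:
    """Derive a stable document id from the metadata pid and suite id."""
    # The length prefix keeps pairs like ("a|b", "c") and ("a", "b|c") distinct.
    key = f"{len(metadata_id)}:{metadata_id}|{suite_id}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```

With this id, re-running a suite on the same pid would replace the earlier report document instead of adding a second one. The matching checks documents would still need to be deleted or overwritten separately.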
Thoughts?