Quality Solr index configuration

gothub commented 6 years ago

A current quality report has this structure:

<ns0:run xmlns:ns0="https://nceas.ucsb.edu/mdqe/v1">
   <id>eb95ea8d-9238-4e4f-ac99-a77c9bca6635</id>
   <timestamp>2018-03-16T15:07:12.177-07:00</timestamp>
   <suiteId>arctic.data.center.suite.1</suiteId>
   <result>
        <check>
             <id>check.nsf.award.numbers.present.1</id>
             <name>award numbers</name>
             <status>SUCCESS</status>
             ...
        </check>
   </result>
   <result>
        <check>
        ...
        </check>
    </result>
    ...
</run>

Solr doesn't handle hierarchical data like this (repeating <result> elements) well when it is indexed in a single document, so I'm proposing that the Solr index be composed of two collections: "reports" and "checks".

The reports collection would contain these fields (note that new fields have been added that are not present in the current quality report). Some of the fields would be values calculated from the quality report:

metadataId (pid)
timestamp
suiteId
score
passedCount
warningCount
failedCount
informationCount
identificationPercent
discoveryPercent
interpretationPercent
others?
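
As a rough illustration of how the calculated fields could be derived, here is a minimal sketch that parses a report shaped like the one above and counts check statuses. The function name, the `FAILURE` status value, the second check, and the `example-pid` identifier are assumptions made for the example; only `SUCCESS` appears in the report shown in this issue.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Trimmed report in the structure shown above. The second check and its
# FAILURE status are hypothetical, added only to make the counts non-trivial.
SAMPLE_REPORT = """<ns0:run xmlns:ns0="https://nceas.ucsb.edu/mdqe/v1">
   <id>eb95ea8d-9238-4e4f-ac99-a77c9bca6635</id>
   <timestamp>2018-03-16T15:07:12.177-07:00</timestamp>
   <suiteId>arctic.data.center.suite.1</suiteId>
   <result><check>
      <id>check.nsf.award.numbers.present.1</id>
      <name>award numbers</name>
      <status>SUCCESS</status>
   </check></result>
   <result><check>
      <id>check.hypothetical.2</id>
      <name>hypothetical check</name>
      <status>FAILURE</status>
   </check></result>
</ns0:run>"""


def build_report_doc(report_xml, metadata_id):
    """Flatten one quality report into a single 'reports' document."""
    root = ET.fromstring(report_xml)
    # Only the root element is namespaced; its children are unqualified.
    statuses = Counter(s.text for s in root.iterfind("result/check/status"))
    return {
        "metadataId": metadata_id,  # the pid is not in the report itself
        "timestamp": root.findtext("timestamp"),
        "suiteId": root.findtext("suiteId"),
        "passedCount": statuses["SUCCESS"],
        "failedCount": statuses["FAILURE"],  # assumed status name
    }


report_doc = build_report_doc(SAMPLE_REPORT, "example-pid")
print(report_doc)
```

The remaining fields (score, the other counts, the percentages) would need the real status values and check categories, which this sketch does not try to guess.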

The checks collection would contain all the current fields from the <check> section ("id", "name", "status", ...) plus:

metadataId (to associate entries with the 'reports' collection, like a foreign key)
suiteId (to associate entries with the 'reports' collection)

which I believe are the fields that uniquely identify a quality report in each of the collections.
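
To make the "foreign key" idea concrete, here is a minimal sketch that turns each check in a report into its own flat document and stamps it with `metadataId` and `suiteId`. The function name, the second check, and the `example-pid` value are hypothetical; the element names come from the report structure above.

```python
import xml.etree.ElementTree as ET

# Trimmed report; the second check is hypothetical.
SAMPLE_REPORT = """<ns0:run xmlns:ns0="https://nceas.ucsb.edu/mdqe/v1">
   <suiteId>arctic.data.center.suite.1</suiteId>
   <result><check>
      <id>check.nsf.award.numbers.present.1</id>
      <name>award numbers</name>
      <status>SUCCESS</status>
   </check></result>
   <result><check>
      <id>check.hypothetical.2</id>
      <name>hypothetical check</name>
      <status>SUCCESS</status>
   </check></result>
</ns0:run>"""


def build_check_docs(report_xml, metadata_id):
    """Return one flat 'checks' document per check element in the report."""
    root = ET.fromstring(report_xml)
    suite_id = root.findtext("suiteId")
    docs = []
    for check in root.iterfind("result/check"):
        # Copy every simple child element ("id", "name", "status", ...).
        doc = {child.tag: (child.text or "").strip() for child in check}
        # Add the two fields that tie the check back to its report.
        doc["metadataId"] = metadata_id
        doc["suiteId"] = suite_id
        docs.append(doc)
    return docs


check_docs = build_check_docs(SAMPLE_REPORT, "example-pid")
print(len(check_docs), check_docs[0])
```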

Solr supports joins (they work like an inner join), but only matching documents from one collection are returned; the result set is not a merge of the two collections. Because of this, a client that wanted to retrieve all the info for a quality report would have to query the reports collection (1 doc returned), then the checks collection (multiple docs returned).
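
The two-step retrieval could look roughly like the sketch below, which only builds the two select URLs and does not contact a server. The base URL, the helper names, the `rows` limits, and the `pid-1` identifier are assumptions; the collection and field names are the ones proposed above.

```python
from urllib.parse import urlencode

# Assumed location of a local test server; adjust as needed.
SOLR_BASE = "http://localhost:8983/solr"


def select_url(collection, query, rows):
    """Build a URL for the standard /select handler of one collection."""
    return f"{SOLR_BASE}/{collection}/select?" + urlencode({"q": query, "rows": rows})


def report_queries(metadata_id, suite_id):
    """Return (reports_url, checks_url) for one quality report."""
    query = f'metadataId:"{metadata_id}" AND suiteId:"{suite_id}"'
    # First query: the single summary document for the report.
    reports_url = select_url("reports", query, rows=1)
    # Second query: every check that belongs to that report.
    checks_url = select_url("checks", query, rows=1000)
    return reports_url, checks_url


reports_url, checks_url = report_queries("pid-1", "arctic.data.center.suite.1")
print(reports_url)
print(checks_url)
```

The same filter is used for both collections, which is why `metadataId` and `suiteId` need to be present in each.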

Here is a current quality report for reference:

[Screenshot of a current quality report, taken 2018-03-16]

We also have to consider the update strategy for quality reports. When a report is inserted into the index for a metadata pid and suite id for which there is already an entry, do we just replace it?
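
If replacing turns out to be the answer, one possible approach relies on the fact that Solr overwrites an existing document when a new one arrives with the same uniqueKey value. The sketch below derives a deterministic key from the metadata pid and suite id and uses a plain dictionary as a stand-in for the collection. The `id` key field, the separator, the class, and the sample values are all hypothetical.

```python
def report_key(metadata_id, suite_id):
    """Deterministic key: the same pid and suite always map to the same doc."""
    return f"{metadata_id}|{suite_id}"


class FakeCollection:
    """In-memory stand-in for a Solr collection keyed on a uniqueKey field."""

    def __init__(self):
        self.docs = {}

    def add(self, doc):
        # Mirrors overwrite-on-same-key behaviour: the last write wins.
        self.docs[doc["id"]] = doc


reports = FakeCollection()
for score in (0.5, 0.75):
    reports.add({
        "id": report_key("example-pid", "arctic.data.center.suite.1"),
        "metadataId": "example-pid",
        "suiteId": "arctic.data.center.suite.1",
        "score": score,
    })

# Only the most recent report for this pid and suite remains.
print(len(reports.docs), list(reports.docs.values())[0]["score"])
```

With a split into two collections, the old check documents for the same pid and suite would also have to be removed or overwritten, which this sketch does not cover.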

Thoughts?

gothub commented 6 years ago

The Worker class now has the method indexReport(), which uses a refactored version of the metacat-index component (refactored/extended from the DataONE d1_cn_index_processor component). This refactored component operates in the same manner as the original, using Spring application config files to easily add or modify conversion of input files (XML, JSON) into Solr documents.

Here is an example Solr doc from a local Solr 7.3 server that is being used for testing:

      {
        "metadataId":"9CD87ED9-A531-419B-B24E-30CAE834EF72",
        "formatId":"https://nceas.ucsb.edu/mdqe/v1",
        "runId":"7a542172-b24b-4efc-8cc9-e77912e8e259",
        "suiteId":"arctic.data.center.suite.1",
        "timestamp":"2018-04-19T00:58:17.729Z",
        "metadata_formatId":"eml://ecoinformatics.org/eml-2.1.1",
        "metadata_datasource":"urn:node:mnTestKNB",
        "metadata_funder":["NSF Award 1635550"],
        "metadata_rightsHolder":"http://orcid.org/0000-0002-2192-403X",
        "metadata_group":["CN=knb-data-admins,DC=dataone,DC=org",
          "CN=arctic-data-admins,DC=dataone,DC=org",
          "CN=SASAP,DC=dataone,DC=org"],
        "score_green":18,
        "score_orange":2,
        "score_red":2,
        "score_blue":10,
        "score_total":32,
        "score_identification":0.8823529411764706,
        "score_interpretation":0.7272727272727273,
        "score_discovery":0.75,
        "score_composite":0.8181818181818182,
        "_version_":1598134174481383424,
        "score_other":"NaN"
      }

It would be very useful to discuss what fields are needed from the metadata sysmeta and any other fields that are needed.
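
As a side note on the score fields in the example doc above: the sample values are consistent with `score_total` being the sum of the four colour counts and `score_composite` being green divided by green plus orange plus red (blue excluded). That is an inference from this one document, not something confirmed in this thread, so treat the sketch below as a sanity check on the numbers rather than a description of the engine.

```python
# Counts copied from the example Solr doc above.
counts = {"score_green": 18, "score_orange": 2, "score_red": 2, "score_blue": 10}

# Assumed: blue checks do not take part in the composite score.
graded = counts["score_green"] + counts["score_orange"] + counts["score_red"]

total = graded + counts["score_blue"]
composite = counts["score_green"] / graded

print(total)      # compare with score_total in the doc
print(composite)  # compare with score_composite in the doc
```

A category with no graded checks would divide zero by zero under this formula, which may be where the "NaN" in score_other comes from.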

mbjones commented 6 years ago

OK, yeah, let's discuss. I think we should discuss this with @rushirajnenuji and @csjx because what we need here is remarkably similar to what we are doing in MDC with stats on dataset usage and citation. Especially regarding aggregating statistics.

NCEAS / metadig-engine

Quality Solr index configuration #119