DataONEorg / d1_cn_index_processor

The CN index processor component

Index JSON-LD documents containing SO:Dataset descriptions #3

Closed gothub closed 3 years ago

gothub commented 3 years ago

DataONE CN indexing will support indexing of schema.org records that contain schema:Dataset descriptions, as recommended in the Google Search Guide for Datasets. Additional recommendations are included from the ESIP Federation schema.org cluster in their "Science on schema.org" Dataset guide.

Documents will be harvested from participating repositories to a special DataONE SlenderNode. The Dataset descriptions are harvested from repository dataset landing pages by extracting the JSON-LD text from an HTML <script> element.
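As a rough illustration of the extraction step, here is a minimal Python sketch using only the standard library. It assumes the landing page embeds the description in a `<script type="application/ld+json">` element. The class and function names (`JsonLdExtractor`, `extract_jsonld`) and the sample page are made up for this example and are not part of the DataONE harvester.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed content of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._chunks)))
            self._in_jsonld = False


def extract_jsonld(html):
    """Return a list of JSON-LD documents found in the given HTML text."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents


# Hypothetical landing page, for illustration only.
landing_page = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Dataset", "name": "Example dataset"}
</script>
</head><body>...</body></html>
"""

docs = extract_jsonld(landing_page)
```

A real harvester would also need to handle malformed JSON and pages with several JSON-LD blocks, only some of which describe a Dataset.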

gothub commented 3 years ago

@mbjones @datadavev @taojing2002

The RdfXmlSubprocessor can be used to index JSON-LD after the source documents have been parsed into an RDF model by Apache Jena (Java).

In preparing for this, I've manually written and tested SPARQL queries that will go into the Spring application-context file used by the subprocessor. One problem with this approach is that the order of the properties in the document is not preserved when retrieving them via a SPARQL query.

When retrieving SO:creator properties that will be used to populate the Solr author field, there doesn't appear to be a way to retrieve the one that appears first in the document. So the convention of placing the primary creator first in the document can't be honored when populating the author field.

How important is this? Is there another approach that we should use?

Note that there is no sub-property of SO:creator that could be used to more specifically denote role (e.g. "primary", "contributor", ...).

mbjones commented 3 years ago

It's critical to preserve creator ordering. Within JSON-LD, that can be done with @list keywords: https://json-ld.org/spec/latest/json-ld/#sets-and-lists

Whatever we do within the RDF parser, it should preserve order if a @list is used. If that can't be done, it may not be appropriate to use RDF to do the parsing and extraction.

datadavev commented 3 years ago

JSON-LD is an RDF serialization. If ordering within an array of elements is to be preserved, then it must be constructed as an ordered list. In JSON-LD this is achieved with the @list keyword. Since other JSON-LD tools may be in the workflow between source and consumer, this construct must be done at the source.
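To illustrate the difference in RDF terms: a plain JSON array for creator becomes independent creator triples with no defined order, while @list becomes an rdf:first/rdf:rest chain that can be walked in order. Below is a minimal Python sketch in which hand-written triples stand in for parser output. The blank node labels and the `walk_list` helper are invented for illustration, and this is not a description of what Jena or the subprocessor does internally.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FIRST, REST, NIL = RDF + "first", RDF + "rest", RDF + "nil"
SO_CREATOR = "http://schema.org/creator"

# "creator": [ada, bo]
# Two independent triples; a set of triples carries no ordering.
plain_array = {
    ("_:dataset", SO_CREATOR, "_:ada"),
    ("_:dataset", SO_CREATOR, "_:bo"),
}

# "creator": {"@list": [ada, bo]}
# One creator triple pointing at the head of a linked list.
as_list = {
    ("_:dataset", SO_CREATOR, "_:l1"),
    ("_:l1", FIRST, "_:ada"),
    ("_:l1", REST, "_:l2"),
    ("_:l2", FIRST, "_:bo"),
    ("_:l2", REST, NIL),
}


def walk_list(triples, head):
    """Follow rdf:first / rdf:rest from head and return the items in order."""
    index = {(s, p): o for s, p, o in triples}
    items = []
    node = head
    while node != NIL:
        items.append(index[(node, FIRST)])
        node = index[(node, REST)]
    return items


head = next(o for s, p, o in as_list if p == SO_CREATOR)
creators = walk_list(as_list, head)  # ["_:ada", "_:bo"]
```

With `plain_array` there is nothing comparable to walk: both creators hang directly off the dataset node, so any consumer sees them in arbitrary order.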

See also: ESIPFed/science-on-schema.org#135

gothub commented 3 years ago

d1_cn_index_processor v2.3.13 supports indexing schema.org Dataset descriptions from JSON-LD documents.

Note that these Solr fields process the corresponding SO properties as lists to ensure that ordering of elements from the source document is preserved:

| JSON-LD property   | Solr field      | Multivalued in Solr? |
|--------------------|-----------------|----------------------|
| creator            | author          | N                    |
| creator            | origin          | Y                    |
| creator.givenName  | authorGivenName | N                    |
| creator.familyName | authorLastName  | N                    |
| creator.familyName | investigator    | Y                    |
| description        | abstract        | N                    |
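For readers who want to see the mapping in action, here is a hedged Python sketch. It rests on several assumptions that the thread does not confirm: that single-valued fields take their value from the first creator in the @list, that author and origin use the creator's `name`, and that the function name `solr_fields` and the sample record are purely illustrative. The actual processor does this through SPARQL queries in its Spring configuration, not through code like this.

```python
def solr_fields(dataset):
    """Map a JSON-LD Dataset (creator given as an @list) to Solr field values.

    Assumption: single-valued fields are filled from the first creator.
    """
    creator = dataset.get("creator")
    creators = creator.get("@list", []) if isinstance(creator, dict) else []
    first = creators[0] if creators else {}

    fields = {
        "author": first.get("name"),
        "origin": [c["name"] for c in creators if "name" in c],
        "authorGivenName": first.get("givenName"),
        "authorLastName": first.get("familyName"),
        "investigator": [c["familyName"] for c in creators if "familyName" in c],
        "abstract": dataset.get("description"),
    }
    # Drop fields that have no value at all.
    return {k: v for k, v in fields.items() if v not in (None, [])}


# Hypothetical record, for illustration only.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "description": "An example.",
    "creator": {
        "@list": [
            {"@type": "Person", "name": "Ada Primary",
             "givenName": "Ada", "familyName": "Primary"},
            {"@type": "Person", "name": "Bo Second",
             "givenName": "Bo", "familyName": "Second"},
        ]
    },
}

fields = solr_fields(dataset)
```

Here the multivalued fields (origin, investigator) keep every creator in list order, while the single-valued fields reflect only the primary creator.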