Closed gothub closed 3 years ago
@mbjones @datadavev @taojing2002
The RdfXmlSubprocessor can be used to index JSON-LD once the source documents have been parsed into an RDF model by Apache Jena (Java).
In preparation, I've manually written and tested the SPARQL queries that will go into the Spring application-context file used by the subprocessor. One problem with this approach is that the order of the properties in the document is not preserved when they are retrieved via a SPARQL query.
When retrieving the `SO:creator` properties that will be used to populate the Solr `author` field, there doesn't appear to be a way to retrieve the one that comes first in the document. So the convention of placing the primary creator first in the document can't be honored when populating the `author` field.
How important is this? Is there another approach that we should use?
Note that there is no sub-property of `SO:creator` that could be used to denote a role more specifically (e.g. "primary", "contributor", ...).
It's critical to preserve creator ordering. Within JSON-LD, that can be done with the `@list` keyword: https://json-ld.org/spec/latest/json-ld/#sets-and-lists
Whatever we do within the RDF parser, it should preserve order if a `@list` is used. If that can't be done, it may not be appropriate to use RDF to do the parsing and extraction.
JSON-LD is an RDF serialization. If ordering within an array of elements is to be preserved, then the array must be constructed as an ordered list. In JSON-LD this is achieved with the `@list` keyword. Since other JSON-LD tools may sit in the workflow between source and consumer, the list must be constructed at the source.
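To make the `@list` construct concrete, here is a minimal Python sketch (standard library only). The document, the `@vocab` context, the creator names, and the helper `ordered_creators` are all made up for illustration and are not taken from the indexer; the sketch only shows how a consumer could tell an explicitly ordered creator list apart from a plain array, which carries no ordering guarantee in the RDF data model.

```python
import json

# Hypothetical schema.org Dataset description; all values are illustrative.
DOC = json.loads("""
{
  "@context": {"@vocab": "https://schema.org/"},
  "@type": "Dataset",
  "name": "Example dataset",
  "creator": {
    "@list": [
      {"@type": "Person", "name": "Primary Author"},
      {"@type": "Person", "name": "Second Author"}
    ]
  }
}
""")


def ordered_creators(doc):
    """Return (creators, is_ordered) for a compacted JSON-LD Dataset."""
    creator = doc.get("creator", [])
    if isinstance(creator, dict) and "@list" in creator:
        # An explicit @list becomes an RDF list, so order is part of the data.
        return list(creator["@list"]), True
    # A bare value or plain array is an unordered set in the RDF model.
    if not isinstance(creator, list):
        creator = [creator]
    return creator, False


creators, is_ordered = ordered_creators(DOC)
print([c["name"] for c in creators], is_ordered)
```

If the source had written `"creator": [ ... ]` without `@list`, the helper would still return the values, but `is_ordered` would be `False`, which is the case the original question ran into.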
See also: ESIPFed/science-on-schema.org#135
d1_cn_index_processor v2.3.13 supports indexing schema.org Dataset descriptions from JSON-LD documents.
Note that these Solr fields process the corresponding SO properties as lists to ensure that ordering of elements from the source document is preserved:

| JSON-LD property | Solr field | Multivalued field in Solr? |
|---|---|---|
| creator | author | N |
| creator | origin | Y |
| creator.givenName | authorGivenName | N |
| creator.familyName | authorLastName | N |
| creator.familyName | investigator | Y |
| description | abstract | N |
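As a rough illustration of the mapping in the table, the Python sketch below turns an ordered creator list into the listed Solr fields. It is not the d1_cn_index_processor implementation: the function name `to_solr_fields` and the sample creators are hypothetical, and it assumes that each single-valued field takes its value from the first creator in the list while each multivalued field receives one value per creator in document order.

```python
def to_solr_fields(creators, description):
    """Map an ordered creator list to the Solr fields named in the table."""
    first = creators[0] if creators else {}
    return {
        # Single-valued fields: assumed to come from the first creator.
        "author": first.get("name"),
        "authorGivenName": first.get("givenName"),
        "authorLastName": first.get("familyName"),
        # Multivalued fields: one value per creator, in list order.
        "origin": [c["name"] for c in creators if "name" in c],
        "investigator": [c["familyName"] for c in creators if "familyName" in c],
        "abstract": description,
    }


# Hypothetical creators, already in the order given by the source @list.
CREATORS = [
    {"name": "Ada Primary", "givenName": "Ada", "familyName": "Primary"},
    {"name": "Ben Second", "givenName": "Ben", "familyName": "Second"},
]

fields = to_solr_fields(CREATORS, "An example abstract.")
print(fields["author"], fields["origin"])
```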
DataONE CN indexing will support schema.org records that contain schema:Dataset descriptions as recommended in the Google Search Guide for DataSets. Additional recommendations are included from the ESIP Federation schema.org cluster in their "Science on schema.org" Dataset guide.
Documents will be harvested to a special DataONE SlenderNode from participating repositories. The Dataset descriptions are harvested from repository dataset landing pages by extracting the JSON-LD text from an HTML <script> element.
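The extraction step can be sketched with Python's standard library. This is not the harvester's code: the class name `JsonLdExtractor` and the sample landing page are hypothetical, and the sketch assumes the JSON-LD is embedded in a `<script>` element with `type="application/ld+json"`, the usual convention for embedding JSON-LD in HTML.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of JSON-LD <script> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        # Assumption: JSON-LD blocks are marked with this media type.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


# Hypothetical landing page, for illustration only.
LANDING_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Dataset", "name": "Example dataset"}
</script>
</head><body>Landing page</body></html>
"""

parser = JsonLdExtractor()
parser.feed(LANDING_PAGE)
parser.close()
print(parser.documents[0]["@type"])
```

A real harvester would also need to handle pages with several JSON-LD blocks or malformed JSON; this sketch simply collects every block it finds and lets `json.loads` raise on bad input.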