DataONEorg / d1_cn_index_processor

The CN index processor component

schema.org indexing doesn't process creator without context declaration #16

Closed gothub closed 3 years ago

gothub commented 3 years ago

The slender node processing for schema.org documents inserts a property ("creator") in the @context section of a harvested document to allow the 'creator' properties to be processed correctly as a list. Here is an excerpt from a properly prepared document:

{
    "@context": {
        "@vocab": "https://schema.org/",
        "creator":{
            "@container":"@list",
            "@id":"https://schema.org/creator"
        }
    },
...

During some manual testing, I inadvertently uploaded documents that don't have this fixed-up @context, and saw that these Solr fields don't get populated as a result: "author", "origin".

@mbjones @datadavev @taojing2002 Given this behaviour, should Metacat fix up schema.org documents to contain this section if it hasn't been included? This could be the case if documents are added directly via R client -> Metacat and not via a slender node.

Note that the "creator" fixup is needed so that RDF/XML serialization of the original JSON-LD document and SPARQL query processing can extract creators correctly, as the first creator in the list is extracted as the 'author' field.

mbjones commented 3 years ago

Good point @gothub . My take is that the lack of the @list indicator is not technically an error, but that order is indeterminate without it. I suspect many providers will not initially provide it, and so we need to be able to deal with that. So, I think we should:

datadavev commented 3 years ago

Without the @list the indexing will select a random creator as the author. Since the first creator has significance in the DataONE indexer, that value should be extracted in a deterministic manner. This means that the indexer must treat creator as an ordered list. This can be forced at the point of conversion to RDF by the indexer (i.e. by adjusting the JSON-LD context or creator element) or at the point of capture (i.e. on the member node). I chose the latter for the slender node implementation.
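To illustrate the "adjusting the JSON-LD context" option, here is a minimal Python sketch of such a fixup. It is not the slender node or indexer code: the function name `ensure_creator_list` is hypothetical, and the handling of string, array, and missing @context values is an assumption about what a fixup might need to cover. The inserted term definition matches the excerpt quoted at the top of this issue.

```python
import copy

SCHEMA_ORG = "https://schema.org/"
# Term definition taken from the excerpt in this issue.
CREATOR_TERM = {"@container": "@list", "@id": SCHEMA_ORG + "creator"}


def ensure_creator_list(doc):
    """Return a copy of doc whose @context declares creator as an ordered @list.

    Hypothetical helper: an existing "creator" term definition is left untouched.
    """
    fixed = copy.deepcopy(doc)
    context = fixed.get("@context")

    if isinstance(context, dict):
        # Object context: add the term only if the document does not define it.
        context.setdefault("creator", dict(CREATOR_TERM))
    elif isinstance(context, list):
        # Array context: append the term unless some entry already defines it.
        if not any(isinstance(c, dict) and "creator" in c for c in context):
            context.append({"creator": dict(CREATOR_TERM)})
    elif isinstance(context, str):
        # Context given as a URL string: keep it and add the term after it.
        fixed["@context"] = [context, {"creator": dict(CREATOR_TERM)}]
    else:
        # No @context at all: assuming the schema.org vocabulary here.
        fixed["@context"] = {"@vocab": SCHEMA_ORG, "creator": dict(CREATOR_TERM)}
    return fixed
```

Calling `ensure_creator_list(doc)` on a harvested document before RDF conversion leaves the input unchanged and returns a copy whose creators should be read as an ordered list. Whether to override a creator term that a provider defined without @list is a policy choice this sketch does not make.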

gothub commented 3 years ago

At the NCEAS/DataONE weekly development meeting, we discussed that content generated by NCEAS/DataONE will have the @context object modified so that creator is processed as a list, i.e.

  "@context": {
    "@vocab": "https://schema.org/",
    "creator": {
      "@container":"@list"
    }
  },

So this issue has been superseded by https://github.com/NCEAS/metacatui/issues/1753