DataONEorg / d1_cn_index_processor

The CN index processor component

Support for schema.org/Dataset with multiple `description` entries #22

Closed datadavev closed 3 years ago

datadavev commented 3 years ago

Some schema.org JSON-LD Dataset documents provide a list of values for the Dataset `description` property. For example (truncated for brevity):

    "description": [
      "The relationship between CO2 flow from soil and soil CO2 concentration was ... ",
      "<div class=\"o-metadata__file-usage-entry\"><h4 class=\"o-heading__level3-file-title\">field_data_flow_concentration</h4><div class=\"o-metadata__file-description\">Table describes values of soil CO2 concentration ..."
    ],

From: https://so.test.dataone.org/mnTestDRYAD/v2/object/sha256:7f5d0aab7e3025626b5bb869b6ac51203327f17a24d53073234ca42a4bca7fe3

The indexer should:

  1. Inject "@container":"@list" into the context for description, as is done for identifier and creator.
  2. Treat description as an ordered list.
  3. Concatenate the values from the list, delimited by \n.

If concatenation raises issues, defer it to a later release and use only the first value from the list. In that case, create a new issue documenting the need to support concatenation.
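The three steps and the first-value fallback can be sketched in a few lines. This is only an illustration in plain Python operating on the parsed JSON, not the indexer's actual code. The function names (`inject_list_container`, `description_values`, `abstract_text`), the sample document, and the `http://schema.org/description` IRI are all assumptions made for the example.

```python
import copy

# Assumed IRI for the schema.org description property (hypothetical choice of
# the http:// namespace; the indexer may normalize namespaces differently).
SO_DESCRIPTION = "http://schema.org/description"


def inject_list_container(doc, term="description", iri=SO_DESCRIPTION):
    """Return a copy of doc whose @context marks `term` as an ordered list."""
    out = copy.deepcopy(doc)
    ctx = out.get("@context", [])
    if not isinstance(ctx, list):
        ctx = [ctx]  # a context may be a single string or object
    # A later context entry overrides the earlier definition of the term.
    ctx.append({term: {"@id": iri, "@container": "@list"}})
    out["@context"] = ctx
    return out


def description_values(doc):
    """Normalize description to a list of strings, preserving document order."""
    value = doc.get("description")
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return [v for v in value if isinstance(v, str)]


def abstract_text(doc, concatenate=True):
    """Join all descriptions with a newline, or fall back to the first one."""
    values = description_values(doc)
    if not values:
        return None
    return "\n".join(values) if concatenate else values[0]


# Hypothetical sample document, not taken from the issue.
example = {
    "@context": "http://schema.org/",
    "@type": "Dataset",
    "description": ["First paragraph.", "Second paragraph."],
}
with_list = inject_list_container(example)
joined = abstract_text(with_list)
first_only = abstract_text(with_list, concatenate=False)
```

Here `joined` holds both entries separated by a newline, while `first_only` corresponds to the deferred-concatenation fallback described above.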

gothub commented 3 years ago

The query that populates abstract has been updated to treat multiple SO:description entries as a list, so that the first one in the document can be retrieved. With the modifications in commit https://github.com/DataONEorg/d1_cn_index_processor/commit/e3a4900fbc7a99e5265b20a6228d82ca1b310cb9, only the first SO:description is retrieved. A method for concatenating multiple entries into the single return value, using Jena list processing and SPARQL string functions, was not readily apparent in time for this release.
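For context on why list treatment makes "the first one in the document" well defined: with "@container":"@list", JSON-LD values become an RDF collection, a chain of rdf:first / rdf:rest nodes ending in rdf:nil, so document order survives the conversion to triples. The sketch below walks such a chain in plain Python. It does not use Jena or SPARQL and is not the query from the commit; the triples, blank node labels, and the `list_items` helper are hypothetical.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDF_FIRST = RDF + "first"
RDF_REST = RDF + "rest"
RDF_NIL = RDF + "nil"
SO_DESCRIPTION = "http://schema.org/description"  # assumed namespace IRI


def list_items(triples, head):
    """Collect the members of an RDF collection in order, starting at head."""
    index = {(s, p): o for s, p, o in triples}
    items = []
    seen = set()  # guards against a malformed, cyclic list
    node = head
    while node != RDF_NIL and node not in seen:
        seen.add(node)
        items.append(index[(node, RDF_FIRST)])
        node = index[(node, RDF_REST)]
    return items


# Hypothetical triples for a Dataset with two ordered descriptions.
triples = [
    ("_:dataset", SO_DESCRIPTION, "_:b0"),
    ("_:b0", RDF_FIRST, "Abstract text."),
    ("_:b0", RDF_REST, "_:b1"),
    ("_:b1", RDF_FIRST, "File usage notes."),
    ("_:b1", RDF_REST, RDF_NIL),
]

head = next(o for s, p, o in triples if p == SO_DESCRIPTION)
descriptions = list_items(triples, head)
first_description = descriptions[0]  # what the current release indexes
concatenated = "\n".join(descriptions)  # the deferred behavior
```

Retrieving only the first entry needs just the rdf:first of the head node, which is why it was the simpler option for this release. Concatenation requires traversing the whole chain in order.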