NCEAS / metacat

Data repository software that helps researchers preserve, share, and discover data
https://knb.ecoinformatics.org/software/metacat
GNU General Public License v2.0

schema.org indexing recognizes 'https://schema.org' and not 'http://schema.org' #1510

Closed gothub closed 3 years ago

gothub commented 3 years ago

When manually uploading a schema.org document with the JSON-LD context set to

    "@context": {
      "@vocab": "http://schema.org/"
    },

none of the SO:Dataset fields are indexed to Solr. This happens because when metacat-index serializes the document to RDF/XML, all SO predicates are written in the namespace given by that context, for example:

    <https://dataone.org/datasets/doi%3A10.18739%2FA2JQ0SW4G> <http://schema.org/datePublished> "2021-01-01T00:00:00Z" .

The SPARQL queries that are used to extract info from the document all use the 'https://schema.org' namespace.
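To make the mismatch concrete, here is a minimal sketch in plain Python. It is not metacat-index code: the regex is only a stand-in for a real RDF parser, and `literal_for` is a hypothetical helper that mimics a query binding on an exact predicate IRI.

```python
import re

# The triple from above, as emitted for a context of "http://schema.org/".
triple = (
    '<https://dataone.org/datasets/doi%3A10.18739%2FA2JQ0SW4G> '
    '<http://schema.org/datePublished> "2021-01-01T00:00:00Z" .'
)

# Minimal pattern for '<subject> <predicate> "literal" .'
# (an assumption for illustration, not a full N-Triples parser).
TRIPLE_RE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+"([^"]*)"\s*\.$')


def literal_for(predicate, line):
    """Return the literal object if the line's predicate equals `predicate`, else None."""
    m = TRIPLE_RE.match(line)
    if m and m.group(2) == predicate:
        return m.group(3)
    return None


# A query written against the https namespace finds nothing ...
https_hit = literal_for("https://schema.org/datePublished", triple)
# ... while the same query against the http namespace matches.
http_hit = literal_for("http://schema.org/datePublished", triple)
print(https_hit, http_hit)
```

Predicate IRIs are compared as exact strings, so the two schemes are different predicates as far as the query is concerned.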

Do we need to support both "http://schema.org" and "https://schema.org"? It looks like the transition from http to https may linger for a long time; see, for example, https://schema.org/docs/faq.html#19

Note that the slender node implementation converts harvested documents from "http://schema.org" to "https://schema.org"
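As a rough illustration of that kind of conversion, the sketch below rewrites schema.org IRIs in a JSON-LD `@context` from http to https before indexing. This is an assumption about how such a step could look, not the slender node's actual code; `normalize_schema_org_context` and `_rewrite` are hypothetical names, and only the `@context` is touched, not full IRIs elsewhere in the document body.

```python
import json

HTTP_NS = "http://schema.org"
HTTPS_NS = "https://schema.org"


def _rewrite(value):
    """Recursively rewrite http://schema.org prefixes inside a context value."""
    if isinstance(value, str):
        # Match the bare namespace or anything under it, but not e.g.
        # "http://schema.org.example".
        if value == HTTP_NS or value.startswith(HTTP_NS + "/"):
            return HTTPS_NS + value[len(HTTP_NS):]
        return value
    if isinstance(value, list):
        return [_rewrite(v) for v in value]
    if isinstance(value, dict):
        return {k: _rewrite(v) for k, v in value.items()}
    return value


def normalize_schema_org_context(doc):
    """Return a copy of a JSON-LD document whose context uses https://schema.org."""
    out = dict(doc)
    if "@context" in out:
        out["@context"] = _rewrite(out["@context"])
    return out


doc = json.loads('{"@context": {"@vocab": "http://schema.org/"}, "@type": "Dataset"}')
normalized = normalize_schema_org_context(doc)
print(json.dumps(normalized))
```

With the context normalized this way, the serialized predicates would fall under the 'https://schema.org' namespace that the existing SPARQL queries expect.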

If we do support both, then which of the following should be used to implement:

Here are the indexing results for the test docs:

gothub commented 3 years ago

Superseded by https://github.com/DataONEorg/d1_cn_index_processor/issues/19