DataONEorg / d1_cn_index_processor

The CN index processor component

schema.org indexing recognizes 'https://schema.org' and not 'http://schema.org' #19

Closed gothub closed 3 years ago

gothub commented 3 years ago

This issue was transferred from an issue in the metacat repo, as the desired solution is to change the SPARQL queries in this repo to address the problem, rather than have metacat update documents to use the 'https://schema.org' namespace.

When manually uploading a schema.org document with the JSON-LD context set to

    "@context": {
      "@vocab": "http://schema.org/"
    },

none of the SO:Dataset fields are indexed to Solr. The reason is that when metacat-index serializes the document to RDF/XML, all SO predicates are serialized with the namespace given in that context, for example:

<https://dataone.org/datasets/doi%3A10.18739%2FA2JQ0SW4G> <http://schema.org/datePublished> "2021-01-01T00:00:00Z" .

The SPARQL queries that are used to extract info from the document all use the 'https://schema.org' namespace.
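The mismatch comes down to plain string comparison of IRIs. A minimal sketch in Python (standard library only) of why the query never matches; `expand` is a hypothetical helper written for this illustration and is not part of metacat-index or Jena:

```python
# Hypothetical helper: expand a SPARQL prefixed name such as
# "SO:datePublished" into a full IRI using a prefix map.
def expand(prefixed_name, prefixes):
    prefix, local = prefixed_name.split(":", 1)
    return prefixes[prefix] + local

# Prefix as declared in the current indexing queries.
query_prefixes = {"SO": "https://schema.org/"}

# Predicate as it appears in the serialized triple above.
triple_predicate = "http://schema.org/datePublished"

query_predicate = expand("SO:datePublished", query_prefixes)

# IRIs are compared character by character, so "https" never matches "http".
matches = (query_predicate == triple_predicate)
print(query_predicate, matches)
```

Declaring the prefix as `http://schema.org/` instead would make the two IRIs identical, which is why the namespace choice in the queries matters.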

Do we need to support both "http://schema.org" and "https://schema.org"? It looks like the transition from http to https may linger for a long time, e.g. https://schema.org/docs/faq.html#19

Note that the slender node implementation converts harvested documents from "http://schema.org" to "https://schema.org"

If we do support both, then which of the following should be used to implement:

Here are the test docs indexing result:

gothub commented 3 years ago

The SPARQL queries in src/main/resources/application-context-schema-org.xml can be updated to support both namespaces "http://schema.org" and "https://schema.org".

This will significantly complicate the queries, and they will have to be fully re-tested after update.

Different methods have to be used for RDF predicates and RDF objects. For predicates, the SPARQL alternative property path operator (|) is used. For objects, the VALUES keyword is used to match multiple values.

Here is an example of an original query that matches only https://schema.org and one modified to match both namespaces.

matches only https://schema.org:

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX SO:   <https://schema.org/>

SELECT
    ( str(?name) as ?awardTitle)
WHERE {
    ?datasetId rdf:type SO:Dataset .
    ?awardId rdf:type SO:MonetaryGrant .
    ?awardId SO:fundedItem ?datasetId .
    ?awardId SO:name ?name .
}

matches both:

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX SO_NS1:   <http://schema.org/>
PREFIX SO:   <https://schema.org/>

SELECT
    ( str(?name) as ?awardTitle)
WHERE {
    VALUES ?value { SO_NS1:Dataset SO:Dataset } .
    VALUES ?value2 { SO_NS1:MonetaryGrant SO:MonetaryGrant} .
    ?datasetId rdf:type ?value .
    ?awardId rdf:type ?value2 .
    ?awardId SO:fundedItem|SO_NS1:fundedItem ?datasetId .
    ?awardId SO:name|SO_NS1:name ?name .
}
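Since every query in the context file would need the same two rewrites, the fragments could be generated rather than hand-edited. A rough sketch in Python, assuming the `SO` and `SO_NS1` prefixes from the query above; the helper names `type_values` and `predicate_path` are hypothetical and do not exist in this repo:

```python
# Prefixes declared in the dual-namespace query above.
PREFIXES = ("SO", "SO_NS1")

def type_values(var, local_name, prefixes=PREFIXES):
    """Build a VALUES clause listing a class under every namespace prefix."""
    terms = " ".join(f"{p}:{local_name}" for p in prefixes)
    return f"VALUES ?{var} {{ {terms} }} ."

def predicate_path(local_name, prefixes=PREFIXES):
    """Build an alternative property path matching a predicate under every prefix."""
    return "|".join(f"{p}:{local_name}" for p in prefixes)

print(type_values("value", "Dataset"))
print(f"?awardId {predicate_path('fundedItem')} ?datasetId .")
print(f"?awardId {predicate_path('name')} ?name .")
```

The order of the alternatives inside a VALUES block or a property path does not affect which triples match, so the generated fragments are equivalent to the hand-written ones. The generated queries would still need the full re-test mentioned above.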

mbjones commented 3 years ago

@gothub Looks interesting. Shouldn't the last line be ?awardId SO:name|SO_NS1:name ?name .?

taojing2002 commented 3 years ago

We discussed this issue at Thursday's dev meeting. I think we agreed that the indexer will only support the namespace http://schema.org, rather than https://schema.org, in the SPARQL queries.

Currently the index only supports https://schema.org, so we need to change it. Also, in the slender node part, the json-ld documents need to be modified as well. @datadavev, is this true?

@mbjones @csjx

taojing2002 commented 3 years ago

I am talking about this release.

mbjones commented 3 years ago

@taojing2002 Just to clarify, from my perspective we agreed to something different -- that SOSO should encourage use of http://schema.org/, but that DataONE and other consumers should liberally accept both that and https://schema.org/. We discussed several ways this could be accomplished, but the simplest I think is what @datadavev proposed -- for us to use a local copy of the http context file and load that whenever we encounter the https variant of the namespace. Given that the current SOSO guidance is that people should use https, I think it's a strategic mistake for us to not support the SOSO recommended namespace in addition to http.

To be clear, I can see several cases we need to deal with...

Case 1: Loading context file by URL

Here we find either "@context": "http://schema.org/" or "@context": "https://schema.org/" in the document. In both cases, the JSON-LD processor will find the context file located on schema.org and load the vocabulary defined there using the http namespace, and all will be good. No extra steps needed for us.

Case 2: @vocab with http namespace

Here we find "@context": { "@vocab": "http://schema.org/" } in the document, defining the default vocabulary prefix, and again all should be more or less fine, as unqualified keys will be interpreted to be in the http namespace.

Case 3: @vocab with https namespace

Here we find "@context": { "@vocab": "https://schema.org/" } in the document, defining the default vocabulary prefix, so now unqualified keys will be interpreted to be in the https namespace. This would be a problem that we need to deal with differently, and where one of the various solutions that we discussed during the call could be employed:

I'm not sure if I got all of the cases or got the proposed solutions right, but I thought it might be helpful to summarize them all so that we can make a clear choice that handles both namespaces properly. Please add other cases or correct me where I've gone astray. Thanks!
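The three cases can be told apart by inspecting the document's @context before it reaches the JSON-LD processor. A minimal sketch in Python, assuming only the simple context forms discussed above; `classify_context` is a hypothetical name, and arrays of contexts or other vocabularies are deliberately reported as "other":

```python
import json

HTTP_NS = "http://schema.org/"
HTTPS_NS = "https://schema.org/"

def classify_context(doc):
    """Return which of the three cases a parsed JSON-LD document falls into."""
    ctx = doc.get("@context")
    if isinstance(ctx, str):
        # Case 1: the context is a URL that the processor dereferences.
        if ctx.rstrip("/") + "/" in (HTTP_NS, HTTPS_NS):
            return "case 1: remote context"
    elif isinstance(ctx, dict):
        vocab = ctx.get("@vocab")
        if vocab == HTTP_NS:
            return "case 2: @vocab http"
        if vocab == HTTPS_NS:
            return "case 3: @vocab https"
    return "other"

doc = json.loads('{"@context": {"@vocab": "https://schema.org/"}, "name": "x"}')
print(classify_context(doc))
```

Only case 3 needs the extra handling discussed in this thread; cases 1 and 2 already yield http predicates.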

datadavev commented 3 years ago

The basic approach is:

  1. Grab a copy of the schema.org context and adjust it a bit by adding the @list keywords as needed.
  2. When loading JSON-LD the processor should retrieve the remote context. The JSON-LD processing spec identifies hooks for a processor [1] to be modified to retrieve a local copy of the context instead of one from https://schema.org/
  3. The local context is loaded and the @list keyword used by the processor

This is easy to do in python, not sure about Jena's support for the JSON-LD processing spec. Could always pre-process before handing to Jena.
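As a rough illustration of steps 2 and 3, here is a standard-library-only Python sketch of a loader that answers schema.org context requests from a local copy. It is not wired into any real JSON-LD processor; in practice a function like this would be registered as the processor's document loader. `LOCAL_CONTEXT` is a tiny stand-in rather than the real schema.org context, and `INTERCEPTED`, `fetch_remote` and `load_context` are hypothetical names:

```python
import json
from urllib.request import urlopen

# Stand-in for a locally stored, adjusted copy of the schema.org context.
# The real context is far larger; this fragment is illustrative only.
LOCAL_CONTEXT = {
    "@context": {
        "@vocab": "http://schema.org/",
        "creator": {"@id": "http://schema.org/creator", "@container": "@list"},
    }
}

# Context URLs that should be answered from the local copy (assumed set).
INTERCEPTED = {"http://schema.org/", "https://schema.org/"}

def fetch_remote(url):
    """Default behaviour: dereference the context over the network."""
    with urlopen(url) as response:
        return json.load(response)

def load_context(url, fallback=fetch_remote):
    """Return the local context for schema.org URLs, else defer to fallback."""
    if url in INTERCEPTED:
        return LOCAL_CONTEXT
    return fallback(url)

print(load_context("https://schema.org/")["@context"]["@vocab"])
```

Keeping the network fetch behind a `fallback` parameter makes the interception easy to test without touching the network.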

If the JSON-LD uses "@vocab":"https://schema.org/" then a remote context will not be loaded, so things get messy. One approach is to expand [2] the document, then compact [3] it using a context that matches the namespace of the @vocab. After compaction, the document can be handled as JSON since it will be in a familiar structure. Then in JSON, change the target of @context to point at the preferred context, and proceed with 1-3 above. The important part is the expansion and compaction steps, since otherwise the structure of the JSON may vary considerably.

1: https://www.w3.org/TR/json-ld11-api/#loaddocumentcallback
2: https://www.w3.org/TR/json-ld11-api/#expansion-algorithm
3: https://www.w3.org/TR/json-ld11-api/#compaction-algorithm

datadavev commented 3 years ago

After discussion with Jing, the following approach will be taken:

  1. Duplicate the indexing rules to support querying both http and https namespace variants.

    This will provide initial support for both variants of the namespace. It will not enforce the @list keyword for consistent ordering unless it is provided specifically in the JSON-LD document.

  2. Jena 4.1 supports replacement of context on load of JSON-LD. Verify functionality in version used by DataONE and intercept requests to any of these with a local copy of the schema.org context (these all resolve to the same document):

    http://schema.org
    http://schema.org/
    https://schema.org
    https://schema.org/
    http://schema.org/docs/jsonldcontext.jsonld
    https://schema.org/docs/jsonldcontext.jsonld

    The basic approach is described in JSONLD-java which is used by Jena.

  3. Modify the local copy of the context to include the @list keyword on properties where order is important (e.g. creator)

The above will support indexing of SOSO Dataset docs that reference the schema.org context using a construct like (and similar variants):

{
  "@context": "http://schema.org/"
}

It will not fully support a construct like:

{
  "@vocab":"https://schema.org/"
}

unless such documents specifically include the @list keyword on properties where it is needed.
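Step 3 could be done with a small script run once against the downloaded context. A sketch in Python, assuming term definitions in the context are either a plain IRI string or an object with "@id"; `add_list_container` is a hypothetical helper and the `original` fragment is a toy stand-in for the real schema.org context:

```python
import copy

def add_list_container(context_doc, terms):
    """Return a copy of a context document with "@container": "@list"
    set on the given terms, so their values keep document order."""
    patched = copy.deepcopy(context_doc)
    ctx = patched["@context"]
    for term in terms:
        definition = ctx.get(term)
        if definition is None:
            # Term not defined in this context: skip rather than guess.
            continue
        if isinstance(definition, str):
            # Expand the short form "term": "iri" into an object first.
            definition = {"@id": definition}
        definition["@container"] = "@list"
        ctx[term] = definition
    return patched

# Toy fragment standing in for the real schema.org context.
original = {"@context": {"schema": "http://schema.org/",
                         "creator": {"@id": "schema:creator"},
                         "name": "schema:name"}}

patched = add_list_container(original, ["creator"])
print(patched["@context"]["creator"])
```

The function works on a deep copy, so the downloaded context stays unchanged and the patched version can be saved alongside it.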

datadavev commented 3 years ago

Jena has a test that demonstrates context override:

https://github.com/apache/jena/blob/main/jena-arq/src/test/java/org/apache/jena/riot/TestJsonLDReader.java

gothub commented 3 years ago

@taojing2002 and I are updating JsonLdSubprocessor to use the approach that @datadavev has developed for gmn/slendernode, which uses jsonld-java. The approach is:

gothub commented 3 years ago

The JsonLdSubprocessor has been updated to:

gothub commented 3 years ago

d1_cn_index_processor v2.3.13 implements the changes described in the previous two posts. These allow processing of input documents that use either https://schema.org or http://schema.org, specified in one of the following ways:

"@context": {
     "@vocab":"http://schema.org/"
}

or

"@context": {
    "@vocab":"https://schema.org/"
}

or

"@context": "http://schema.org/"

or

"@context": "https://schema.org/"