schema.org indexing recognizes 'https://schema.org' and not 'http://schema.org'

gothub commented 3 years ago

This issue was transferred from the metacat repo issue, as the desired solution is to change the SPARQL queries in this repo to address this problem, and not have metacat update documents to use the 'https://schema.org' namespace.

When manually uploading a schema.org document with the JSON-LD context set to

    "@context": {
      "@vocab": "http://schema.org/"
    },

none of the SO:Dataset fields are indexed to Solr. The reason is that when metacat-index serializes the document to RDF/XML, all SO predicates are serialized in the namespace given by that context, for example:

<https://dataone.org/datasets/doi%3A10.18739%2FA2JQ0SW4G> <http://schema.org/datePublished> "2021-01-01T00:00:00Z" .

The SPARQL queries that are used to extract info from the document all use the 'https://schema.org' namespace.

Do we need to support both "http://schema.org" and "https://schema.org"? It looks like the transition from http to https may linger for a long time, e.g. https://schema.org/docs/faq.html#19

Note that the slender node implementation converts harvested documents from "http://schema.org" to "https://schema.org"

If we do support both, then which of the following should be used to implement:

- wrangle the SPARQL queries to support either namespace
- update metacat so that it modifies the documents

Here are the indexing results for the test docs:

with 'http://schema.org', didn't index properly
- https://mn-sandbox-ucsb-2.test.dataone.org/knb/d1/mn/v2/query/solr/q=id:%22urn:uuid:afb4184a-e02f-44e3-9bc5-2885b05b3ee9%22
with 'https://schema.org', did index properly
- https://mn-sandbox-ucsb-2.test.dataone.org/knb/d1/mn/v2/query/solr/?q=id:%22urn:uuid:7e6af06e-4ff6-426f-b5f8-b4f44230ffa7%22

gothub commented 3 years ago

The SPARQL queries in src/main/resources/application-context-schema-org.xml can be updated to support both namespaces "http://schema.org" and "https://schema.org".

This will significantly complicate the queries, and they will have to be fully re-tested after update.

Different methods have to be used for RDF predicates and RDF objects. For predicates, the SPARQL alternative property path operator (|) is used. For objects, the VALUES keyword is used to match multiple values.

Here is an example of an original query that matches only https://schema.org and one modified to match both namespaces.

matches only https://schema.org:

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX SO:   <https://schema.org/>

SELECT
    ( str(?name) as ?awardTitle)
WHERE {
    ?datasetId rdf:type SO:Dataset .
    ?awardId rdf:type SO:MonetaryGrant .
    ?awardId SO:fundedItem ?datasetId .
    ?awardId SO:name ?name .
}

matches both:

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX SO_NS1:   <http://schema.org/>
PREFIX SO:   <https://schema.org/>

SELECT
    ( str(?name) as ?awardTitle)
WHERE {
    VALUES ?value { SO_NS1:Dataset SO:Dataset } .
    VALUES ?value2 { SO_NS1:MonetaryGrant SO:MonetaryGrant} .
    ?datasetId rdf:type ?value .
    ?awardId rdf:type ?value2 .
    ?awardId SO:fundedItem|SO_NS1:fundedItem ?datasetId .
    ?awardId SO:name|SO_NS1:name ?name .
}
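Since every query in the application context would need the same two rewrites, the fragments could be generated rather than hand-edited. Below is a minimal Python sketch of that idea, not part of the indexer: the helper names (`type_values`, `predicate_path`, `award_title_query`) are hypothetical, and the prefixes are the ones used in the example above.

```python
# Hypothetical helpers that build the two-namespace SPARQL fragments shown above.
# Prefix names follow the example query; nothing here is part of the indexer itself.
NAMESPACES = {"SO": "https://schema.org/", "SO_NS1": "http://schema.org/"}


def prefix_lines():
    """PREFIX declarations for every schema.org namespace variant."""
    return "\n".join(f"PREFIX {p}: <{iri}>" for p, iri in NAMESPACES.items())


def type_values(var, local_name):
    """VALUES clause listing a class in each namespace (used for RDF objects)."""
    terms = " ".join(f"{p}:{local_name}" for p in NAMESPACES)
    return f"VALUES ?{var} {{ {terms} }} ."


def predicate_path(local_name):
    """Alternative property path matching a predicate in each namespace."""
    return "|".join(f"{p}:{local_name}" for p in NAMESPACES)


def award_title_query():
    """Rebuild the 'matches both' award title query from the fragments."""
    return f"""PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
{prefix_lines()}

SELECT (str(?name) AS ?awardTitle)
WHERE {{
    {type_values("datasetType", "Dataset")}
    {type_values("awardType", "MonetaryGrant")}
    ?datasetId rdf:type ?datasetType .
    ?awardId rdf:type ?awardType .
    ?awardId {predicate_path("fundedItem")} ?datasetId .
    ?awardId {predicate_path("name")} ?name .
}}"""


if __name__ == "__main__":
    print(award_title_query())
```

Generated queries would still need the full re-test mentioned above; the sketch only removes the manual editing step.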

mbjones commented 3 years ago

@gothub Looks interesting. Shouldn't the last line be ?awardId SO:name|SO_NS1:name ?name .?

taojing2002 commented 3 years ago

At our Thursday dev meeting, we discussed the issue. I think we agreed that the indexer will only support the namespace http://schema.org rather than https://schema.org in the SPARQL queries.

Currently the index only supports https://schema.org, so we need to change it. Also, in the slender node part, the json-ld documents need to be modified as well. @datadavev, is this true?

@mbjones @csjx

taojing2002 commented 3 years ago

I am talking about this release.

mbjones commented 3 years ago

@taojing2002 Just to clarify, from my perspective we agreed to something different -- that SOSO should encourage use of http://schema.org/, but that DataONE and other consumers should liberally accept both that and https://schema.org/. We discussed several ways this could be accomplished, but the simplest I think is what @datadavev proposed -- for us to use a local copy of the http context file and load that whenever we encounter the https variant of the namespace. Given that the current SOSO guidance is that people should use https, I think it's a strategic mistake for us to not support the SOSO recommended namespace in addition to http.

To be clear, I can see several cases we need to deal with...

Case 1: Loading context file by URL

Here we find either "@context": "http://schema.org/" or "@context": "https://schema.org/" in the document. In both cases, the JSON-LD processor will find the context file located on schema.org and load the vocabulary defined there using the http namespace, and all will be good. No extra steps needed for us.

Case 2: @vocab with `http` namespace

Here we find "@context": { "@vocab": "http://schema.org/" } in the document, defining the default vocabulary prefix, and again all should be more or less fine, as unqualified keys will be interpreted to be in the http namespace.

Case 3: @vocab with `https` namespace

Here we find "@context": { "@vocab": "https://schema.org/" } in the document, defining the default vocabulary prefix, so now unqualified keys will be interpreted to be in the https namespace. This would be a problem that we need to deal with differently, and where one of the various solutions that we discussed during the call could be employed:

- rewrite the document during the canonicalization process to use the http namespace before it gets parsed
- rather than taking the vocab at face value, instead load a pre-cached version of the context file for that https-based namespace that defines the classes in the http namespace. I'm not entirely clear how this would work, but I think @datadavev had an idea in mind that was similar to XML catalogs
- load an additional set of triples that define owl:equivalentClass and owl:equivalentProperty alignments between each of the http and https variants of schema.org. Then, be sure that SPARQL queries use this when they are extracting information from the graph (e.g., a query for instances of type http://schema.org/Dataset would also return instances of https://schema.org/Dataset via the equivalence relation). This requires more than just RDF querying -- it requires a reasoner be used.
- other options?

I'm not sure if I got all of the cases or got the proposed solutions right, but I thought it might be helpful to summarize them all so that we can make a clear choice that handles both namespaces properly. Please add other cases or correct me where I've gone astray. Thanks!
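For the first option under Case 3 (rewriting the document to the http namespace before parsing), a rough Python sketch might look like the following. It is an illustration under narrow assumptions, not a proposed implementation: the function names are hypothetical, only the top-level @context is inspected, and documents that use full https IRIs as keys or define their own prefixes are not handled.

```python
# Minimal sketch of Case 3, option 1: switch an https @vocab to http before parsing.
# Assumption: only a top-level "@context" (dict, or list containing dicts) is handled.
HTTPS_NS = "https://schema.org/"
HTTP_NS = "http://schema.org/"


def _rewrite_entry(entry):
    """Return a context entry with an https schema.org @vocab switched to http."""
    if isinstance(entry, dict):
        vocab = entry.get("@vocab")
        if isinstance(vocab, str) and vocab.rstrip("/") == HTTPS_NS.rstrip("/"):
            rewritten = dict(entry)  # copy so the caller's document is untouched
            rewritten["@vocab"] = HTTP_NS
            return rewritten
    # URL strings (Case 1) and http @vocab (Case 2) are left as they are.
    return entry


def use_http_vocab(doc):
    """Return a shallow copy of doc whose top-level @context uses the http @vocab."""
    result = dict(doc)
    context = result.get("@context")
    if isinstance(context, list):
        result["@context"] = [_rewrite_entry(e) for e in context]
    elif context is not None:
        result["@context"] = _rewrite_entry(context)
    return result


if __name__ == "__main__":
    doc = {"@context": {"@vocab": "https://schema.org/"}, "@type": "Dataset"}
    print(use_http_vocab(doc))
```

Because JSON-LD allows the same graph to be written in many shapes, a plain JSON rewrite like this is fragile, which is one reason to prefer doing the rewrite on a normalized form of the document.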

datadavev commented 3 years ago

The basic approach is:

1. Grab a copy of the schema.org context and adjust it a bit by adding the @list keywords as needed.
2. When loading JSON-LD the processor should retrieve the remote context. The JSON-LD processing spec identifies hooks for a processor [1] to be modified to retrieve a local copy of the context instead of one from https://schema.org/
3. The local context is loaded and the @list keyword used by the processor

This is easy to do in python, not sure about Jena's support for the JSON-LD processing spec. Could always pre-process before handing to Jena.

If the JSON-LD uses "@vocab":"https://schema.org/" then a remote context will not be loaded, so things get messy. One approach is to expand [2] the document, then compact [3] it using a context that matches the namespace of the @vocab. After compaction, the document can be handled as JSON since it will be in a familiar structure. Then in JSON, change the target of @context to point at the preferred context, and proceed with 1-3 above. The important part is the expansion and compaction steps, since otherwise the structure of the JSON may vary considerably.

[1] https://www.w3.org/TR/json-ld11-api/#loaddocumentcallback
[2] https://www.w3.org/TR/json-ld11-api/#expansion-algorithm
[3] https://www.w3.org/TR/json-ld11-api/#compaction-algorithm

datadavev commented 3 years ago

After discussion with Jing, the following approach will be taken:

Duplicate the indexing rules to support querying both http and https namespace variants.

This will provide initial support for both variants of the namespace. It will not enforce the @list keyword for consistent ordering unless it is provided specifically in the JSON-LD document.
Jena 4.1 supports replacement of context on load of JSON-LD. Verify functionality in version used by DataONE and intercept requests to any of these with a local copy of the schema.org context (these all resolve to the same document):
```
http://schema.org
http://schema.org/
https://schema.org
https://schema.org/
http://schema.org/docs/jsonldcontext.jsonld
https://schema.org/docs/jsonldcontext.jsonld
```
The basic approach is described in JSONLD-java which is used by Jena.
Modify the local copy of the context to include the @list keyword on properties where order is important (e.g. creator)

The above will support indexing of SOSO Dataset docs that reference the schema.org context using a construct like (and similar variants):

{
  "@context": "http://schema.org/"
}

It will not fully support a construct like:

{
  "@vocab":"https://schema.org/"
}

unless such documents specifically include the @list keyword on properties where it is needed.
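To illustrate the context interception and the @list modification, here is a small Python sketch. It is only an outline under stated assumptions: `with_list_container`, `is_schema_org_context` and `make_loader` are hypothetical names, treating URLs with and without a trailing slash as equivalent is my assumption, and the returned dictionary loosely mirrors the RemoteDocument fields (`contextUrl`, `documentUrl`, `document`) from the JSON-LD API spec rather than any particular library's loader contract.

```python
import copy

# Assumption: a trailing slash is ignored when matching schema.org context URLs.
SCHEMA_ORG_CONTEXT_URLS = {
    "http://schema.org",
    "https://schema.org",
    "http://schema.org/docs/jsonldcontext.jsonld",
    "https://schema.org/docs/jsonldcontext.jsonld",
}


def is_schema_org_context(url):
    """True if the URL is one of the variants that resolve to the schema.org context."""
    return url.rstrip("/") in SCHEMA_ORG_CONTEXT_URLS


def with_list_container(context_doc, terms=("creator",)):
    """Copy a context document and mark the given terms as ordered (@list)."""
    ctx = copy.deepcopy(context_doc)
    definitions = ctx["@context"]
    for term in terms:
        if term not in definitions:
            continue  # only adjust terms the context already defines
        definition = definitions[term]
        if isinstance(definition, str):
            definition = {"@id": definition}  # expand the shorthand form
        definition["@container"] = "@list"
        definitions[term] = definition
    return ctx


def make_loader(local_context, fallback_loader):
    """Build a loader that serves the local context for any schema.org context URL."""
    def loader(url, options=None):
        if is_schema_org_context(url):
            return {
                "contextUrl": None,
                "documentUrl": url,
                "document": local_context,
            }
        return fallback_loader(url, options)  # everything else goes to the network
    return loader
```

In the Java indexer the equivalent hook would be the document loader of the JSON-LD library in use; the sketch only shows the decision logic.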

datadavev commented 3 years ago

Jena has a test that demonstrates context override:

https://github.com/apache/jena/blob/main/jena-arq/src/test/java/org/apache/jena/riot/TestJsonLDReader.java

gothub commented 3 years ago

@taojing2002 and I are updating JsonLdSubprocessor to use the approach that @datadavev has developed for gmn/slendernode, which uses jsonld-java. The approach is:

metacat will store the JSONLD document as it is provided by the user
when indexing occurs via d1_cn_index_processor (JsonLdSubprocessor), these steps are performed:
- the document is expanded (JsonLDProcessor.expand())
- the expanded document is inspected to determine if 'https://schema.org' or 'http://schema.org' is used
- the document is compacted using a cached version of the schema.org context, so that all schema.org IRIs are shortened to terms
- the document is then expanded with '@context : "http://schema.org"', so that the SO namespace of schema properties will match the SPARQL queries used for indexing
  - note that in this step, "creator" terms are converted to a JSON list
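The inspection step (deciding which namespace an expanded document uses) could be sketched as below. This is a Python illustration with a hypothetical function name, not the JsonLdSubprocessor code; it assumes the usual expanded form, where property keys and @type values are full IRIs and literals sit under @value.

```python
# Sketch: find which schema.org namespace variants an expanded JSON-LD document uses.
NAMESPACES = ("http://schema.org/", "https://schema.org/")


def _namespace_of(iri):
    """Return the schema.org namespace an IRI belongs to, or None."""
    if isinstance(iri, str):
        for ns in NAMESPACES:
            if iri.startswith(ns):
                return ns
    return None


def schema_org_namespaces(node, found=None):
    """Walk an expanded document and collect the schema.org namespaces in use."""
    if found is None:
        found = set()
    if isinstance(node, list):
        for item in node:
            schema_org_namespaces(item, found)
    elif isinstance(node, dict):
        for key, value in node.items():
            if key == "@type":
                types = value if isinstance(value, list) else [value]
                found.update(ns for ns in map(_namespace_of, types) if ns)
            elif key == "@value":
                continue  # literal text, not an IRI
            else:
                ns = _namespace_of(key)
                if ns:
                    found.add(ns)
                schema_org_namespaces(value, found)
    return found


if __name__ == "__main__":
    expanded = [{
        "@type": ["https://schema.org/Dataset"],
        "https://schema.org/name": [{"@value": "Example dataset"}],
    }]
    print(schema_org_namespaces(expanded))
```

A result containing both variants would indicate a mixed document, which the steps above do not explicitly cover.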

gothub commented 3 years ago

The JsonLdSubprocessor has been updated to:

use JSONLD document loader to load appropriate schema.org context files for expansion and compaction operations
the goal is to ensure that the input JSONLD document is preprocessed to always use http://schema.org so preprocessed document matches namespace used by indexing SPARQL queries
also, ensure that "creator" entries are always represented as a "@list" so the first creator can be extracted as the index field "author"
the location of the schema files is searched for in this order:
- location set by DataONE or metacat configuration
- if config not set, check in /etc/dataone/index/contexts
- if those don't exist, use fallback contexts files in d1_cn_index_processor jar file
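The lookup order for the context files can be summarized in a few lines. The Python sketch below is one literal reading of the list above, not the actual Java logic: the function name and return shape are hypothetical, and it assumes the /etc/dataone/index/contexts directory is consulted only when no location is configured.

```python
from pathlib import Path

# Default location named in the list above.
DEFAULT_CONTEXT_DIR = Path("/etc/dataone/index/contexts")


def resolve_context_source(configured_dir=None, default_dir=DEFAULT_CONTEXT_DIR):
    """Return (source, path) for the schema.org context files.

    source is "configured", "default" or "bundled"; path is None for "bundled",
    meaning the fallback copies shipped inside the jar would be used.
    """
    if configured_dir:
        # 1. location set by DataONE or metacat configuration
        source, candidate = "configured", Path(configured_dir)
    else:
        # 2. config not set: check the default directory
        source, candidate = "default", Path(default_dir)
    if candidate.is_dir():
        return (source, candidate)
    # 3. nothing usable on disk: fall back to the bundled contexts
    return ("bundled", None)


if __name__ == "__main__":
    print(resolve_context_source())
```

Returning the source label alongside the path makes it easy to log which copy of the context was used during indexing.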

gothub commented 3 years ago

d1_cn_index_processor v2.3.13 implements the changes described in the previous 2 posts that allow processing of input documents with either https://schema.org or http://schema.org specified as one of the following:

"@context": {
     "@vocab":"http://schema.org/"
}

or

"@context": {
    "@vocab":"https://schema.org/"
}

or

"@context": "http://schema.org/"

or

"@context": "https://schema.org/"
