ESIPFed / science-on-schema.org

science-on-schema.org - providing guidance for publishing schema.org as JSON-LD for the sciences
Apache License 2.0

context must be a map #151

Closed fils closed 2 years ago

fils commented 3 years ago

So I have been running into this with @smrgeoinfo and I saw it in the example by @datadavev

Using Dave's example of

{
  "@context":"https://schema.org/",
  "@type":"Dataset",
  "name":"test",
  "description": "This is a description of the test. Here's some more words to make it long enough."
}

If you paste this into the JSON-LD playground, you will see that it expands to http, not https

modify the context to a map as

{
  "@context": {
    "@vocab": "https://schema.org/"
  },
  "@type": "Dataset",
  "name": "test",
  "description": "This is a description of the test. Here's some more words to make it long enough."
}

It will expand correctly with https as at https://tinyurl.com/y99kj7d7

reference https://www.w3.org/TR/json-ld/#context-definitions

specifically:

A context definition MUST be a map whose keys MUST be either terms, 
compact IRIs, IRIs, or one of the keywords @base, 
@import, @language, @propagate, @protected, @type, @version, or @vocab.

It would appear that our examples and recommendations need to use maps (at least if we want JSON-LD 1.1, which I suspect this is part of).

I've been running into this issue in some of my development work.... Comments and observations welcome..

datadavev commented 3 years ago

Contexts can either be directly embedded into the document (an embedded context) or be referenced using a URL. -- w3.org/TR/json-ld11/

The JSON-LD processor makes a request like:

 curl -v -H "Accept: application/ld+json" "https://schema.org/" > /dev/null

it gets back a response that includes a link:

link: </docs/jsonldcontext.jsonld>; rel="alternate"; type="application/ld+json"

That is followed to the context document located at https://schema.org/docs/jsonldcontext.jsonld which is the remote context referenced in the example. That context specifies, among other items:

"schema": "http://schema.org/",

Hence, the properties are expanded with the namespace http://schema.org/.
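To make the Link-header step concrete, here is a minimal sketch of how a client could turn that header into the absolute URL of the context document. The function name `context_url_from_link` is made up for illustration, and the parser only handles the simple header form shown above, not everything a Link header may legally contain.

```python
import re
from urllib.parse import urljoin


def context_url_from_link(request_url, link_header):
    """Return the absolute URL of the alternate JSON-LD document, or None.

    Toy parser: splits on commas and reads rel/type parameters, which is
    enough for the header shown above but not for every valid Link header.
    """
    for part in link_header.split(","):
        match = re.match(r'\s*<([^>]*)>(.*)', part)
        if not match:
            continue
        target, params = match.groups()
        # Collect parameters such as rel="alternate" into a dict.
        attrs = {
            key.lower(): value.strip('"')
            for key, value in re.findall(
                r';\s*([\w-]+)\s*=\s*("[^"]*"|[^;,\s]+)', params
            )
        }
        if (attrs.get("rel") == "alternate"
                and attrs.get("type") == "application/ld+json"):
            # The target is relative, so resolve it against the request URL.
            return urljoin(request_url, target)
    return None


header = '</docs/jsonldcontext.jsonld>; rel="alternate"; type="application/ld+json"'
print(context_url_from_link("https://schema.org/", header))
# https://schema.org/docs/jsonldcontext.jsonld
```

A real JSON-LD library should do this inside its document loader; the sketch only shows why the context ends up coming from /docs/jsonldcontext.jsonld rather than from the page at https://schema.org/ itself.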

This is exactly why we needed clarification on the "https" vs "http" namespace issue in #52.

I agree that sticking with https://schema.org/ as the namespace does require specifying the default context like:

"@context": {"@vocab":"https://schema.org/"}

fils commented 3 years ago

@datadavev

Thanks for the nice expansion...

Going further you can look at the context file pulled down and look for http

https is sadly missing and curl for either https://schema.org/docs/jsonldcontext.jsonld or http://schema.org/docs/jsonldcontext.jsonld returns the same file.. don't get me started..

looking for http (or https via substring match) we get

~/tmp grep http jsonldcontext.json 
        "@vocab": "http://schema.org/",
        "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
        "xsd": "http://www.w3.org/2001/XMLSchema#",
        "schema": "http://schema.org/",
        "owl": "http://www.w3.org/2002/07/owl#",
        "dc": "http://purl.org/dc/elements/1.1/",
        "dct": "http://purl.org/dc/terms/",
        "dctype": "http://purl.org/dc/dcmitype/",
        "void": "http://rdfs.org/ns/void#",
        "dcat": "http://www.w3.org/ns/dcat#",
        "httpMethod": { "@id": "schema:httpMethod"},

yet.. in the example https://tinyurl.com/y99kj7d7 things correctly expand to their https namespace, not http. Any insight into why this is the case?

This seems like it should not occur if the above context is pulled. Seems like application logic coming into play perhaps?

datadavev commented 3 years ago

This is the challenge of namespace ambiguity introduced by the "s". Despite progression towards a duality of schema.org concepts under http and https, the official and current context for schema.org resides at https://schema.org/docs/jsonldcontext.jsonld and that context specifies http://schema.org/ as the namespace.

Writing:

"@context": {"@vocab":"https://schema.org/"}

tells the JSON-LD processor that the entire context definition for the document is exactly the map that is the value of the "@context" key. Since that map does not contain a reference to a remote context (i.e. using the @import key), that map is the entirety of the context and so the JSON-LD processor does not retrieve a remote context when processing the document. Instead, the default context IRI specified by the value of @vocab is used to expand the relative IRIs in the document. Dataset is equal to https://schema.org/Dataset.

It's important to note that remote contexts are retrieved by a JSON-LD processor by following the spec for Remote Document and Context Retrieval. Basically, requests are made, following 303 redirects and using an Accept: application/ld+json header. Steps 4 and 5 therein describe how Link headers in the response are handled, and this step is typically not visible when using curl and other common HTTP clients unless specifically looking for that information.

Anyway, the outcome of all this is that specifying a context of "@context":{"@vocab":"https://schema.org/"} means that is the entire context. Specifying "@context":"https://schema.org/" means the JSON-LD processor will go and fetch a context document from that IRI, and that document provides the context map that uses a namespace of http://schema.org/ for the schema.org terms.

This of course does have much broader implications, since in specifying the context of "@vocab":"https://schema.org/", none of the information in the remote context is being retrieved and utilized in the processing of the document.
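The two cases can be sketched in a few lines of Python. This is a toy model, not a JSON-LD processor: `REMOTE_CONTEXTS`, `resolve_context` and `expand_type` are hypothetical names, and the "remote" document is a stub holding only the @vocab entry quoted earlier in this thread.

```python
# Stand-in for the document served at
# https://schema.org/docs/jsonldcontext.jsonld; only the @vocab entry
# quoted earlier in this thread is reproduced (assumption for the sketch).
REMOTE_CONTEXTS = {
    "https://schema.org/": {"@vocab": "http://schema.org/"},
}


def resolve_context(context):
    """Return the context map a processor would end up using."""
    if isinstance(context, str):
        # A string is a reference: the context comes from the remote document.
        return REMOTE_CONTEXTS[context]
    # A map is the entire context; nothing is fetched.
    return context


def expand_type(doc):
    """Expand the document's @type against the resolved @vocab (toy version)."""
    vocab = resolve_context(doc["@context"])["@vocab"]
    return vocab + doc["@type"]


by_reference = {"@context": "https://schema.org/", "@type": "Dataset"}
by_map = {"@context": {"@vocab": "https://schema.org/"}, "@type": "Dataset"}

print(expand_type(by_reference))  # http://schema.org/Dataset
print(expand_type(by_map))        # https://schema.org/Dataset
```

The string form yields the http namespace because the fetched document says so; the map form yields whatever the author typed, and everything else in the remote context is never seen.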

[edit: added note on default context]

fils commented 3 years ago

It is as I figured.... I appreciate the confirmation though. Sigh. From a developer POV, this little "s" really causes a lot of "hit" (sorry. there is a missing "s" in that "hit") ;)

datadavev commented 3 years ago

It's a widespread challenge, e.g. https://github.com/RDFLib/rdflib/issues/1120

smrgeoinfo commented 3 years ago

It's the cost of conflating the location of the resolver used to dereference an identifier with the identifier itself.

datadavev commented 3 years ago

Note that this issue will vaporize when schema.org v 12 comes out in March.

See: https://github.com/schemaorg/schemaorg/blob/main/data/releases/12.0/schemaorgcontext.jsonld

fils commented 3 years ago

@datadavev you made my day!!!!!!!

datadavev commented 3 years ago

Big relief for me too - there's a whole bunch of normalization code and gymnastics that can go away. Huzzah!

bonnland commented 2 years ago

Hi, could someone confirm if these two @context definitions are different or equivalent now? I'm seeing both forms in the ESIP recommendations examples, and I want to know if there is a "more correct" version:

{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "author": {
    "@type": "Person",
    "name": "Jane Goodall"
  }
}

vs.

{
  "@context": {
    "@vocab": "https://schema.org/"
  },  
  "@type": "Dataset",
  "author": {
    "@type": "Person",
    "name": "Jane Goodall"
  }
}

fils commented 2 years ago

@bonnland

The first is valid for JSON-LD 1.0
The second for JSON-LD 1.1

If you are working at this point forward, you should be using the map, the second one.

mbjones commented 2 years ago

We probably should update all of our examples to use the recommended form.

datadavev commented 2 years ago

Those two contexts are quite different. The first basically indicates "use the context that you can find at this address" (remote context ^1), the second "the default context for this document is this value" (default vocabulary^2).

fils commented 2 years ago

@datadavev I get your point.. that is only true in the context (no pun intended) that you view the document as a JSON-LD 1.1 document in both cases, correct?

I need to revisit now why I had processing errors in 1.1 mode with the previous approach when, as you point out, it seems a valid 1.1 pattern for remote context. (though that seems very poorly worded in the docs.. since all the contexts are typically web resolved in principle)

oddly there is

A context definition MUST be a map whose keys MUST be either terms, compact IRIs, IRIs, or one of the keywords @base, @import, @language, @propagate, @protected, @type, @version, or @vocab.

which seems at odds with the remote context reference https://www.w3.org/TR/json-ld11/#example-5-referencing-a-json-ld-context

Have you had the previous (un-mapped version) fail in a forced 1.1 process? I have.

fils commented 2 years ago

@datadavev Is it just me or the docs say...

"a context MUST be a map, except when it's not a map and then it is a remote context, though you can use @import for a remote context too, to make the context a map.... oh .. and any context you provide that isn't relative, is pulled remotely based on the IRI you provide" (this seems even more fun to read if you do it in an English accent) ;)

that seems less than wonderful :)

datadavev commented 2 years ago

it is messy, and further complicated by the opacity of what can go on behind the scenes when retrieving a remote context [^1].

If the value of @context is a relative or absolute URL, the document retrieved from that URL becomes the context.

In this case:

{
  "@context": "http://shorturl.at/ciqMW",
  "title": "A remote context doc"
}

the contents of the document retrieved by following the rules for JSON-LD retrieval become the context. That URL resolves to the JSON-LD:

{
  "@context": {
    "@vocab":"http://a.b/c/"
  }
}

That JSON-LD is processed like:

{
  "@context": {
    "@vocab":"http://a.b/c/"
  },
  "title": "A remote context doc"
}

and so expands like:

[
  {
    "http://a.b/c/title": [
      {
        "@value": "A remote context doc"
      }
    ]
  }
]

On the other hand, if the value of @context is a map, then that map becomes the context. So for example:

{
  "@context": {
    "@vocab": "http://shorturl.at/ciqMW/"
  },
  "title": "A local context doc"
}

The context is exactly as written, and the document expands to:

[
  {
    "http://shorturl.at/ciqMW/title": [
      {
        "@value": "A local context doc"
      }
    ]
  }
]

[^1]: https://www.w3.org/TR/json-ld11-api/#loaddocumentcallback, especially steps 4-5

fils commented 2 years ago

@datadavev

Your post above really needs to go into the docs, and it's clearer than the JSON-LD docs IMHO. I do follow what you are saying, and based on that I think I have a bug report to write up for a JSON-LD lib I use. :)

mbjones commented 2 years ago

Just to clarify all of this, I think our recommendations have shifted but we have not updated our documentation. Now that schema.org has clarified that the true namespace is http://schema.org/, but that https://schema.org/ can be used to retrieve a context file, I think this is what we are recommending:

  1. Best option for context
{
  "@context": {
    "@vocab": "http://schema.org/"
  },  
  "@type": "Dataset",
  "author": {
    "@type": "Person",
    "name": "Jane Goodall"
  }
}
  2. Acceptable for context
{
  "@context": "http://schema.org/",  
  "@type": "Dataset",
  "author": {
    "@type": "Person",
    "name": "Jane Goodall"
  }
}

OR

{
  "@context": "https://schema.org/",  
  "@type": "Dataset",
  "author": {
    "@type": "Person",
    "name": "Jane Goodall"
  }
}
  3. Incorrect / invalid as it produces the wrong namespace (https)
{
  "@context": {
    "@vocab": "https://schema.org/"
  },  
  "@type": "Dataset",
  "author": {
    "@type": "Person",
    "name": "Jane Goodall"
  }
}

If this is right, we need to update all docs, guidelines, examples, and SHACL rules.

mbjones commented 2 years ago

Started branch feature_151_context_namespace for fixing the namespace context consistency issues. More changes needed before we have a consistent set of guides.

datadavev commented 2 years ago

(1) has the effect of setting the default vocabulary. (2) has the effect of including the context statements defined in the referenced context document.

Effectively (1) replaces the document https://schema.org/docs/jsonldcontext.jsonld with the document:

  "@context": {
    "@vocab": "http://schema.org/"
  }

Hence, the general recommendation would be (2).

smrgeoinfo commented 2 years ago

@mbjones in your recent post it says "schema.org has clarified that the true namespace is http://schema.org", but in the examples 'http://schema.org/' is used (with the trailing slash). I'm guessing the true namespace should be http://schema.org/?

datadavev commented 2 years ago

For reference, the schema.org context document, and so namespace definition, is located here: https://schema.org/docs/jsonldcontext.jsonld

smrgeoinfo commented 2 years ago

the @vocab there is http://schema.org/, there's my answer. Thanks!

mbjones commented 2 years ago

Thanks for the clarifications, and yes, I should have said http://schema.org/. I'll go fix that.

mbjones commented 2 years ago

So, if the preference is for option 2, in our full example, how do we define the additional namespaces we need? Right now, on the branch I have the full.jsonld example as:

"@context": {
    "@vocab": "http://schema.org/",
    "prov": "http://www.w3.org/ns/prov#",
    "provone": "http://purl.dataone.org/provone/2015/01/15/ontology#",
    "spdx": "http://spdx.org/rdf/terms#"
}

Should the guidance be that we recommend option 2, except for when people need to define additional namespace prefixes?

datadavev commented 2 years ago

"@context": [
    "https://schema.org/",
    {
        "prov": "http://www.w3.org/ns/prov#",
        "provone": "http://purl.dataone.org/provone/2015/01/15/ontology#",
        "spdx": "http://spdx.org/rdf/terms#"
    }
]

[edit: use https for schema.org retrieval]
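For anyone wondering how the array form behaves, here is a rough sketch, assuming the usual rule that entries of a context array are applied in order with later definitions winning. `merge_contexts`, `expand_term` and `fake_loader` are hypothetical helpers, and the loader returns a stub instead of fetching the real schema.org context.

```python
def merge_contexts(contexts, load_remote):
    """Fold a list of context entries into one map, later entries winning."""
    merged = {}
    for entry in contexts:
        if isinstance(entry, str):
            # String entries are references to remote context documents.
            entry = load_remote(entry)
        merged.update(entry)
    return merged


def expand_term(term, context):
    """Expand a prefixed name like prov:Entity, else fall back to @vocab."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in context:
        return context[prefix] + local
    return context["@vocab"] + term


def fake_loader(url):
    # Hypothetical stand-in for fetching the schema.org context document;
    # only the @vocab entry is reproduced.
    return {"@vocab": "http://schema.org/"}


context = merge_contexts(
    [
        "https://schema.org/",
        {
            "prov": "http://www.w3.org/ns/prov#",
            "provone": "http://purl.dataone.org/provone/2015/01/15/ontology#",
            "spdx": "http://spdx.org/rdf/terms#",
        },
    ],
    fake_loader,
)

print(expand_term("prov:wasDerivedFrom", context))
# http://www.w3.org/ns/prov#wasDerivedFrom
print(expand_term("name", context))
# http://schema.org/name
```

So the schema.org terms keep whatever namespace the remote context defines, and the extra prefixes are layered on top without having to restate @vocab.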

fils commented 2 years ago

So to be clear the schema.org FAQ at https://schema.org/docs/faq.html#19 is now wrong? Schema.org is saying to use http? Also the developer section at https://schema.org/docs/developers.html shows there are multiple context files for the various namespace approaches. Yet our recommendation is to stick with the old http pattern?

datadavev commented 2 years ago

I think the FAQ is a bit misleading. The namespace is http://schema.org/, associated documents (such as the context) can be retrieved using http or https. The context document for schema.org defines the namespace and that is currently located at https://schema.org/docs/jsonldcontext.jsonld.

However, just to confuse things more, there are http and https variants of the vocabulary!

fils commented 2 years ago

That's what I mean.. the multiple vocab elements. I understand all of this. and I appreciate that currently the https file call returns http namespaced file (which I don't agree with) :)

this just worries me... it's a kicking the can down the road event IMHO.

agree to disagree I guess

datadavev commented 2 years ago

Adding to the confusion, some libraries (e.g. RDFLib) internally define constants for common namespaces, and RDFLib uses https://schema.org/ as the namespace. So I guess be prepared to be flexible.

fils commented 2 years ago

The libraries are going to be a pain.. major pain..
Also, you can't content negotiate for the schema.org JSON-LD context anyway. Due to DoS concerns they don't allow it, so libraries have to implement the resolution as a special case.

you can't curl negotiate at https://schema.org for the context.

datadavev commented 2 years ago

Right, there's a different set of rules beyond simple content negotiation^1 for finding the context - need to look at the response link header:

link: </docs/jsonldcontext.jsonld>; rel="alternate"; type="application/ld+json"

This is also something that is poorly implemented in the major libs (pyld and rdflib at least). I use a patched version of pyld to get around this issue and honor the json-ld processing rules in the spec.

fils commented 2 years ago

right..

curl -v https://schema.org
...
link: </docs/jsonldcontext.jsonld>; rel="alternate"; type="application/ld+json"
...

and I get it.. (literal and figurative) ;)

As you point out though the issues with the python libraries (same as in the Go libraries by the way)..
This is an implementation mess... my point though is that the trend in general will be toward https not away and since both namespace uses are accepted by schema.org (unless that policy is now changed?) we are tossing out future LOD patterns if we go http since the data web will be https, it has to be.

I'm not trying to change any minds. It sounds like it is already a done deal. I just have to resolve how to connect the other groups I work with, who are https focused now, with SOS, which will be http focused.

mbjones commented 2 years ago

I don't think it's a done deal if @fils and @datadavev aren't on board -- you two have more practical experience with this than anyone I know. I am just trying to clean up our recs and be consistent. And I don't have a strong opinion myself -- I agree the future is https, but thought SO had decided to stick with http in their context doc. If there is a straightforward way for us to recommend https where most libs and the shacl processor, etc. would recognize the terms as SO properly, then that has advantages. But given that https://schema.org/ returns a JSON-LD context with the http namespace, it seems like they are still using http. Please propose what you think we should do, and how providers and consumers should handle it.

fils commented 2 years ago

You are correct there.. their default is to return the http namespace even though they are rather indecisive elsewhere in their documentation. The result of that unfortunately is they sow confusion and delay (cue Thomas the Tank Engine) among library developers and elsewhere. :)

datadavev commented 2 years ago

Science-on-schema.org is about recommendations for application of schema.org to this domain, and so my impression is this group should not be overriding the specification. Hence, the recommendation here should be to use the namespace as published, which would be http://schema.org/. Options for specifying the context then include:

  1. {
     "@context":"https://schema.org/"
    }
  2. {
     "@context":"https://schema.org/docs/jsonldcontext.jsonld"
    }
  3. {
     [
       "@context":"https://schema.org/",
       {
         "prov": "http://www.w3.org/ns/prov#",
         "provone": "http://purl.dataone.org/provone/2015/01/15/ontology#",
         "spdx": "http://spdx.org/rdf/terms#"
       }
     ]
    }
  4. {
     "@context": {
         "@vocab": "http://schema.org/"
     }
    }

Where:

  1. Remote context reference (note that http or https may be specified here)
  2. Functionally equivalent to (1). The JSON-LD processor should resolve to this document from (1) if it implements the specification for following link headers.^1
  3. Remote context reference for schema.org and including other namespaces. Note that other remote contexts may also be specified in the list.
  4. Ignores the remote schema.org context, but makes http://schema.org/ the default namespace for the document.

Implementors should be aware that this may change in the future (i.e. "http" -> "https") and that existing implementations may internally use "https://schema.org/" as the namespace (e.g. RDFLib). Hence consumers should probably be applying namespace normalization to schema.org content to ensure consistent interpretation in an RDF processing environment.
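As an illustration of what such normalization might look like on the consumer side, here is a minimal sketch that rewrites https://schema.org/ to http://schema.org/ in expanded JSON-LD. The function names are hypothetical, and the choice to touch only property keys and @type values (leaving @id and @value alone) is an assumption made for the sketch, not a project recommendation.

```python
HTTPS_NS = "https://schema.org/"
HTTP_NS = "http://schema.org/"


def _to_http(iri):
    """Map an https://schema.org/ IRI onto the http:// namespace."""
    if isinstance(iri, str) and iri.startswith(HTTPS_NS):
        return HTTP_NS + iri[len(HTTPS_NS):]
    return iri


def normalize_schema_org(node):
    """Normalize property keys and @type values in expanded JSON-LD."""
    if isinstance(node, list):
        return [normalize_schema_org(item) for item in node]
    if isinstance(node, dict):
        out = {}
        for key, value in node.items():
            if key == "@type":
                # @type is a list on node objects and a string on value objects.
                if isinstance(value, list):
                    out[key] = [_to_http(t) for t in value]
                else:
                    out[key] = _to_http(value)
            elif key in ("@id", "@value"):
                # Identifiers and literals are deliberately left untouched.
                out[key] = value
            else:
                out[_to_http(key)] = normalize_schema_org(value)
        return out
    return node


expanded = [
    {
        "@type": ["https://schema.org/Dataset"],
        "https://schema.org/name": [{"@value": "test"}],
    }
]
print(normalize_schema_org(expanded))
```

Note that enumeration values referenced through @id are not rewritten by this sketch; whether they should be depends on how the consuming system treats identifiers.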

smrgeoinfo commented 2 years ago

+1 on Recommending namespace normalization. Dealing with the two namespaces has been an ongoing challenge with metadata integration in EarthCube GeoCodes, requiring messy SPARQL queries.

mbjones commented 2 years ago

OK, summarizing... going with Dave's examples, I'll write up a plan to recommend using the http namespace definition (as SO uses by default) by retrieving the context file from the https location, noting that it's also possible to retrieve it from the http location, and that the @vocab default can be used with http as well. We don't recommend using @vocab with the https URL, but harvesters and processors should in general normalize and treat https versions of the terms as equivalent to the http terms for SO. Finally, if one needs to include multiple namespaces, that can be done by building a context map from the retrieved context file plus additional namespace definitions. In my testing, I think the syntax in Dave's examples was a little turned around, so I think we should be using:

{
  "@context": [
    "https://schema.org/",
    {
      "prov": "http://www.w3.org/ns/prov#",
      "provone": "http://purl.dataone.org/provone/2015/01/15/ontology#",
      "spdx": "http://spdx.org/rdf/terms#"
    }
  ],
  "@type": "Dataset",
  "name": "Test data",
  "prov:wasDerivedFrom": {
    "@id": "https://doi.org/10.xxxx/Dataset-1"
  }
}

mbjones commented 2 years ago

Work on branch feature_151_context_namespace:

mbjones commented 2 years ago

Checked that shapes all validate with the namespace changes on our example files, and merged PR #199. This issue will remain open for commentary for a bit longer, but the planned changes are now merged into develop.

mbjones commented 2 years ago

Reviewed at meeting on 7 Feb 2022 -- agreed it was complete, but reopen this issue if discrepancies are found.