ResearchObject / ro-crate

Research Object Crate
https://w3id.org/ro/crate/
Apache License 2.0

RO-Crate context should use https for schema.org IRIs #349

Open dnlbauer opened 3 weeks ago

dnlbauer commented 3 weeks ago

In recent years, the use of HTTPS over HTTP to reference resources on the web has become the de facto standard across the internet. Today, many modern browsers have started to automatically redirect, or in some cases, even block traffic that attempts to use HTTP instead of the more secure HTTPS. This transition is generally seamless, as redirecting HTTP requests to HTTPS has become a common practice among HTTP and proxy servers.

Unfortunately, for linked data, using HTTP for vocabularies is still common. This is also true for RO-Crate: while the specification provides the JSON-LD context for RO-Crates via HTTPS (https://www.researchobject.org/ro-crate/specification/1.1/context.jsonld), the context itself is not consistent in the scheme it uses. For schema.org, the vocabulary uses HTTP (http://schema.org). For Bioschemas, it uses HTTPS (https://bioschemas.org).

Schema.org, on the other hand, already completed its move to HTTPS with Release 12.0 in 2021:

  • PR #2814: Completed process of moving Schema.org to https. This step included the move to https of vocabulary term definitions and consequent change to https of the canonical URI displayed under the term pages [more...] tag. The web site will continue to respond to both http and https URLs. Download files will continue to support both protocols.

As part of this move, the context for schema.org is now available for HTTPS and HTTP separately, but HTTPS is the preferred version.

I suggest also moving all schema.org term definitions in the RO-Crate context to HTTPS. Not only is HTTPS more secure, it would also make the scheme consistent across the context. This makes the specification easier to work with, especially for less experienced users.

dnlbauer commented 3 weeks ago

Why mixing schemes can be problematic

Algorithms working with JSON-LD usually treat term definitions and the IRIs they expand to as string literals. For example, the IRIs https://schema.org/author and http://schema.org/author convey the same meaning to a human reader (and they in fact point to the same resource due to HTTP redirection), but software operating on these terms treats them as different strings and thus as different terms.

This can quickly lead to bugs in cases where multiple contexts are mixed that do not use schemes uniformly.
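As a minimal sketch of that string-level behaviour (standard library only; the variable names are illustrative and not part of any JSON-LD API), the two IRIs differ only in their scheme, yet a plain string comparison keeps them apart:

```python
from urllib.parse import urlsplit

https_iri = "https://schema.org/author"
http_iri = "http://schema.org/author"

# A plain string comparison treats the two IRIs as different terms.
same_term = https_iri == http_iri

# Host and path are identical; only the scheme differs.
a, b = urlsplit(https_iri), urlsplit(http_iri)
differs_only_in_scheme = (
    (a.netloc, a.path) == (b.netloc, b.path) and a.scheme != b.scheme
)

print(same_term, differs_only_in_scheme)
```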

Example

The following excerpt from an RO-Crate defines a dataset with an author. As suggested by the specification, it uses the RO-Crate context with HTTPS. Additionally, it uses type coercion so that the author's @id can be referenced directly as a string, without a construct such as {"@id": "https://orcid.org/0000-0001-9447-460X"}.

{
  "@context": [
    "https://w3id.org/ro/crate/1.1/context",
    {
      "author": {
        "@id": "https://schema.org/author",
        "@type": "@id"
      }
    }
  ],
  "@graph": [
    {
      "@id": "./",
      "@type": "Dataset",
      "name": "Test",
      "author": "https://orcid.org/0000-0001-9447-460X"
    },
    {
      "@id": "https://orcid.org/0000-0001-9447-460X",
      "@type": "Person",
      "name": "Daniel Bauer"
    }
  ]
}

Without looking into the RO-Crate context, it's reasonable to use https://schema.org/author here, isn't it? After all, the RO-Crate context is also referenced via HTTPS, so what could possibly go wrong?

Well, expanding this document (e.g. with the pyld library) leads to the document shown below. All terms are now represented with HTTP-based IRIs, except for the author, which uses HTTPS.

[
  {
    "@id": "./",
    "@type": ["http://schema.org/Dataset"],
    "https://schema.org/author": [
      {"@id": "https://orcid.org/0000-0001-9447-460X"}
    ],
    "http://schema.org/name": [
      {"@value": "Test"}
    ]
  },
  {
    "@id": "https://orcid.org/0000-0001-9447-460X",
    "@type": ["http://schema.org/Person"],
    "http://schema.org/name": [
      {"@value": "Daniel Bauer"}
    ]
  }
]
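A mismatch like this can be spotted mechanically. The following is a rough sketch, not part of pyld or of any RO-Crate tooling: the hypothetical helper `schemes_by_host` walks an expanded JSON-LD document, here a copy of the output above, and records which URL schemes appear per host among property and type IRIs. It assumes the input is already in expanded form.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def schemes_by_host(node, found=None):
    """Collect the URL schemes used per host by property and @type IRIs."""
    if found is None:
        found = defaultdict(set)
    if isinstance(node, list):
        for item in node:
            schemes_by_host(item, found)
    elif isinstance(node, dict):
        for key, value in node.items():
            if key == "@type":
                iris = value if isinstance(value, list) else [value]
            elif key.startswith("@"):
                continue  # skip @id, @value and other keywords
            else:
                iris = [key]
                schemes_by_host(value, found)  # descend into nested nodes
            for iri in iris:
                parts = urlsplit(iri)
                if parts.scheme in ("http", "https"):
                    found[parts.netloc].add(parts.scheme)
    return found


expanded = [
    {
        "@id": "./",
        "@type": ["http://schema.org/Dataset"],
        "https://schema.org/author": [
            {"@id": "https://orcid.org/0000-0001-9447-460X"}
        ],
        "http://schema.org/name": [{"@value": "Test"}],
    },
    {
        "@id": "https://orcid.org/0000-0001-9447-460X",
        "@type": ["http://schema.org/Person"],
        "http://schema.org/name": [{"@value": "Daniel Bauer"}],
    },
]

# Hosts whose vocabulary terms appear with more than one scheme.
mixed = {
    host: sorted(schemes)
    for host, schemes in schemes_by_host(expanded).items()
    if len(schemes) > 1
}
print(mixed)
```

For this document, only schema.org is reported, because node identifiers such as the ORCID IRI are deliberately ignored.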

If we flatten this graph against a different context, e.g. only the RO-Crate context to eliminate the type coercion, not all terms are compacted to their short names, because of this scheme mismatch:

{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {
      "@id": "./",
      "@type": "Dataset",
      "name": "Test",
      "https://schema.org/author": { "@id": "https://orcid.org/0000-0001-9447-460X" }
    },
    {
      "@id": "https://orcid.org/0000-0001-9447-460X",
      "@type": "Person",
      "name": "Daniel Bauer"
    }
  ]
}
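Why the author key stays a full IRI can be illustrated with a toy model of term selection. This is a deliberately simplified sketch, not the JSON-LD compaction algorithm: `CONTEXT_EXCERPT` is a hypothetical two-term excerpt standing in for the RO-Crate context, and `compact_key` is an illustrative helper that finds a short term by exact string match.

```python
# Hypothetical excerpt; the real RO-Crate context defines many more terms.
CONTEXT_EXCERPT = {
    "name": "http://schema.org/name",
    "author": "http://schema.org/author",
}

# Reverse lookup from IRI to short term, keyed by the exact IRI string.
IRI_TO_TERM = {iri: term for term, iri in CONTEXT_EXCERPT.items()}


def compact_key(iri):
    """Return the short term for an IRI, or the IRI itself if none matches."""
    return IRI_TO_TERM.get(iri, iri)


expanded_node = {
    "http://schema.org/name": [{"@value": "Test"}],
    "https://schema.org/author": [
        {"@id": "https://orcid.org/0000-0001-9447-460X"}
    ],
}

compacted_keys = sorted(compact_key(key) for key in expanded_node)
print(compacted_keys)
```

Because the lookup is keyed by the exact string, the HTTPS form of the author IRI finds no matching term and is kept as-is, while the HTTP form of name is shortened.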

While this could easily be resolved by using the HTTP form of schema.org for the type coercion of the author, I feel that errors like this can arise too easily with the current RO-Crate specification that suggests HTTPS for the context, but uses HTTP internally.

Especially for inexperienced users, this subtle difference is very hard to spot. By being more consistent (that is, using HTTPS everywhere), we can avoid such problems.