digitalbazaar / pyld

JSON-LD processor written in Python
https://json-ld.org/

IRI expansion with missing `@base` does not conform to RFC 3986 #187

Open RinkeHoekstra opened 10 months ago

RinkeHoekstra commented 10 months ago

RFC 3986 section 5.1 specifies that relative URI references are resolved against the document's base URI. In the absence of an explicit base, it prescribes an order of precedence for determining the base URI of a given document:

5.1.1. Base URI Embedded in Content
5.1.2. Base URI from the Encapsulating Entity
5.1.3. Base URI from the Retrieval URI
5.1.4. Default Base URI

The current implementation in pyLD ignores the last two steps (5.1.3 and 5.1.4). For 5.1.3 this is understandable, as the library only operates on a data payload. However, 5.1.4 is the catch-all that would ensure that @id values are always expanded to absolute IRIs.
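To make the precedence concrete, here is a minimal sketch of the section 5.1 fallback chain using only the Python standard library. It is not pyLD code: `select_base`, `resolve`, and the `DEFAULT_BASE` value are hypothetical names chosen for illustration, and the default base is an arbitrary placeholder that an application would replace with its own.

```python
from typing import Optional
from urllib.parse import urljoin

# Hypothetical application-specific fallback (RFC 3986 section 5.1.4).
DEFAULT_BASE = "https://example.invalid/"


def select_base(embedded: Optional[str] = None,
                encapsulating: Optional[str] = None,
                retrieval: Optional[str] = None,
                default: str = DEFAULT_BASE) -> str:
    """Return the first available base URI, in RFC 3986 section 5.1 order."""
    for candidate in (embedded, encapsulating, retrieval):
        if candidate:
            return candidate
    # 5.1.4: nothing else is known, so fall back to the default base.
    return default


def resolve(reference: str, **sources: Optional[str]) -> str:
    """Resolve a relative reference against the selected base URI."""
    return urljoin(select_base(**sources), reference)


# No base information at all: the default base still yields an absolute IRI.
print(resolve("thing"))
# A retrieval URI takes precedence over the default.
print(resolve("#frag", retrieval="http://example.org/doc.jsonld"))
```

With a chain like this, a relative @id can only stay relative if the caller deliberately opts out of expansion.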

Without this fallback, relative @id values in documents whose context does not explicitly specify a base are not expanded to absolute IRIs. As a result, the to_rdf function ignores them when producing N-Quads output. This is a showstopper for https://github.com/RDFLib/rdflib/issues/2308.

The JSON-LD spec does provide a way to prevent expansion against a base, namely setting @base to null (see https://www.w3.org/TR/json-ld/#base-iri), but it does not specify that null is the default.

This violates test t0060 of the JSON-LD test suite (the expand test whose input is `expand/0060-in.jsonld`).

The output should be similar to the following (with a different, application-specific base):

[
  {
    "@id": "https://w3c.github.io/json-ld-api/tests/document-relative",
    "@type": [ "https://w3c.github.io/json-ld-api/tests/expand/0060-in.jsonld#document-relative" ],
    "http://example.com/vocab#property": [
      {
        "@id": "http://example.org/document-base-overwritten",
        "@type": [ "http://example.org/test/#document-base-overwritten" ],
        "http://example.com/vocab#property": [
          {
            "@id": "https://w3c.github.io/json-ld-api/tests/document-relative",
            "@type": [ "https://w3c.github.io/json-ld-api/tests/expand/0060-in.jsonld#document-relative" ]
          },
          {
            "@id": "../document-relative",
            "@type": [ "#document-relative" ],
            "http://example.com/vocab#property": [ { "@value": "only @base is cleared" } ]
          }
        ]
      }
    ]
  }
]
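The absolute values in the expected output above follow from plain RFC 3986 reference resolution. A small sketch with the standard library's `urljoin`, assuming the base is the test document URL that appears in the expected @type values (the `DOCUMENT_URL` name is mine):

```python
from urllib.parse import urljoin

# Base IRI read off the expected output above (assumed to be the document URL).
DOCUMENT_URL = "https://w3c.github.io/json-ld-api/tests/expand/0060-in.jsonld"

# "../" climbs out of the expand/ directory before appending the last segment.
resolved_id = urljoin(DOCUMENT_URL, "../document-relative")

# A fragment-only reference keeps the full document URL and adds the fragment.
resolved_type = urljoin(DOCUMENT_URL, "#document-relative")

print(resolved_id)
# https://w3c.github.io/json-ld-api/tests/document-relative
print(resolved_type)
# https://w3c.github.io/json-ld-api/tests/expand/0060-in.jsonld#document-relative
```

With an application-specific default base in place of `DOCUMENT_URL`, the same two calls would give the application-specific equivalents.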

But the output of pyld is:

[
  {
    "@id": "../document-relative",
    "@type": [
      "#document-relative"
    ],
    "http://example.com/vocab#property": [
      {
        "@id": "http://example.org/document-base-overwritten",
        "@type": [
          "http://example.org/test/#document-base-overwritten"
        ],
        "http://example.com/vocab#property": [
          {
            "@id": "../document-relative",
            "@type": [
              "#document-relative"
            ]
          },
          {
            "@id": "../document-relative",
            "@type": [
              "#document-relative"
            ],
            "http://example.com/vocab#property": [
              {
                "@value": "only @base is cleared"
              }
            ]
          }
        ]
      }
    ]
  }
]

The resulting N-Quads output contains only a single triple:

<http://example.org/document-base-overwritten> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/test/#document-base-overwritten> .
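To see which nodes get lost, the following sketch walks an expanded document and lists every @id that is neither an absolute IRI nor a blank node identifier. It assumes the behaviour described above, namely that to_rdf skips nodes whose @id is not absolute. The helper names are hypothetical, and the sample data is a trimmed copy of the pyld output shown earlier.

```python
from urllib.parse import urlsplit


def is_absolute_iri(value: str) -> bool:
    """Treat any value with a scheme component as an absolute IRI."""
    return bool(urlsplit(value).scheme)


def relative_ids(value):
    """Yield each @id in an expanded document that is still relative."""
    if isinstance(value, list):
        for item in value:
            yield from relative_ids(item)
    elif isinstance(value, dict):
        node_id = value.get("@id")
        if (isinstance(node_id, str)
                and not node_id.startswith("_:")  # blank nodes are fine
                and not is_absolute_iri(node_id)):
            yield node_id
        for key, child in value.items():
            if not key.startswith("@"):  # only descend into properties
                yield from relative_ids(child)


# Trimmed copy of the pyld output above (@type entries omitted).
PROP = "http://example.com/vocab#property"
expanded = [{
    "@id": "../document-relative",
    PROP: [{
        "@id": "http://example.org/document-base-overwritten",
        PROP: [
            {"@id": "../document-relative"},
            {"@id": "../document-relative",
             PROP: [{"@value": "only @base is cleared"}]},
        ],
    }],
}]

print(list(relative_ids(expanded)))
# ['../document-relative', '../document-relative', '../document-relative']
```

Three of the four nodes carry a relative @id, which matches the observation that only the node with the overwritten base makes it into the N-Quads.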

This is not a duplicate of #143 as that issue is about a case where the @base is specified.

The problem appears to reside here:

https://github.com/digitalbazaar/pyld/blob/316fbc2c9e25b3cf718b4ee189012a64b91f17e7/lib/pyld/jsonld.py#L3186-L3202

There, in the absence of a @base (as opposed to an explicit null base, see https://www.w3.org/TR/json-ld/#base-iri), a default base needs to be set.
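One way to keep "no @base given" apart from "@base explicitly set to null" is a sentinel value. The sketch below is not the pyld implementation and does not use its API; `expand_iri`, `_UNSET`, and `DEFAULT_BASE` are hypothetical names, and the default base is a placeholder.

```python
from urllib.parse import urljoin

# Sentinel meaning "no @base was given", distinct from an explicit null.
_UNSET = object()

# Hypothetical application-specific default base.
DEFAULT_BASE = "https://example.invalid/"


def expand_iri(value, base=_UNSET, default_base=DEFAULT_BASE):
    """Expand a possibly relative IRI reference against a base.

    base=None    explicit null @base: the value is left unexpanded
    base=_UNSET  no @base given: fall back to default_base
    base=<str>   explicit @base: resolve against it
    """
    if base is None:
        return value
    if base is _UNSET:
        base = default_base
    return urljoin(base, value)


print(expand_iri("../document-relative", base=None))  # stays relative
print(expand_iri("doc"))                              # uses the default base
print(expand_iri("doc", base="http://example.org/test/"))
```

This keeps the opt-out that the JSON-LD spec allows while making sure that an absent base never silently behaves like a null one.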

RinkeHoekstra commented 10 months ago

I started wondering why the test suite doesn't pick this up, and the explanation is in the runtests.py file:

https://github.com/digitalbazaar/pyld/blob/316fbc2c9e25b3cf718b4ee189012a64b91f17e7/tests/runtests.py#L259-L264

Because the manifest files specify a baseIRI value, every test runs with a base specified. This means that the situation reported in this issue is never exercised by the test suite.

Rewriting the test is not an option because, with an unspecified base IRI, the output will be application-specific.