digitalbazaar / pyld

JSON-LD processor written in Python
https://json-ld.org/

pyld does not inspect Link headers #128

Open alpha-beta-soup opened 4 years ago

alpha-beta-soup commented 4 years ago

Let's say I have this extremely minimal bit of JSON-LD to be expanded with pyld:

>>> import pyld
>>> d = {
...    "@context": "https://schema.org",
...    "@type":"Dataset",
...    "@id":"http://localhost:5000/collections/obs",
...    "url":"http://localhost:5000/collections/obs"
... }
>>> pyld.expand(d)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'pyld' has no attribute 'expand'
>>> pyld.jsonld.expand(d)
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/pyld/documentloader/requests.py", line 72, in loader
    'document': response.json()
  File "/usr/local/lib/python3.8/dist-packages/requests/models.py", line 898, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/pyld/context_resolver.py", line 143, in _fetch_context
    remote_doc = jsonld.load_document(url,
  File "/usr/local/lib/python3.8/dist-packages/pyld/jsonld.py", line 6583, in load_document
    remote_doc = options['documentLoader'](url, options)
  File "/usr/local/lib/python3.8/dist-packages/pyld/documentloader/requests.py", line 100, in loader
    raise JsonLdError(
pyld.jsonld.JsonLdError: ('Could not retrieve a JSON-LD document from the URL.',)
Type: jsonld.LoadDocumentError
Code: loading document failed
Cause: Expecting value: line 1 column 1 (char 0)  File "/usr/local/lib/python3.8/dist-packages/pyld/documentloader/requests.py", line 72, in loader
    'document': response.json()
  File "/usr/local/lib/python3.8/dist-packages/requests/models.py", line 898, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/pyld/jsonld.py", line 163, in expand
    return JsonLdProcessor().expand(input_, options)
  File "/usr/local/lib/python3.8/dist-packages/pyld/jsonld.py", line 870, in expand
    expanded = self._expand(active_ctx, None, document, options,
  File "/usr/local/lib/python3.8/dist-packages/pyld/jsonld.py", line 2302, in _expand
    active_ctx = self._process_context(
  File "/usr/local/lib/python3.8/dist-packages/pyld/jsonld.py", line 3049, in _process_context
    resolved = options['contextResolver'].resolve(active_ctx, local_ctx, options.get('base', ''))
  File "/usr/local/lib/python3.8/dist-packages/pyld/context_resolver.py", line 58, in resolve
    resolved = self._resolve_remote_context(
  File "/usr/local/lib/python3.8/dist-packages/pyld/context_resolver.py", line 108, in _resolve_remote_context
    context, remote_doc = self._fetch_context(active_ctx, url, cycles)
  File "/usr/local/lib/python3.8/dist-packages/pyld/context_resolver.py", line 148, in _fetch_context
    raise jsonld.JsonLdError(
pyld.jsonld.JsonLdError: ('Dereferencing a URL did not result in a valid JSON-LD object. Possible causes are an inaccessible URL perhaps due to a same-origin policy (ensure the server uses CORS if you are using client-side JavaScript), too many redirects, a non-JSON response, or more than one HTTP Link Header was provided for a remote context.',)
Type: jsonld.InvalidUrl
Code: loading remote context failed
Details: {'url': 'https://schema.org', 'cause': JsonLdError('Could not retrieve a JSON-LD document from the URL.')}

If I substitute "https://schema.org" with "https://schema.org/docs/jsonldcontext.jsonld", with the code otherwise unchanged, it correctly prints (as I expected):

>>> [{'@id': 'http://localhost:5000/collections/obs', '@type': ['http://schema.org/Dataset'], 'http://schema.org/url': [{'@id': 'http://localhost:5000/collections/obs'}]}]

However, that then seems to mess up other parsers, including the Google Structured Data Testing Tool.

The root issue seems to be with pyld's remote fetching of contexts: "https://schema.org/" no longer responds with an application/ld+json content type, and instead uses a Link header with rel=alternate and type=application/ld+json. It seems that pyld needs to be updated to handle that case:

$ curl -I https://schema.org/ 
HTTP/2 200 
access-control-allow-credentials: true
access-control-allow-headers: Accept
access-control-allow-methods: GET
access-control-allow-origin: *
access-control-expose-headers: Link
link: </docs/jsonldcontext.jsonld>; rel="alternate"; type="application/ld+json"
date: Fri, 19 Jun 2020 03:17:19 GMT
expires: Fri, 19 Jun 2020 03:27:19 GMT
etag: "G8zMyg"
x-cloud-trace-context: d2d5c536d73ce1590813f8e1018a2ad6
content-type: text/html
server: Google Frontend
content-length: 5100
age: 73
cache-control: public, max-age=600
alt-svc: h3-28=":443"; ma=2592000,h3-27=":443"; ma=2592000,h3-25=":443"; ma=2592000,h3-T050=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q049=":443"; ma=2592000,h3-Q048=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"

If you do curl https://schema.org/ -H "Accept: application/ld+json" you will still get back an HTML response.

Perhaps the cleanest way to implement this would be to check whether a non-JSON-LD response is received and, if so, to look for an appropriate Link header and then make a request to its target.
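To make that suggestion concrete, here is a minimal sketch of the "look for an appropriate Link header" step. It is not pyld's code: find_alternate_jsonld is a hypothetical helper name, and the parser is deliberately simplified (it assumes no commas or angle brackets inside quoted parameter values). It takes the raw Link header value and the URL that was requested, and returns the absolute URL of the JSON-LD alternate, if any.

```python
import re
from urllib.parse import urljoin

JSON_LD = "application/ld+json"


def find_alternate_jsonld(link_header, base_url):
    """Return the absolute URL of the first rel="alternate" JSON-LD link, or None.

    Hypothetical helper, not part of pyld. Simplified parsing: assumes no
    commas or angle brackets appear inside quoted parameter values.
    """
    if not link_header:
        return None
    # Each link value looks like: <target>; param="value"; param="value"
    for target, params in re.findall(r"<([^>]*)>([^,<]*)", link_header):
        attrs = {}
        for param in params.split(";"):
            key, sep, value = param.partition("=")
            if sep:
                attrs[key.strip().lower()] = value.strip().strip('"')
        rels = attrs.get("rel", "").lower().split()
        if "alternate" in rels and attrs.get("type", "").lower() == JSON_LD:
            # The target may be relative (as on schema.org), so resolve it.
            return urljoin(base_url, target)
    return None


header = '</docs/jsonldcontext.jsonld>; rel="alternate"; type="application/ld+json"'
print(find_alternate_jsonld(header, "https://schema.org/"))
```

With the header from the curl output above, the target is relative, which is why it is resolved against the requested URL rather than appended to it.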

alpha-beta-soup commented 4 years ago

After reviewing the source, it turns out that pyld does inspect Link headers, but it calls response.json() first, which raises an exception right before the Link header would be inspected, so it never gets that far. This can possibly be avoided by first checking whether the Content-Type is some kind of JSON (since https://schema.org will respond with HTML). The error suggests that for whatever reason, at the point of the exception, response is None.

pyld.jsonld.JsonLdError: ('Dereferencing a URL did not result in a valid JSON-LD object. Possible causes are an inaccessible URL perhaps due to a same-origin policy (ensure the server uses CORS if you are using client-side JavaScript), too many redirects, a non-JSON response, or more than one HTTP Link Header was provided for a remote context.',)

Why require JSON responses if the Link of type alternate is intended to point to the alternate representation? https://html.spec.whatwg.org/multipage/links.html#rel-alternate

If the alternate keyword is used with the type attribute, it indicates that the referenced document is a reformulation of the current document in the specified format.
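The Content-Type check mentioned in this comment could look roughly like the sketch below. The function name is_json_content_type is hypothetical, and treating any "+json" suffix as JSON is an assumption on my part, not something pyld is known to do.

```python
def is_json_content_type(content_type):
    """Return True if a Content-Type header value names a JSON media type.

    Hypothetical helper: accepts application/json and any "+json" suffix
    type (such as application/ld+json), ignoring parameters like charset.
    """
    if not content_type:
        return False
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/json" or media_type.endswith("+json")


# Only attempt to parse the body as JSON when the server says it is JSON.
for value in ("text/html", "application/ld+json; charset=utf-8"):
    print(value, is_json_content_type(value))
```

A loader could call something like this before response.json() and, when it returns False, fall through to the Link header handling instead of raising.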

davidlehn commented 4 years ago

The Link handling code is in the document loaders right below where that json() call happens. Quite possible that code hadn't been properly tested before. If someone has time to refactor that code to handle Link header in the proper order, that would be great.

alpha-beta-soup commented 4 years ago

@davidlehn I'm trying to learn how the tests are put together, to get a clear failing case before trying to fix the issue. If you can help with that, I'm willing to try and fix it.

I have the existing test suite running (although I get five failures). To that I've added a manifest.json in the root, and two test cases at the root as well.

manifest.json:

{
  "@context": ["context.jsonld", {"@base": "manifest"}],
  "@id": "",
  "@type": "mf:Manifest",
  "name": "JSON-LD Test Suite",
  "description": "This manifest loads some tests for resolving https://github.com/digitalbazaar/pyld/issues/128",
  "sequence": [
    "sample.jsonld",
    "sample2.jsonld"
  ]
}

sample.jsonld

{
    "@context": "https://schema.org",
    "@type":"Dataset",
    "@id":"http://localhost:5000/collections/obs",
    "url":"http://localhost:5000/collections/obs"
}

sample2.jsonld

{
    "@context": "https://schema.org/docs/jsonldcontext.jsonld",
    "@type":"Dataset",
    "@id":"http://localhost:5000/collections/obs",
    "url":"http://localhost:5000/collections/obs"
}

I run the tests in a virtual environment as: python tests/runtests.py ./manifest.jsonld, but the test suite skips them:

/usr/lib/python3/dist-packages/requests/__init__.py:80: RequestsDependencyWarning: urllib3 (1.24.1) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
PyLD Tests
Use -h or --help to view options.

JSON-LD Test Suite: http://localhost:5000/collections/obs: None ... skipped "Test type of ['Dataset']"
JSON-LD Test Suite: http://localhost:5000/collections/obs: None ... skipped "Test type of ['Dataset']"

----------------------------------------------------------------------
Ran 2 tests in 0.000s

OK (skipped=2)

How can I test these?
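This is not an answer about the pyld test runner itself, but one way to get a reproducible failing case without depending on schema.org is a local stub server that behaves the same way: HTML at the root with a Link header, and JSON-LD at the linked path. The sketch below uses only the standard library. FakeSchemaOrgHandler, start_fake_server, and fetch are hypothetical names, and the context body is a tiny made-up fixture, not the real schema.org context.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical fixture data: a tiny context, not the real schema.org one.
CONTEXT_PATH = "/docs/jsonldcontext.jsonld"
CONTEXT_BODY = json.dumps(
    {"@context": {"url": {"@id": "http://schema.org/url", "@type": "@id"}}}
).encode("utf-8")
HTML_BODY = b"<!DOCTYPE html><html><body>Not JSON</body></html>"


class FakeSchemaOrgHandler(BaseHTTPRequestHandler):
    """Serves JSON-LD at CONTEXT_PATH; everything else is HTML plus a Link header."""

    def do_GET(self):
        is_context = self.path == CONTEXT_PATH
        body = CONTEXT_BODY if is_context else HTML_BODY
        self.send_response(200)
        self.send_header("Content-Type", "application/ld+json" if is_context else "text/html")
        if not is_context:
            self.send_header(
                "Link", '<%s>; rel="alternate"; type="application/ld+json"' % CONTEXT_PATH
            )
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging during tests


def start_fake_server():
    """Start the stub on a free localhost port; the caller must call shutdown()."""
    server = HTTPServer(("127.0.0.1", 0), FakeSchemaOrgHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(server, path):
    """Return (content_type, link_header, body) for a GET against the stub."""
    conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        return response.getheader("Content-Type"), response.getheader("Link"), response.read()
    finally:
        conn.close()
```

A failing case would then point "@context" at "http://127.0.0.1:<port>/" of the running stub and call pyld.jsonld.expand on the document; while the bug is present, I would expect the same JsonLdError as in the traceback above.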

mathiasrichter commented 2 years ago

The Link handling code is in the document loaders right below where that json() call happens. Quite possible that code hadn't been properly tested before. If someone has time to refactor that code to handle Link header in the proper order, that would be great.

Wouldn't a possible fix be as follows:

After performing the initial request which returns a response with alternate link headers:

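            # Needs: from urllib.parse import urljoin
            # response.links is the parsed Link header; it is empty if the header is absent.
            alternate = response.links.get('alternate')
            if alternate and alternate.get('type') == 'application/ld+json':
                # The link target may be relative, so resolve it against the response URL.
                alt_url = urljoin(response.url, alternate['url'])
                response = requests.get(alt_url, headers=headers, **kwargs)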