digitalbazaar / pyld

JSON-LD processor written in Python
https://json-ld.org/

Could not retrieve a JSON-LD document from the URL. #133

Open sharpaper opened 4 years ago

sharpaper commented 4 years ago

For some reason, pyld is unable to fetch the context URL https://www.w3.org/ns/activitystreams. I reduced the issue to the following minimal example, which fails:

import pyld
doc = {
    "@context": [ 'https://www.w3.org/ns/activitystreams' ],
    "type": "Follow" }
doc = pyld.jsonld.expand(doc)

[...]
raise JSONDecodeError("Expecting value", s, err.value) from None
[...]
Dereferencing a URL did not result in a valid JSON-LD object.
[...]
Type: jsonld.InvalidUrl
Code: loading remote context failed
Details: {'url': 'https://www.w3.org/ns/activitystreams', 'cause': JsonLdError('Could not retrieve a JSON-LD document from the URL.')}

I could not debug the issue further. It looks like pyld might be retrieving the context without the correct headers, but in the code I do see headers = { 'Accept': ...} defined in several places. Could you please help me understand whether this is a bug or whether I'm using the library incorrectly? Thanks!

sharpaper commented 4 years ago

In case this is useful: when the "document loader" sends the .get() request to retrieve the remote context (see L63), the parameters are:

url = https://www.w3.org/ns/activitystreams
headers = {'Accept': 'application/ld+json;profile=http://www.w3.org/ns/json-ld#context, application/ld+json, application/json;q=0.5, text/html;q=0.8, application/xhtml+xml;q=0.8'}
kwargs = {}

Because the request also accepts HTML, the server selected HTML, and pyld therefore raises the error. Running curl against the URL with 'Accept': 'application/ld+json;profile=http://www.w3.org/ns/json-ld#context, application/ld+json, application/json;q=0.5' (that is, with the HTML options removed from Accept) does work. So I think the website might be at fault here, because application/ld+json should take precedence over text/html;q=0.8. On the other hand, why is pyld requesting HTML? Can HTML be removed from the Accept header somehow?

sharpaper commented 4 years ago

After changing the q value of text/html here to 0.5 (that is, text/html;q=0.5), pyld fetches the document correctly. With anything above 0.5 it fails.
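
To make this experiment repeatable outside pyld, here is a minimal sketch that rebuilds the Accept header from the previous comment with a configurable q-value for the HTML types, and reports which media type a server returns for it. The helper names `build_accept` and `content_type_for` are hypothetical and not part of pyld; only the standard library is used, and the W3C server may respond differently today than it did when this was reported.

```python
import urllib.request

# The JSON-LD part of the Accept header quoted in the previous comment.
JSONLD_TYPES = (
    "application/ld+json;profile=http://www.w3.org/ns/json-ld#context, "
    "application/ld+json, application/json;q=0.5"
)


def build_accept(html_q=None):
    """Return the Accept header, optionally offering HTML at weight html_q."""
    if html_q is None:
        # No HTML types at all, as in the curl test described above.
        return JSONLD_TYPES
    return f"{JSONLD_TYPES}, text/html;q={html_q}, application/xhtml+xml;q={html_q}"


def content_type_for(url, accept):
    """Fetch url with the given Accept header and return the response media type."""
    request = urllib.request.Request(url, headers={"Accept": accept})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.headers.get_content_type()
```

Calling `content_type_for("https://www.w3.org/ns/activitystreams", build_accept(0.8))` and then the same with `build_accept(0.5)` would show whether the server switches between HTML and JSON-LD as the q-value changes. The functions are only defined here, not called, so nothing touches the network until you invoke them.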

What the heck is going on? I don't understand.

sharpaper commented 4 years ago

Can text/html and application/xhtml+xml be removed entirely from the Accept header? Why are they needed? Shouldn't objects always be retrieved with application/ld+json?

gkellogg commented 4 years ago

W3C have had issues with their content negotiation setup before. Certainly, changing the priority for HTML could be a workaround, but the Accept header is fine.

text/html is included because a processor can extract JSON-LD from HTML, which is arguably where the bulk of JSON-LD on the web lives.
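
As an illustration of what "extracting JSON-LD from HTML" means, the following is a rough sketch that collects the contents of `<script type="application/ld+json">` elements using only the standard library. It is not pyld's own implementation; the class `JsonLdScriptParser`, the function `extract_jsonld`, and the sample page are made up for this example.

```python
import json
from html.parser import HTMLParser


class JsonLdScriptParser(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        type_attr = dict(attrs).get("type") or ""
        # Ignore parameters such as ";profile=..." when matching the type.
        media_type = type_attr.split(";")[0].strip().lower()
        if tag == "script" and media_type == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._chunks)))
            self._in_jsonld = False


def extract_jsonld(html):
    """Return a list of JSON-LD documents embedded in an HTML string."""
    parser = JsonLdScriptParser()
    parser.feed(html)
    parser.close()
    return parser.documents


# Hypothetical page embedding the document from the original report.
page = """<html><head>
<script type="application/ld+json">
{"@context": "https://www.w3.org/ns/activitystreams", "type": "Follow"}
</script>
</head><body>Hello</body></html>"""

print(extract_jsonld(page))
```

A processor that can do this has a reason to advertise text/html in its Accept header, which is why removing it entirely is not an obvious fix.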

@iherman might have a look at the server configuration for activitystreams.

sharpaper commented 4 years ago

Unfortunately, the text/html option is hard-coded into the load_document function. This issue could be circumvented if issue #125 were fixed, since that would allow custom headers to be configured when the document loader is created.

sharpaper commented 4 years ago

OK, I was able to work around this with a custom loader, but issue #125 should really be fixed, because it would make this much simpler by allowing headers to be specified directly in jsonld.set_document_loader(jsonld.requests_document_loader(timeout=..., headers=...)).

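import pyld.jsonld
import pyld.documentloader.requests

def myloader(*args, **kwargs):
    requests_loader = pyld.documentloader.requests.requests_document_loader(*args, **kwargs)

    def loader(url, options=None):
        # Copy the options so the caller's dict is not mutated.
        options = dict(options or {})
        # Ask only for JSON-LD so the server cannot pick the HTML variant.
        options['headers'] = dict(options.get('headers') or {}, Accept='application/ld+json')
        return requests_loader(url, options)

    return loader

pyld.jsonld.set_document_loader(myloader())

iherman commented 4 years ago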

The activitystreams.var file on the W3C site is as follows:

URI: activitystreams

URI: activitystreams.html
Content-Type: text/html

URI: activitystreams.jsonld
Content-Type: application/ld+json; qs=0.5

URI: activitystreams.jsonld
Content-Type: application/json; qs=0.4

this looks o.k. to me...

cc @gkellogg

sharpaper commented 4 years ago

Is it possible that some kind of weights are being evaluated before the response content type is chosen? I have no idea how these q and qs properties are used in practice, but maybe Apache is computing a "score" for each type? Something like this: if the request is

'Accept': 'application/ld+json;profile=http://www.w3.org/ns/json-ld#context, application/ld+json, application/json;q=0.5, text/html;q=0.8, application/xhtml+xml;q=0.8'

and the score is q x qs, then:

text/html = 0.8 x 1.0 = 0.8
application/ld+json = 1.0 x 0.5 = 0.5
application/json = 0.5 x 0.4 = 0.2

This would explain why any value in the Accept header for text/html greater than 0.5 would fail to retrieve the jsonld document, since N x 1.0 = N is always greater (and thus higher priority) than the application/ld+json "score" of 0.5.

If this hypothesis is true, then I must say that it's a very messy situation because the client cannot adjust its weights for every website.
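
The hypothesis can be sketched in a few lines of Python. This is only a model of the q x qs idea described in this comment, not Apache's actual algorithm: it matches media types exactly, ignores wildcards and other negotiation dimensions, and does not model how ties are broken. The function names are made up for this example; the qs values come from the activitystreams.var file quoted earlier in the thread.

```python
def parse_accept(header):
    """Map each media type in an Accept header to its highest q-value."""
    weights = {}
    for item in header.split(","):
        media_type, *params = [part.strip() for part in item.split(";")]
        q = 1.0  # q defaults to 1.0 when not given
        for param in params:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                q = float(value)
        weights[media_type] = max(q, weights.get(media_type, 0.0))
    return weights


# Quality-of-source (qs) values from the activitystreams.var file;
# the HTML variant has no explicit qs, so 1.0 is assumed.
VARIANTS = {
    "text/html": 1.0,
    "application/ld+json": 0.5,
    "application/json": 0.4,
}


def scores(accept_header, variants=VARIANTS):
    """Score each server variant as q (from the client) times qs (from the server)."""
    weights = parse_accept(accept_header)
    return {media_type: weights.get(media_type, 0.0) * qs
            for media_type, qs in variants.items()}


def top_variants(accept_header, variants=VARIANTS):
    """Return every variant sharing the highest score (ties are not resolved)."""
    result = scores(accept_header, variants)
    best = max(result.values())
    return [media_type for media_type, score in result.items() if score == best]


PYLD_ACCEPT = (
    "application/ld+json;profile=http://www.w3.org/ns/json-ld#context, "
    "application/ld+json, application/json;q=0.5, "
    "text/html;q=0.8, application/xhtml+xml;q=0.8"
)

print(scores(PYLD_ACCEPT))
print(top_variants(PYLD_ACCEPT))
```

With pyld's header the HTML variant wins under this model, and lowering the HTML q-value below 0.5 makes application/ld+json win instead. At exactly 0.5 the two scores tie, so which variant is served would depend on tie-breaking rules that this sketch leaves out.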

iherman commented 4 years ago

To be honest: I do not know either. Maybe somebody with a better knowledge of how Apache works can advise.

sharpaper commented 4 years ago

I think my hypothesis is indeed correct. After looking at the httpd source I found two RFCs (2295 and 2296) that say "The overall quality Q of a variant is the value Q = round5( qs * qt * qc * ql * qf )", where the factors are various quality values. Note that the comments in the httpd source code say that all the quality values are taken from the request headers except for qs. There is also the Apache Negotiation Algorithm, which I think may be a slightly modified version of the one described in the RFCs; in any case, step 2.1 of the algorithm is literally "Multiply the quality factor from the Accept header with the quality-of-source factor for this variants media type, and select the variants with the highest value.".

So the bottom line is that this probably has to be fixed in pyld. In particular, I think it's an issue with the requests loader. If the requests loader cannot handle text/html, then it should either replace the header with its own "application/ld+json" or, as I said, #125 should be fixed so that users can define the headers when creating a new loader.

BigBlueHat commented 8 months ago

@iherman I believe you fixed this at the W3C recently, iirc (or was it for a different context file)?

iherman commented 8 months ago

Almost 😀. The settings have been changed, but not by me; the culprit is @pchampin

pchampin commented 8 months ago

I confirm that this has been fixed