RDFLib / rdflib-jsonld

JSON-LD parser and serializer plugins for RDFLib

HTTP Error 500: Internal Server Error => JSONDecodeError #84

Open teledyn opened 4 years ago

teledyn commented 4 years ago

"it worked yesterday!"

I have a simple json-ld string to load into a Graph and it never occurred to me that this should require urllib, but apparently it does, and I'm guessing the target site is down:

from rdflib import Graph  # import added; the original snippet assumed it

rev="""[  {    "@context": "http://schema.org/",    "@type": "Review",    "@id": "ACC-114341792",
    "name": "Classic and unique",    "author": {      "@type": "Person",      "name": "Ted"    },
    "reviewRating": {      "@type": "Rating",      "ratingValue": 5    },    "reviewBody": "Classic but unique."  }]"""
rg = Graph().parse(data=rev, format="json-ld", publicID='https://www.example.com/')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "./lib/python3.7/site-packages/rdflib/graph.py", line 1043, in parse
    parser.parse(source, self, **args)
  File "./lib/python3.7/site-packages/rdflib_jsonld/parser.py", line 95, in parse
    to_rdf(data, conj_sink, base, context_data)
  File "./lib/python3.7/site-packages/rdflib_jsonld/parser.py", line 107, in to_rdf
    return parser.parse(data, context, dataset)
  File "./lib/python3.7/site-packages/rdflib_jsonld/parser.py", line 140, in parse
    self._add_to_graph(dataset, graph, context, node, topcontext)
  File "./lib/python3.7/site-packages/rdflib_jsonld/parser.py", line 152, in _add_to_graph
    context = context.subcontext(l_ctx)
  File "./lib/python3.7/site-packages/rdflib_jsonld/context.py", line 65, in subcontext
    ctx.load(source)
  File "./lib/python3.7/site-packages/rdflib_jsonld/context.py", line 200, in load
    self._prep_sources(base, source, sources)
  File "./lib/python3.7/site-packages/rdflib_jsonld/context.py", line 213, in _prep_sources
    source = source_to_json(source_url)
  File "./lib/python3.7/site-packages/rdflib_jsonld/util.py", line 23, in source_to_json
    source = create_input_source(source, format='json-ld')
  File "./lib/python3.7/site-packages/rdflib/parser.py", line 186, in create_input_source
    input_source = URLInputSource(absolute_location, format)
  File "./lib/python3.7/site-packages/rdflib/parser.py", line 106, in __init__
    file = urlopen(req)
  File "/usr/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.7/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/lib/python3.7/urllib/request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.7/urllib/request.py", line 563, in error
    result = self._call_chain(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 755, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/usr/lib/python3.7/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/lib/python3.7/urllib/request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.7/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 500: Internal Server Error

only looking at line 186 (naively) in parser.py, it looks like it is taking the data value as a URL?

or am I missing something else? I found reference to the HTTP 500 in the context of https://github.com/linkeddata/rdflib.js/issues/364 which points to a fetch from solid, but I don't know enough about rdflib internals to say for sure.

teledyn commented 4 years ago

it seems this issue started at 20:17.49 EDT May 18th and has broken all of our applications using the json-ld plugin. Is it possible that our repeated use of this plugin has caused us to be banned somewhere?

teledyn commented 4 years ago

as of 13:21.39 EDT May 19th, it appears to be back online, whatever it was, the very same code as above now produces a graph as expected.

So clearly whatever this is, perhaps I need to prevent this fetch by providing my own local copy?

teledyn commented 4 years ago

and my day just keeps getting worse ...

>>> data = '{"@context": "http://schema.org/", "@type": "AggregateRating", "ratingValue": 5, "reviewCount": 1, "bestRating": 5, "@id": "FOO#Rating"}'
>>> g = Graph().parse(data=data, format="json-ld", context="https://www.example.com")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "./lib/python3.7/site-packages/rdflib/graph.py", line 1043, in parse
    parser.parse(source, self, **args)
  File "./lib/python3.7/site-packages/rdflib_jsonld/parser.py", line 95, in parse
    to_rdf(data, conj_sink, base, context_data)
  File "./lib/python3.7/site-packages/rdflib_jsonld/parser.py", line 104, in to_rdf
    context.load(context_data)
  File "./lib/python3.7/site-packages/rdflib_jsonld/context.py", line 200, in load
    self._prep_sources(base, source, sources)
  File "./lib/python3.7/site-packages/rdflib_jsonld/context.py", line 213, in _prep_sources
    source = source_to_json(source_url)
  File "./lib/python3.7/site-packages/rdflib_jsonld/util.py", line 28, in source_to_json
    return json.load(StringIO(stream.read().decode('utf-8')))
  File "/usr/lib/python3.7/json/__init__.py", line 296, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/usr/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

it could be blindness from fatigue, but I'm nearly certain those two lines would have worked before.
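A note on this second traceback: it passes through context.load(context_data), which suggests the URL given as context= is itself fetched and parsed as a JSON-LD context. If that URL serves an HTML page rather than JSON, the standard-library json module fails in just this way. A minimal sketch, where the HTML string is only an assumed stand-in for whatever the URL returned:

```python
import json

# Assumed stand-in for the body served by a non-JSON URL (e.g. an HTML page).
html_body = "<!doctype html><html><body>Example Domain</body></html>"


def try_parse_context(body):
    """Return the parsed JSON, or the decoder's error message if body is not JSON."""
    try:
        return json.loads(body)
    except json.JSONDecodeError as err:
        return str(err)


message = try_parse_context(html_body)
print(message)
```

The message matches the last line of the traceback above, so the failure is consistent with a non-JSON response rather than with a problem in the data string.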

LidaPetr commented 4 years ago

I have exactly the same issue. Worked fine yesterday, and today got the same errors as you.

teledyn commented 4 years ago

it has to be that mystery file that the plugin secretly fetches, some sort of definition spec for json-ld that was lost in their Monday crash and is now corrupted?

I am using rdflib==4.2.2 rdflib-jsonld==0.5.0 Python 3.7.3 (default, Oct 7 2019, 12:56:13) [GCC 8.3.0] on linux (ubuntu 19.10)

but none of this information has changed since last week -- strace does not appear to show any fetch of a remote file (that I can see) but the HTTP 500 errors overnight Monday do seem to indicate otherwise?

here is my test script. I inserted json.loads to verify that the string is valid JSON, and also to show that the failure isn't caused by rdflib-jsonld's use of the json module; I also changed the context to a nonsense but valid URL in case example.com was triggering something

from rdflib import Graph
from rdflib import plugins  # required for json-ld
import json

data = '{"@context": "http://schema.org/", "@type": "AggregateRating", "ratingValue": 5, "reviewCount": 1, "bestRating": 5, "@id": "FOO#Rating"}'
print(json.loads(data))
g = Graph().parse(data=data, format="json-ld", context="https://www.google.com")
for row in g:
    print(row)

This presently outputs the dict as expected, but followed by a UnicodeDecodeError raised from rdflib-jsonld's util.py while decoding the fetched context:

{'@context': 'http://schema.org/', '@type': 'AggregateRating', 'ratingValue': 5, 'reviewCount': 1, 'bestRating': 5, '@id': 'FOO#Rating'}
Traceback (most recent call last):
  File "./jsontest.py", line 9, in <module>
    g = Graph().parse(data=data, format="json-ld", context="https://www.google.com")
  File "./lib/python3.7/site-packages/rdflib/graph.py", line 1043, in parse
    parser.parse(source, self, **args)
  File "./lib/python3.7/site-packages/rdflib_jsonld/parser.py", line 95, in parse
    to_rdf(data, conj_sink, base, context_data)
  File "./lib/python3.7/site-packages/rdflib_jsonld/parser.py", line 104, in to_rdf
    context.load(context_data)
  File "./lib/python3.7/site-packages/rdflib_jsonld/context.py", line 200, in load
    self._prep_sources(base, source, sources)
  File "./lib/python3.7/site-packages/rdflib_jsonld/context.py", line 213, in _prep_sources
    source = source_to_json(source_url)
  File "./lib/python3.7/site-packages/rdflib_jsonld/util.py", line 28, in source_to_json
    return json.load(StringIO(stream.read().decode('utf-8')))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 9718: invalid continuation byte

position 9718??

hsolbrig commented 4 years ago

This is an error on the schema.org site. See: https://tinyurl.com/kk7k5qt (A link to the JSON-LD playground) that uses schema.org as well. Has anyone reported this issue to them???

teledyn commented 4 years ago

@hsolbrig I'm curious, how do you know it is an error on the schema.org site? I don't know the internals of this thing very well (but apparently now is the time to learn) -- I couldn't find an easy means to contact json-ld.org, but I am in their IRC channel right now; there are 21 participants, but I fear these days IRC isn't first on anyone's list as a place to check

teledyn commented 4 years ago

dlongley at json-ld provides the following:

curl -v -H "accept: application/ld+json" "http://schema.org"

is returning nonsense
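For anyone who wants to repeat that check from Python rather than curl, here is a rough equivalent using only urllib. The function names are made up for illustration, and describe_response() needs network access, so it is defined but not called:

```python
import urllib.request


def build_context_request(url="http://schema.org"):
    """Build (but do not send) a request asking for JSON-LD via content negotiation."""
    return urllib.request.Request(url, headers={"Accept": "application/ld+json"})


def describe_response(url="http://schema.org", timeout=10):
    """Send the request; return status, content type, and the first bytes of the body."""
    with urllib.request.urlopen(build_context_request(url), timeout=timeout) as resp:
        return resp.status, resp.headers.get("Content-Type"), resp.read(80)


req = build_context_request()
print(req.full_url, req.get_header("Accept"))
# Call describe_response() by hand to actually hit the network.
```

A Content-Type other than application/ld+json, or a body that does not start with "{", would point at the server rather than at rdflib-jsonld.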

teledyn commented 4 years ago

They are aware: https://github.com/schemaorg/schemaorg/issues/2578

dlongley commented 4 years ago

Note: Libraries/applications should be either caching or installing a version of the schema.org JSON-LD context and loading it locally. A language/ecosystem appropriate package manager could then be used to manage updates to the context. This would mitigate this problem and improve performance for users.

teledyn commented 4 years ago

In a way I am thankful for their downtime overnight Monday because I was unaware that my code was hitting some service for every single json-ld parse job, and in some runs, I can be doing thousands in a very short time. So yes, a Conditional-GET might be in order, done during init, or done once on the first invocation and cached for the lifespan of the process, and if it fails, fall back to a local filesystem cached copy -- after years of using schema.org and rdflib-jsonld, I guess I had never tried to run one on a disconnected machine.
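A minimal sketch of that idea, assuming a saved copy of the context exists on disk: fetch once, keep the result for the life of the process, and fall back to the local file if the fetch or decode fails. The names load_context and fetch_json are hypothetical, not part of rdflib-jsonld, and the conditional GET (ETag / If-None-Match) step is left out for brevity:

```python
import json
import urllib.request
from pathlib import Path

_CONTEXT_CACHE = {}  # url -> parsed context, kept for the life of the process


def fetch_json(url, timeout=10):
    """Default fetcher: GET the URL and parse the body as UTF-8 JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def load_context(url, fallback_path, fetcher=fetch_json):
    """Return the context for url: memory cache first, then network, then local file."""
    if url in _CONTEXT_CACHE:
        return _CONTEXT_CACHE[url]
    try:
        context = fetcher(url)
    except (OSError, ValueError):
        # HTTPError/URLError are OSError subclasses; JSONDecodeError and
        # UnicodeDecodeError are ValueError subclasses.
        context = json.loads(Path(fallback_path).read_text(encoding="utf-8"))
    _CONTEXT_CACHE[url] = context
    return context


# Demo with a stub fetcher (hypothetical) so no network access is needed.
demo = load_context("https://example.invalid/ctx", "jsonldcontext.jsonld",
                    fetcher=lambda url: {"@context": {}})
print(demo)
```

With something like this, thousands of parse jobs in one process would trigger at most one remote request per context URL.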

schemaorg/schemaorg posted a merge into master that roughly correlates with when the HTTP 500 errors stopped and the parsing errors began

update at 16:20 EDT: that curl test now returns HTML again (for a while it was returning binary)

teledyn commented 4 years ago

as we move into day 3 of this outage, and this is probably a naive question but @dlongley is there any alternate source to obtain that context so we can work towards getting our applications back online?

dlongley commented 4 years ago

@teledyn,

If you run this:

curl "https://schema.org/docs/jsonldcontext.jsonld"

You should get the latest schema.org context.

teledyn commented 4 years ago

thanks again -- I see today on the schema.org mailing list that this change in the accept header was intentional, a consequence of blocking a DoS attack, which was likely those HTTP 500 errors I saw Monday night.

Dan Brickley writes:

We expect to replace the HTTP content negotiation with the use of a Link header as specified in the latest JSON-LD specs i.e.

Link: </docs/jsonldcontext.jsonld>; rel="alternate"; type="application/ld+json"

...with a corresponding to CORS too

I'm told there are at least 5 JSON-LD implementations that pass the test for this feature, see https://w3c.github.io/json-ld-api/tests/remote-doc-manifest#tla03
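To make the Link header mechanism concrete, here is a small sketch of how a client might pick the JSON-LD alternate out of such a header and resolve it against the requested URL. The function name is hypothetical and the parsing is a simplification of RFC 8288, not a complete implementation:

```python
import re
from urllib.parse import urljoin


def find_jsonld_alternate(link_header):
    """Return the target of the first rel="alternate" JSON-LD link, or None."""
    # Split on commas that start a new <target>; a simplification of RFC 8288.
    for part in re.split(r",\s*(?=<)", link_header):
        match = re.match(r"\s*<([^>]*)>(.*)", part)
        if not match:
            continue
        target, raw_params = match.groups()
        params = {}
        pattern = r';\s*([\w-]+)\s*=\s*("[^"]*"|[^;,\s]+)'
        for key, value in re.findall(pattern, raw_params):
            params[key.lower()] = value.strip('"')
        rels = params.get("rel", "").lower().split()
        if "alternate" in rels and params.get("type") == "application/ld+json":
            return target
    return None


header = '</docs/jsonldcontext.jsonld>; rel="alternate"; type="application/ld+json"'
target = find_jsonld_alternate(header)
resolved = urljoin("https://schema.org/", target)
print(resolved)
```

For the header quoted above this resolves to the same jsonldcontext.jsonld URL that dlongley suggested fetching directly.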

teledyn commented 4 years ago

Work-around:

I have an ugly fix for those who need this working quickly. I won't submit a PR because this isn't a fix; it's a hack that (in my case) works:

--- lib/python3.7/site-packages/rdflib_jsonld/context.py~   2020-05-21 13:57:18.086689949 -0400
+++ lib/python3.7/site-packages/rdflib_jsonld/context.py    2020-05-21 14:21:43.238345583 -0400
@@ -207,6 +207,8 @@
         for source in inputs:
             if isinstance(source, str):
                 source_url = urljoin(base, source)
+                if "/schema.org" in source_url:
+                    source_url = "jsonldcontext.jsonld"
                 if source_url in referenced_contexts:
                     raise errors.RECURSIVE_CONTEXT_INCLUSION
                 referenced_contexts.add(source_url)

Save that as jsonfix.patch, then (on Linux) run patch -p0 < jsonfix.patch from the base directory. With my (fake) context value removed, my short test above now works as expected. This requires the jsonldcontext.jsonld file to be placed in the current directory.
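A possible alternative that avoids editing site-packages is to swap the remote @context URL for an inline copy before handing the document to the parser. This is only a sketch: inline_contexts is a hypothetical helper, and the tiny context below is a stand-in, not the real schema.org context (in practice the value would be the "@context" member loaded from a saved jsonldcontext.jsonld):

```python
import json


def _swap(ctx, local_contexts):
    """Replace a context URL (or URLs inside a list) with its local copy if known."""
    if isinstance(ctx, str):
        return local_contexts.get(ctx, ctx)
    if isinstance(ctx, list):
        return [_swap(item, local_contexts) for item in ctx]
    return ctx


def inline_contexts(node, local_contexts):
    """Return a copy of a JSON-LD structure with known remote @context URLs inlined."""
    if isinstance(node, list):
        return [inline_contexts(item, local_contexts) for item in node]
    if not isinstance(node, dict):
        return node
    result = {}
    for key, value in node.items():
        if key == "@context":
            result[key] = _swap(value, local_contexts)
        else:
            result[key] = inline_contexts(value, local_contexts)
    return result


# Stand-in context for illustration only.
local = {"http://schema.org/": {"@vocab": "http://schema.org/"}}
doc = json.loads('{"@context": "http://schema.org/", "@type": "AggregateRating", "ratingValue": 5}')
patched = inline_contexts(doc, local)
print(json.dumps(patched))
```

The patched document could then be passed on as Graph().parse(data=json.dumps(patched), format="json-ld"), which should not need a remote fetch for that context.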

Improvements are very much appreciated

datadavev commented 4 years ago

A more general solution may be to support Link headers [1, 2] for resolving the location of external contexts (when not available in a local cache). See also #85.

When parsing JSON-LD, redirect handling actually happens in rdflib in rdflib.parser.URLInputSource.

The parser uses rdflib.parser.create_input_source() (called from util.source_to_json) to get an input source from a URL (such as the context location URL). That in turn relies on rdflib.parser.URLInputSource() which returns an open stream. URLInputSource uses urllib to create the request and return the stream (see line 115). urllib will internally handle redirects which worked previously, but does not now that schema.org is using Link headers.

I made a PR for supporting Link headers (patch available at https://patch-diff.githubusercontent.com/raw/RDFLib/rdflib/pull/1125.patch ) for the specific case of JSON-LD parsing, though this would generally change the behavior of URLInputSource when a Link header is present for the json-ld format. If that is undesirable, a more invasive approach could add an optional parameter to URLInputSource indicating whether to follow Link headers (which could be set when calling parse on remote context URLs).

This change resolved schema.org context parsing for me, and also works with other context documents referenced through link headers.

[1] https://tools.ietf.org/html/rfc8288
[2] https://w3c.github.io/json-ld-api/#remote-document-and-context-retrieval