RDFLib / pySHACL

A Python validator for SHACL
Apache License 2.0

pySHACL cannot read resources from chunked HTTPResponses #166

Closed volkerjaenisch closed 1 year ago

volkerjaenisch commented 1 year ago

Dear PySHACL Developers!

PySHACL fails to open an HTTP resource if the response uses chunked transfer encoding. I propose a quick fix and give additional information below the stack trace.

Code to reproduce:

from pyshacl.rdfutil import load_from_source

load_from_source('http://publications.europa.eu/resource/dataset/planned-availability')

Stacktrace:

Traceback (most recent call last):
  File "/usr/lib/python3.9/xml/sax/expatreader.py", line 217, in feed
    self._parser.Parse(data, isFinal)
xml.parsers.expat.ExpatError: syntax error: line 1, column 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/volker/workspace/PYTHON2/shacl/sandbox/error2.py", line 3, in <module>
    load_from_source('http://publications.europa.eu/resource/dataset/planned-availability')
  File "/home/volker/workspace/venvs/shacl-WIAd6my0-py3.9/lib/python3.9/site-packages/pyshacl/rdfutil/load.py", line 352, in load_from_source
    target_g.parse(source=cast(IO[bytes], _source), format=rdf_format, publicID=public_id)
  File "/home/volker/workspace/venvs/shacl-WIAd6my0-py3.9/lib/python3.9/site-packages/rdflib/graph.py", line 1330, in parse
    parser.parse(source, self, **args)
  File "/home/volker/workspace/venvs/shacl-WIAd6my0-py3.9/lib/python3.9/site-packages/rdflib/plugins/parsers/rdfxml.py", line 604, in parse
    self._parser.parse(source)
  File "/usr/lib/python3.9/xml/sax/expatreader.py", line 111, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python3.9/xml/sax/xmlreader.py", line 125, in parse
    self.feed(buffer)
  File "/usr/lib/python3.9/xml/sax/expatreader.py", line 221, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib/python3.9/xml/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:1:0: syntax error

Process finished with exit code 1

The cause of the problem is in load.py : 161

                filename = resp.geturl()
                fp = resp.fp  # type: BufferedIOBase
                source_was_open = False
                source = open_source = fp

Here the file pointer resp.fp of the response is used as the source for the parser. This works as long as the response is not chunked.

If the response is chunked, the two objects return different data:

resp.read(20) returns b'<rdf:RDF\n xmlns:r'

whereas

resp.fp.read(20) returns b'3ae7\r\n<rdf:RDF\n x'

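This is because fp is the low-level file pointer, which still contains the chunk framing and has to be handled differently when the response is chunked. b'3ae7\r\n' is the chunk header: the size of the following chunk in hexadecimal (0x3ae7 = 15079 bytes), terminated by CRLF.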
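The difference can be reproduced offline. The following is a minimal sketch, not PySHACL code: FakeSocket, RAW_RESPONSE and open_response are hypothetical names, and the response body is made up. It relies on the fact that http.client.HTTPResponse only needs an object with a makefile() method.

```python
import io
from http.client import HTTPResponse


class FakeSocket:
    """Hypothetical stand-in for a socket; HTTPResponse only calls makefile()."""

    def __init__(self, payload: bytes) -> None:
        self._payload = payload

    def makefile(self, *args, **kwargs) -> io.BytesIO:
        return io.BytesIO(self._payload)


# A made-up chunked response: every chunk is "<hex size>\r\n<data>\r\n",
# and a zero-size chunk terminates the body.
RAW_RESPONSE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Transfer-Encoding: chunked\r\n"
    b"\r\n"
    b"5\r\nhello\r\n"
    b"6\r\n world\r\n"
    b"0\r\n\r\n"
)


def open_response() -> HTTPResponse:
    resp = HTTPResponse(FakeSocket(RAW_RESPONSE))
    resp.begin()  # parse the status line and the headers
    return resp


# Keep a reference to each response: a response that is garbage-collected
# closes its underlying fp.
decoded_resp = open_response()
decoded = decoded_resp.read()   # chunk framing is removed by HTTPResponse

raw_resp = open_response()
raw = raw_resp.fp.read()        # chunk framing is still in the data

# The size line from the report above, read as a hexadecimal number.
chunk_size = int("3ae7", 16)
```

Here decoded contains only the payload, while raw still begins with the size line of the first chunk, which is what the RDF/XML parser received in the stack trace above.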

Patching load.py : 161 to

fp = resp

solves the problem for this case.

However, this code is complex, and some other use cases may depend on the raw file pointer.
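To illustrate why passing the response object itself helps, here is a hedged sketch that uses the standard library's xml.etree.ElementTree in place of the rdflib RDF/XML parser. FakeSocket, BODY and open_response are hypothetical names and the XML document is made up; the actual change in PySHACL may look different.

```python
import io
import xml.etree.ElementTree as ET
from http.client import HTTPResponse


class FakeSocket:
    """Hypothetical stand-in for a socket; HTTPResponse only calls makefile()."""

    def __init__(self, payload: bytes) -> None:
        self._payload = payload

    def makefile(self, *args, **kwargs) -> io.BytesIO:
        return io.BytesIO(self._payload)


BODY = b"<root><item/></root>"  # made-up XML payload

# One chunk holding the whole body, followed by the terminating zero-size chunk.
RAW_RESPONSE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Transfer-Encoding: chunked\r\n"
    b"\r\n"
    + b"%x\r\n" % len(BODY) + BODY + b"\r\n"
    + b"0\r\n\r\n"
)


def open_response() -> HTTPResponse:
    resp = HTTPResponse(FakeSocket(RAW_RESPONSE))
    resp.begin()  # parse the status line and the headers
    return resp


# Parsing from the response object: HTTPResponse.read() strips the framing.
good_resp = open_response()
tag_from_response = ET.parse(good_resp).getroot().tag

# Parsing from the raw file pointer: the parser sees the chunk size line first.
bad_resp = open_response()
try:
    ET.parse(bad_resp.fp)
    raw_fp_failed = False
except ET.ParseError:
    raw_fp_failed = True
```

The response object is itself file-like, so a parser that only calls read() can consume it directly. Whether other code paths in load.py need attributes that only the raw file pointer provides is not covered by this sketch.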

Cheers,

Volker

ashleysommer commented 1 year ago

Thank you. This is indeed a bug; I never tested this code with chunked HTTP responses. I will incorporate this fix into the next version of PySHACL.

ashleysommer commented 1 year ago

Hi @volkerjaenisch, PySHACL v0.21.0 has finally been released, and it contains a fix for this issue. Sorry for the long delay on this fix.