Problem parsing a file with \xc3\xa9

moissinac commented 5 years ago

I have a big file (800 MB) with lines like this one:

<https://data.datatourisme.gouv.fr/fc3d6762-6097-3d12-945d-d008e3e038cd> <https://data.datatourisme.gouv.fr/fc3d6762-6097-3d12-945d-d008e3e038cd> <http://purl.org/dc/elements/1.1/description> "Meubl\xc3\xa9 de tourisme de grande qualit\xc3\xa9 am\xc3\xa9nag\xc3\xa9 dans une grange typique du secteur. grand s\xc3\xa9jour, cuisine, 6 chambres, 7 salles d\'eau, 7 WC. Piscine privative, sauna, jacuzzi, salle de jeux avec billard."@fr .<https://data.datatourisme.gouv.fr/38/c90d3edd-7fed-3849-ad54-f9522861f89a> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.datatourisme.fr/ontology/core/1.0#PointOfInterest> .

rdflib 4.2.2 fails to parse my big file. To understand the problem (and possibly work around it), I created a file containing just the previous line (http://givingsense.eu/moissinac/pourTestCopieTeralab2.nt). rdflib 4.2.2 fails to parse that file as well. RDFTranslator, which uses rdflib 4.1.2, fails on the first \xc3\xa9 sequence. So far I have been unable to solve the problem. I parse the file with this line:

srcGraph.parse(filepath, format="nt")

tgbugs commented 5 years ago

Which version of python?

moissinac commented 5 years ago

python3

AxelNennker commented 5 years ago

The file I downloaded from http://givingsense.eu/moissinac/pourTestCopieTeralab2.nt has all the triples on a single line. This can't work. Splitting the triples onto separate lines makes it parsable:

<https://data.datatourisme.gouv.fr/fc3d6762-6097-3d12-945d-d008e3e038cd> <http://purl.org/dc/elements/1.1/description> "Meubl\u00E9 de tourisme de grande qualit\u00E9 am\u00E9nag\u00E9 dans une grange typique du secteur. grand s\u00E9jour, cuisine, 6 chambres, 7 salles d\u0027eau, 7 WC. Piscine privative, sauna, jacuzzi, salle de jeux avec billard."@fr .
<https://data.datatourisme.gouv.fr/38/c90d3edd-7fed-3849-ad54-f9522861f89a> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.datatourisme.fr/ontology/core/1.0#PointOfInterest> .

I think the problem with the file is that after the '.' at the end of the triple there is no whitespace before the next triple.
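The splitting described above could be sketched roughly as follows. This is a minimal sketch and not part of rdflib: `split_statements` is a hypothetical helper, the example.org IRIs are placeholders, and it assumes the input has no comments and no blank node labels containing dots. It treats a '.' as a statement terminator only when it appears outside a quoted literal and outside an <...> IRI.

```python
def split_statements(text):
    """Split N-Triples-like text into statements, one per list item.

    A '.' ends a statement only when it is outside a quoted literal
    and outside an <...> IRI. Assumption: no comments and no blank
    node labels containing dots appear in the input.
    """
    statements = []
    current = []
    in_literal = False  # inside "..."
    in_iri = False      # inside <...>
    escaped = False     # previous character was a backslash in a literal

    for ch in text:
        current.append(ch)
        if in_literal:
            if escaped:
                escaped = False          # this character is escaped, skip it
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_literal = False
        elif in_iri:
            if ch == ">":
                in_iri = False
        elif ch == '"':
            in_literal = True
        elif ch == "<":
            in_iri = True
        elif ch == ".":
            # Statement terminator: flush what has been collected so far.
            statements.append("".join(current).strip())
            current = []

    tail = "".join(current).strip()
    if tail:
        statements.append(tail)          # unterminated trailing text, kept as-is
    return statements


# Placeholder data: two statements glued together without a newline.
sample = (
    '<http://example.org/s> <http://example.org/p> "grand s\\u00E9jour. 6 chambres"@fr .'
    '<http://example.org/s> <http://example.org/q> <http://example.org/o> .'
)
for statement in split_statements(sample):
    print(statement)
```

Writing the returned statements back out joined with newlines would give a file with one triple per line, which is the layout that parsed in the test above.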

from rdflib.plugins.parsers.ntriples import NTriplesParser

# Open the file in binary mode and hand the file object to the parser.
with open('pourTestCopieTeralab2.nt', 'rb') as f:
    p = NTriplesParser()  # no sink given, so the parser's default sink is used
    sink = p.parse(f)  # takes a file object; use parsestring() for a string

Replacing \xc3\xa9 with \u00E9 makes no difference.
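For anyone who wants to try the same replacement on a larger file, it could be done mechanically along these lines. This is only a sketch under stated assumptions: `byte_escapes_to_uchar` is a hypothetical helper, it assumes each run of \xNN escapes encodes valid UTF-8 bytes (it raises `UnicodeDecodeError` otherwise), and it does not handle a backslash that is itself escaped before an `x`. As noted above, the replacement alone does not make the one-line file parse.

```python
import re

# One or more consecutive \xNN escapes, e.g. the two-byte run \xc3\xa9.
_BYTE_RUN = re.compile(r"(?:\\x[0-9a-fA-F]{2})+")


def byte_escapes_to_uchar(text):
    """Rewrite runs of \\xNN byte escapes as \\uXXXX / \\UXXXXXXXX escapes.

    Assumption: every run of \\xNN escapes is a valid UTF-8 byte sequence.
    """
    def repl(match):
        hex_pairs = re.findall(r"\\x([0-9a-fA-F]{2})", match.group(0))
        decoded = bytes(int(h, 16) for h in hex_pairs).decode("utf-8")
        # Emit a 4-digit escape for BMP characters, 8-digit otherwise.
        return "".join(
            "\\u%04X" % ord(c) if ord(c) <= 0xFFFF else "\\U%08X" % ord(c)
            for c in decoded
        )

    return _BYTE_RUN.sub(repl, text)


print(byte_escapes_to_uchar(r"Meubl\xc3\xa9 de tourisme"))
```

Decoding whole runs rather than single escapes matters here, because one character such as é is spread over two \xNN escapes.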

ghost commented 2 years ago

The contents of the URL have shrunk a bit but still exhibit the same problem.

The W3C N-Triples specification is clear: “N-Triples triples are a sequence of RDF terms representing the subject, predicate and object of an RDF Triple. These may be separated by white space (spaces U+0020 or tabs U+0009). This sequence is terminated by a '.' and a new line (optional at the end of a document).”

Because the triples are not separated by new lines, the content at that URL is not valid N-Triples. A parse failure is therefore expected and is not really an RDFLib issue per se, so I am closing this.

RDFLib / rdflib

Problem parsing a file with \xc3\xa9 #878