RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License

raise BadSyntax(self._thisDoc, self.lines, argstr, i, msg) rdflib.plugins.parsers.notation3.BadSyntax: #1715

Closed keloemma closed 2 years ago

keloemma commented 2 years ago

I tried to parse a file in Turtle format, but I am getting this error and cannot figure out how to solve it:

Link to the database: http://kaiko.getalp.org/about-dbnary/download/

   Traceback (most recent call last):
  File "/gpfs7kw/linkhome/rech/genlig01//test/expe_5/dbnary_corpus/extract.py", line 25, in <module>
    result = g.parse('fr_dbnary_ontolex.ttl', format='n3')
  File "/linkhome/rech/genlig01//.conda/envs/bert/lib/python3.9/site-packages/rdflib/graph.py", line 1078, in parse
    parser.parse(source, self, **args)
  File "/linkhome/rech/genlig01//.conda/envs/bert/lib/python3.9/site-packages/rdflib/plugins/parsers/notation3.py", line 1915, in parse
    TurtleParser.parse(self, source, conj_graph, encoding, turtle=False)
  File "/linkhome/rech/genlig01/u/.conda/envs/bert/lib/python3.9/site-packages/rdflib/plugins/parsers/notation3.py", line 1886, in parse
    p.loadStream(source.getByteStream())
  File "/linkhome/rech/genlig01/u/.conda/envs/bert/lib/python3.9/site-packages/rdflib/plugins/parsers/notation3.py", line 442, in loadStream
    return self.loadBuf(stream.read())    # Not ideal
  File "/linkhome/rech/genlig01/umg16uw/.conda/envs/bert/lib/python3.9/site-packages/rdflib/plugins/parsers/notation3.py", line 448, in loadBuf
    self.feed(buf)
  File "/linkhome/rech/genlig01//.conda/envs/bert/lib/python3.9/site-packages/rdflib/plugins/parsers/notation3.py", line 474, in feed
    i = self.directiveOrStatement(s, j)
  File "/linkhome/rech/genlig01//.conda/envs/bert/lib/python3.9/site-packages/rdflib/plugins/parsers/notation3.py", line 495, in directiveOrStatement
    j = self.statement(argstr, i)
  File "/linkhome/rech/genlig01//.conda/envs/bert/lib/python3.9/site-packages/rdflib/plugins/parsers/notation3.py", line 733, in statement
    j = self.property_list(argstr, i, r[0])
  File "/linkhome/rech/genlig01//.conda/envs/bert/lib/python3.9/site-packages/rdflib/plugins/parsers/notation3.py", line 1096, in property_list
    self.BadSyntax(argstr, j,
  File "/linkhome/rech/genlig01//.conda/envs/bert/lib/python3.9/site-packages/rdflib/plugins/parsers/notation3.py", line 1623, in BadSyntax
    raise BadSyntax(self._thisDoc, self.lines, argstr, i, msg)
rdflib.plugins.parsers.notation3.BadSyntax: at line 2093929 of <>:
Bad syntax (objectList expected) at ^ in:
"...b'rotecteur."@fr ] .\n\n<http://kaiko.getalp.org/dbnary/fra/__tr'^b'_eng_1_facteur_d\xe2\x80\x99att\xc3\xa9nuation__nom__1>\n        rdf:type   '..."

I searched the file, but the line in question is written correctly, so I do not know what to do. I have to parse the file.

Here is the code:

# read specific columns of csv file using Pandas

import csv
import os
import pprint

import rdflib
from rdflib import Graph, Literal, RDF, URIRef
# rdflib knows about quite a few popular namespaces, like W3C ontologies, schema.org etc.
from rdflib.namespace import FOAF , XSD

g = rdflib.Graph()

result = g.parse('fr_dbnary_ontolex.ttl', format='n3') # 

q_noun = """

SELECT * WHERE {
       ?lexeme a ontolex:LexicalEntry ;
         rdfs:label ?label ;
         lexinfo:partOfSpeech lexinfo:noun;
         dbnary:synonym   ?syn .
    }

"""
for p, o, s in g.query(q_noun):
      with open("dbnary_synonym.tsv", 'a', encoding='utf-8') as f:
          f.write(p + "\t" + o + "\t" + s + '\n')

I changed the format to turtle, but the error still occurs. I checked the file and it is written correctly.
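
Separately from the parse error, the loop above reopens `dbnary_synonym.tsv` in append mode for every result row. A minimal sketch of writing the rows through one open handle with the standard `csv` module follows. The helper name `write_tsv` and the sample rows are hypothetical; plain tuples stand in for the rows that `g.query(...)` would yield.

```python
import csv
import io


def write_tsv(rows, handle):
    """Write an iterable of row tuples to an already open text handle as TSV."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    count = 0
    for row in rows:
        # str() turns RDFLib terms (URIRef, Literal) into plain text.
        writer.writerow([str(value) for value in row])
        count += 1
    return count


# Hypothetical rows standing in for SPARQL results (lexeme, label, synonym).
sample_rows = [
    ("http://kaiko.getalp.org/dbnary/fra/militaire__adj__1", "militaire",
     "http://kaiko.getalp.org/dbnary/fra/martial"),
    ("http://kaiko.getalp.org/dbnary/fra/militaire__adj__1", "militaire",
     "http://kaiko.getalp.org/dbnary/fra/guerrier"),
]

# An in-memory buffer keeps the sketch self-contained; a real script would use
# open("dbnary_synonym.tsv", "w", encoding="utf-8", newline="") instead.
buffer = io.StringIO()
written = write_tsv(sample_rows, buffer)
print(written, "rows written")
```

Opening the file once also avoids silently appending to the results of a previous run.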

The file looks like this (a small excerpt, since the file is huge):


@prefix dbetym:   <http://etytree-virtuoso.wmflabs.org/dbnaryetymology#> .
@prefix dbnary:   <http://kaiko.getalp.org/dbnary#> .
@prefix dbstats:  <http://kaiko.getalp.org/dbnary/statistics/> .
@prefix dcterms:  <http://purl.org/dc/terms/> .
@prefix decomp:   <http://www.w3.org/2002/07/owl#> .
@prefix fra:      <http://kaiko.getalp.org/dbnary/fra/> .
@prefix lexinfo:  <http://www.lexinfo.net/ontology/2.0/lexinfo#> .
@prefix lexvo:    <http://lexvo.org/id/iso639-3/> .
@prefix lime:     <http://www.w3.org/ns/lemon/lime#> .
@prefix olia:     <http://purl.org/olia/olia.owl#> .
@prefix ontolex:  <http://www.w3.org/ns/lemon/ontolex#> .
@prefix qb:       <http://purl.org/linked-data/cube#> .
@prefix rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos:     <http://www.w3.org/2004/02/skos/core#> .
@prefix synsem:   <http://www.w3.org/ns/lemon/synsem#> .
@prefix vartrans: <http://www.w3.org/ns/lemon/vartrans#> .
@prefix wikt:     <https://fr.wiktionary.org/wiki/> .
@prefix xs:       <http://www.w3.org/2001/XMLSchema#> .
fra:accueil__nom__1  rdf:type  ontolex:Word , ontolex:LexicalEntry ;
        rdfs:label             "accueil"@fr ;
        dbnary:partOfSpeech    "-nom-" ;
        dbnary:synonym         fra:home , fra:main_page , <http://kaiko.getalp.org/dbnary/fra/page_d’accueil> ;
        dcterms:language       lexvo:fra ;
        lexinfo:partOfSpeech   lexinfo:noun ;
        lime:language          "fr" ;
        ontolex:canonicalForm  fra:__cf_accueil__nom__1 ;
        ontolex:sense          fra:__ws_1_accueil__nom__1 , fra:__ws_2_accueil__nom__1 , fra:__ws_3_accueil__nom__1 , fra:__ws_4_accueil__nom__1 , fra:__ws_5_accueil__nom__1 .
fra:__cf_accueil__nom__1
        rdf:type             ontolex:Form ;
        lexinfo:gender       lexinfo:masculine ;
        ontolex:phoneticRep  "a.kœj"@fr-fonipa ;
        ontolex:writtenRep   "accueil"@fr .
fra:lire__verb__1  rdf:type    ontolex:Word , ontolex:LexicalEntry ;
        rdfs:label             "lire"@fr ;
        dbnary:partOfSpeech    "-verb-" ;
        dbnary:synonym         fra:lire ;
        dcterms:language       lexvo:fra ;
        lexinfo:partOfSpeech   lexinfo:verb ;
        lime:language          "fr" ;
        ontolex:canonicalForm  fra:__cf_lire__verb__1 ;
        ontolex:sense          fra:__ws_1_lire__verb__1 , fra:__ws_2_lire__verb__1 , fra:__ws_3_lire__verb__1 , fra:__ws_4_lire__verb__1 , fra:__ws_5_lire__verb__1 , fra:__ws_6_lire__verb__1 , fra:__ws_7_lire__verb__1 , fra:__ws_8_lire__verb__1 , fra:__ws_9_lire__verb__1 .      
fra:meuble__adj__1  rdf:type   ontolex:Word , ontolex:LexicalEntry ;
        rdfs:label             "meuble"@fr ;
        dbnary:antonym         fra:dur , fra:solide , fra:immeuble ;
        dbnary:partOfSpeech    "-adj-" ;
        dcterms:language       lexvo:fra ;
        lexinfo:partOfSpeech   lexinfo:adjective ;
        lime:language          "fr" ;
        ontolex:canonicalForm  fra:__cf_meuble__adj__1 ;
        ontolex:sense          fra:__ws_1_meuble__adj__1 , fra:__ws_2_meuble__adj__1 .
fra:militaire__adj__1
        rdf:type               ontolex:Word , ontolex:LexicalEntry ;
        rdfs:label             "militaire"@fr ;
        dbnary:antonym         fra:civil ;
        dbnary:partOfSpeech    "-adj-" ;
        dbnary:synonym         fra:martial , fra:guerrier ;
        dcterms:language       lexvo:fra ;
        lexinfo:partOfSpeech   lexinfo:adjective ;
        lime:language          "fr" ;
        ontolex:canonicalForm  fra:__cf_militaire__adj__1 ;
        ontolex:sense          fra:__ws_1_militaire__adj__1 , fra:__ws_2_militaire__adj__1 , fra:__ws_3_militaire__adj__1 .

fra:mercredi__adv__1  rdf:type  ontolex:Word , ontolex:LexicalEntry ;
        rdfs:label             "mercredi"@fr ;
        dbnary:partOfSpeech    "-adv-" ;
        dbnary:synonym         fra:civil1 ;
        dcterms:language       lexvo:fra ;
        lexinfo:partOfSpeech   lexinfo:adverb ;
        lime:language          "fr" ;
        ontolex:canonicalForm  fra:__cf_mercredi__adv__1 ;
        ontolex:sense          fra:__ws_1_mercredi__adv__1 .

fra:mercredi__adv__1  rdf:type  ontolex:Word , ontolex:LexicalEntry ;
        rdfs:label             "mercredi"@fr ;
        dbnary:synonym         fra:civil2 ;
        dbnary:partOfSpeech    "-adv-" ;
        dcterms:language       lexvo:fra ;
        lexinfo:partOfSpeech   lexinfo:adverb ;
        lime:language          "fr" ;
        ontolex:canonicalForm  fra:__cf_mercredi__adv__1 ;
        ontolex:sense          fra:__ws_1_mercredi__adv__1 .

<http://kaiko.getalp.org/dbnary/fra/page_d’accueil>
        rdf:type          dbnary:Page ;
        dbnary:describes  <http://kaiko.getalp.org/dbnary/fra/page_d’accueil__nom__1> .

<http://kaiko.getalp.org/dbnary/fra/page_d’accueil>
        rdf:type          dbnary:Page ;
        dbnary:describes  <http://kaiko.getalp.org/dbnary/fra/page_d’accueil__nom__1> .
ghost commented 2 years ago

I checked the file and it is written correctly.

I recommend that you use a more capable checker, as both RDFLib and Fuseki disagree with your statement, and with what appears to be good reason:

:: [line: 1533047, col: 73] Illegal character in IRI (Not a ucschar: 0xD83E): <http://kaiko.getalp.org/dbnary/fra/[U+D83E]...>
1533047         dbnary:synonym             <http://kaiko.getalp.org/dbnary/fra/🧙<200d>♀️> ;
keloemma commented 2 years ago

@gjhiggins
OK, I use the local ttl file and do not load it directly from the internet. It seems the line you pointed to is the problematic one, but it is an emoji, and I could only change it in the ttl file. After that I am getting this error:

http://kaiko.getalp.org/dbnary/fra/ramponeau    http://kaiko.getalp.org/dbnary/fra/coup_de_poing__nom__1        coup de poing does not look like a valid URI, trying to serialize this will break.
http://kaiko.getalp.org/dbnary/fra/ramponeau    http://kaiko.getalp.org/dbnary/fra/coup_de_poing__nom__1        coup de poing
ghost commented 2 years ago

@gjhiggins OK, I use the local ttl file and do not load it directly from the internet. It seems the line you pointed to is the problematic one, but it is an emoji, and I could only change it in the ttl file. After that I am getting this error:

That's a warning, not an error and if it were the only instance that needed changing, it would likely parse okay.

Unfortunately, the single change you have made is not enough, it's just the beginning ... Fuseki's (trimmed) log advises of (at least) these illegal chars:

[line: 1533047, col: 74]
[line: 1533093, col: 38]
[line: 1533093, col: 39]
[line: 3427344, col: 69]
[line: 3427344, col: 70]
[line: 3427344, col: 114]
[line: 3427344, col: 115]
[line: 3427344, col: 159]
[line: 3427344, col: 160]
[line: 3427438, col: 38]
[line: 3427438, col: 39]
[line: 3427441, col: 38]
[line: 3427441, col: 39]
[line: 3427444, col: 38]
[line: 3427444, col: 39]
[line: 4268570, col: 145]
[line: 4268570, col: 146]
[line: 4268613, col: 38]
[line: 4268613, col: 39]
[line: 4568763, col: 69]
[line: 4568763, col: 70]
[line: 4568803, col: 38]
[line: 4568803, col: 39]
[line: 5149347, col: 121]
[line: 5149347, col: 122]
[line: 7225855, col: 66]
[line: 7225855, col: 67]
[line: 7225867, col: 38]
[line: 7225867, col: 39]
[line: 7249674, col: 69]
[line: 7249674, col: 70]
[line: 7249731, col: 38]
[line: 7249731, col: 39]
[line: 12998536, col: 66]
[line: 12998536, col: 67]
[line: 12998542, col: 38]
[line: 12998542, col: 39]
[line: 14170449, col: 66]
[line: 14170449, col: 67]
[line: 14170452, col: 38]
[line: 14170452, col: 39]
[line: 15303656, col: 77]
[line: 15303656, col: 78]
[line: 15303662, col: 38]
[line: 15303662, col: 39]
[line: 16243319, col: 97]
[line: 16243319, col: 98]
[line: 19308391, col: 66]
[line: 19308391, col: 67]
[line: 19308395, col: 38]
[line: 19308395, col: 39]
[line: 19496314, col: 66]
[line: 19496314, col: 67]
[line: 20058514, col: 66]
[line: 20058514, col: 67]
[line: 20184620, col: 66]
[line: 20184620, col: 67]
[line: 20184624, col: 38]
[line: 20184624, col: 39]

I doubt that list is complete and I'm sorry but I can only suggest that you contact the publishers and inform them of the issue.
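
The `0xD83E` in Fuseki's message is a UTF-16 high surrogate, which suggests that the reported characters lie outside the Basic Multilingual Plane (emoji, for example); the paired column numbers in the list would be consistent with surrogate pairs. A rough sketch for locating such characters with the standard library follows. `find_non_bmp` is a hypothetical helper and the sample lines are modelled on the stanza quoted in this thread.

```python
def find_non_bmp(lines):
    """Yield (line_no, column, code_point, high_surrogate) for chars above U+FFFF.

    line_no and column are 1-based and count Python code points, so columns
    can differ from tools that count UTF-16 code units.
    """
    for line_no, line in enumerate(lines, start=1):
        for column, ch in enumerate(line, start=1):
            if ord(ch) > 0xFFFF:
                # The first two bytes of the UTF-16 big-endian encoding
                # are the high surrogate.
                high = ch.encode("utf-16-be")[:2].hex().upper()
                yield line_no, column, f"U+{ord(ch):04X}", f"0x{high}"


sample = [
    'rdfs:label "magicienne"@fr ;\n',
    "dbnary:synonym <http://kaiko.getalp.org/dbnary/fra/"
    "\U0001F9D9\u200D\u2640\uFE0F> ;\n",
]

hits = list(find_non_bmp(sample))
for hit in hits:
    print(hit)
```

For the real file, passing an open handle (`open("fr_dbnary_ontolex.ttl", encoding="utf-8")`) instead of `sample` reads it line by line without loading it whole.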

aucampia commented 2 years ago

@keloemma please include the version of RDFLib you use, and if possible include a specific snippet of the file that causes the failure as an attachment so we can eliminate encoding problems?

I tried the following two snippets:

curl --silent http://kaiko.getalp.org/static/ontolex/latest/fr_dbnary_ontolex.ttl.bz2 \
  | bunzip2 - > fr_dbnary_ontolex.ttl
sha256sum fr_dbnary_ontolex.ttl 

sed -n -e '1,20p' -e '2093000,2095000p' fr_dbnary_ontolex.ttl \
  | pipx run --spec git+https://github.com/RDFLib/rdflib.git@master#egg=rdflib \
  rdfpipe -i ttl -o ntriples - > /dev/null

sed -n -e '1,20p' -e '2068698,2068740p' fr_dbnary_ontolex.ttl \
  | pipx run --spec git+https://github.com/RDFLib/rdflib.git@master#egg=rdflib \
  rdfpipe -i ttl -o ntriples - > /dev/null

But both of them work without problem.
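
For anyone without `sed`, the same selection (the prefix header plus a window of lines) can be sketched in Python. `select_lines` is a hypothetical helper; note that a window cut at arbitrary line numbers may begin or end in the middle of a statement.

```python
def select_lines(lines, ranges):
    """Yield lines whose 1-based number falls in any inclusive (start, end) range."""
    last = max(end for _, end in ranges)
    for number, line in enumerate(lines, start=1):
        if number > last:
            break  # nothing further can match, so stop reading early
        if any(start <= number <= end for start, end in ranges):
            yield line


# Stand-in for an open file handle: 100 numbered lines.
numbered = [f"line {n}\n" for n in range(1, 101)]
picked = list(select_lines(numbered, [(1, 3), (50, 51)]))
print("".join(picked), end="")
```

With the real file, `select_lines(open("fr_dbnary_ontolex.ttl", encoding="utf-8"), [(1, 20), (2093000, 2095000)])` would mirror the first `sed` command above.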

aucampia commented 2 years ago
result = g.parse('fr_dbnary_ontolex.ttl', format='n3') #

Is there a reason why you are using n3 for a turtle file? Maybe try with ttl/turtle - n3 is not ratified AFAIK.

aucampia commented 2 years ago

Snippets work fine with n3 also:

sed -n -e '1,20p' -e '2093000,2095000p' fr_dbnary_ontolex.ttl \
  | pipx run --spec git+https://github.com/RDFLib/rdflib.git@master#egg=rdflib \
  rdfpipe -i n3 -o ntriples - > /dev/null

sed -n -e '1,20p' -e '2068698,2068740p' fr_dbnary_ontolex.ttl \
  | pipx run --spec git+https://github.com/RDFLib/rdflib.git@master#egg=rdflib \
  rdfpipe -i n3 -o ntriples - > /dev/null
aucampia commented 2 years ago

The whole file parses fine with turtle:

$ pipx run --spec git+https://github.com/RDFLib/rdflib.git@master#egg=rdflib \
>   rdfpipe -i ttl -o ntriples fr_dbnary_ontolex.ttl > /dev/null
⚠️  rdfpipe is already on your PATH and installed at
    /home/iwana/.local/bin/rdfpipe. Downloading and running anyway.
/home/iwana/.local/pipx/.cache/c3d9a88c982b951/lib64/python3.10/site-packages/rdflib/plugins/serializers/nt.py:36: UserWarning: NTSerializer always uses UTF-8 encoding. Given encoding was: None
  warnings.warn(

Can you double check the integrity of the file? I suspect that the encoding of the file you are processing is messed up:

$ sha256sum fr_dbnary_ontolex.ttl 
44a78a798f17835626f5fc1e433d7b644a01546cc31e12679de5bce79c1e0015  fr_dbnary_ontolex.ttl
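
Where `sha256sum` is not available, the same check can be sketched with Python's `hashlib`, reading the file in chunks so that a multi-gigabyte file does not have to fit in memory. `sha256_of` is a hypothetical helper, and the temporary file only keeps the example self-contained.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a small temporary file standing in for the real download.
demo_bytes = b"@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(demo_bytes)
    demo_path = tmp.name

demo_digest = sha256_of(demo_path)
os.remove(demo_path)
print(demo_digest)
```

Comparing `sha256_of("fr_dbnary_ontolex.ttl")` with the digest shown above would tell whether the local copy matches the one that parsed successfully.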
keloemma commented 2 years ago

Hello, thank you. I put an excerpt from the beginning of the file in my first question. Since the file is huge, the error might be at a different location in the file.

I have another question I

aucampia commented 2 years ago

Presumably the whole file is this http://kaiko.getalp.org/static/ontolex/latest/fr_dbnary_ontolex.ttl.bz2?

I tried parsing it with n3 also and also had no problems:

$ time pipx run --spec git+https://github.com/RDFLib/rdflib.git@master#egg=rdflib rdfpipe -i n3 -o ntriples fr_dbnary_ontolex.ttl > /dev/null
⚠️  rdfpipe is already on your PATH and installed at
    /home/iwana/.local/bin/rdfpipe. Downloading and running anyway.
/home/iwana/.local/pipx/.cache/c3d9a88c982b951/lib64/python3.10/site-packages/rdflib/plugins/serializers/nt.py:36: UserWarning: NTSerializer always uses UTF-8 encoding. Given encoding was: None
  warnings.warn(

real    14m39.046s
user    14m25.563s
sys 0m10.268s

The first ~1000 lines of the file also parse fine:

$ sed -n -e '1,1004p' fr_dbnary_ontolex.ttl \
>   | pipx run --spec git+https://github.com/RDFLib/rdflib.git@master#egg=rdflib \
>   rdfpipe -i turtle -o ntriples - > /dev/null
⚠️  rdfpipe is already on your PATH and installed at
    /home/iwana/.local/bin/rdfpipe. Downloading and running anyway.
/home/iwana/.local/pipx/.cache/c3d9a88c982b951/lib64/python3.10/site-packages/rdflib/plugins/serializers/nt.py:36: UserWarning: NTSerializer always uses UTF-8 encoding. Given encoding was: None
  warnings.warn(

Could you download and unpack the file (http://kaiko.getalp.org/static/ontolex/latest/fr_dbnary_ontolex.ttl.bz2) anew and see if you still have problems? Where did you get your copy?

If you could upload the whole file to https://filebin.net/ or google drive it could also help
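
For the unpacking step, a sketch using Python's `bz2` module in place of `bunzip2`, which streams the data rather than loading it all at once. `unpack_bz2` is a hypothetical helper, and the demo compresses a tiny payload itself so that it runs without the real download.

```python
import bz2
import os
import shutil
import tempfile


def unpack_bz2(src_path, dst_path):
    """Stream-decompress src_path to dst_path without holding it all in memory."""
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


# Demo: compress a small payload, unpack it, and compare the bytes.
workdir = tempfile.mkdtemp()
packed = os.path.join(workdir, "demo.ttl.bz2")
unpacked = os.path.join(workdir, "demo.ttl")
payload = 'fra:accueil__nom__1 rdfs:label "accueil"@fr .\n'.encode("utf-8")

with open(packed, "wb") as handle:
    handle.write(bz2.compress(payload))

unpack_bz2(packed, unpacked)
with open(unpacked, "rb") as handle:
    roundtrip = handle.read()

shutil.rmtree(workdir)
print(roundtrip == payload)
```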

keloemma commented 2 years ago

Hello, yes, I used this link (http://kaiko.getalp.org/static/ontolex/latest/fr_dbnary_ontolex.ttl.bz2) to download the file directly to my computer. I will download it again and retry parsing it. I will also upload it to Google Drive.

aucampia commented 2 years ago

Once downloaded can you run rdfpipe on it also?

rdfpipe -i turtle -o n3 fr_dbnary_ontolex.ttl > /dev/null

It will take long but I just want to see if the problem is somehow different when parsing with rdfpipe (it should not be).

Also, still need your version of RDFLib, and your previous message cut off here:

I have another question I

If you have another question it got lost.

keloemma commented 2 years ago

My version of rdflib is 5.0.0

aucampia commented 2 years ago

My version of rdflib is 5.0.0

Could you try upgrade to the latest 6.1.1? I am fairly sure you are hitting a bug which was already fixed.

keloemma commented 2 years ago
sed -n -e '1,20p' -e '2093000,2095000p' fr_dbnary_ontolex.ttl \
  | pipx run --spec git+https://github.com/RDFLib/rdflib.git@master#egg=rdflib \
  rdfpipe -i ttl -o ntriples - > /dev/null

Hello, I tried this:

emmanuelle@LAPTOP-S18ARGED:/mnt/c/Users/Emmanuelle$ sed -n -e '1,20p' -e '2093000,2095000p' fr_dbnary_ontolex.ttl \ | pipx run --spec git+https://github.com/RDFLib/rdflib.git@master#egg=rdflib \ rdfpipe -i ttl -o ntriples - > /dev/null
usage: pipx run [-h] [--no-cache] [--pypackages] [--spec SPEC] [--verbose] [--python PYTHON] [--system-site-packages] [--index-url INDEX_URL] [--editable] [--pip-args PIP_ARGS] binary [binary_args [binary_args ...]]
pipx run: error: the following arguments are required: binary

This is the link to the file I have: https://drive.google.com/drive/folders/19e-YPkALcOV_rdlcaZkBTM5iD-utRPAq?usp=sharing

aucampia commented 2 years ago

Your pipx is likely out of date, try:

rdfpipe --version
rdfpipe -i turtle -o n3 fr_dbnary_ontolex.ttl > /dev/null

It will take a long time though.

ghost commented 2 years ago

My version of rdflib is 5.0.0

Could you try upgrade to the latest 6.1.1? I am fairly sure you are hitting a bug which was already fixed.

idk, line 1533047 puts you right in the middle of an offending stanza of that file. I copied the prefixes and stanza to a small file (fr_dbnary_ontolex_snippet.ttl) and rdfpipe reported BadSyntax as expected.

aucampia commented 2 years ago

My version of rdflib is 5.0.0

Could you try upgrade to the latest 6.1.1? I am fairly sure you are hitting a bug which was already fixed.

idk, line 1533047 puts you right in the middle of an offending stanza of that file. I copied the prefixes and stanza to a small file (fr_dbnary_ontolex_snippet.ttl) and rdfpipe reported BadSyntax as expected.

I can parse the whole file I got from http://kaiko.getalp.org/static/ontolex/latest/fr_dbnary_ontolex.ttl.bz2 with rdflib (from master branch) using rdfpipe without problem both as n3 and ttl. I suspect something goes wrong somewhere else.

aucampia commented 2 years ago

I copied the prefixes and stanza to a small file (fr_dbnary_ontolex_snippet.ttl) and rdfpipe reported BadSyntax as expected.

As far as I can tell this should be equivalent to:

$ sed -n -e '1,20p' -e '1533041,1533053p' fr_dbnary_ontolex.ttl
@prefix dbetym:   <http://etytree-virtuoso.wmflabs.org/dbnaryetymology#> .
@prefix dbnary:   <http://kaiko.getalp.org/dbnary#> .
@prefix dbstats:  <http://kaiko.getalp.org/dbnary/statistics/> .
@prefix dcterms:  <http://purl.org/dc/terms/> .
@prefix decomp:   <http://www.w3.org/2002/07/owl#> .
@prefix fra:      <http://kaiko.getalp.org/dbnary/fra/> .
@prefix lexinfo:  <http://www.lexinfo.net/ontology/2.0/lexinfo#> .
@prefix lexvo:    <http://lexvo.org/id/iso639-3/> .
@prefix lime:     <http://www.w3.org/ns/lemon/lime#> .
@prefix olia:     <http://purl.org/olia/olia.owl#> .
@prefix ontolex:  <http://www.w3.org/ns/lemon/ontolex#> .
@prefix qb:       <http://purl.org/linked-data/cube#> .
@prefix rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos:     <http://www.w3.org/2004/02/skos/core#> .
@prefix synsem:   <http://www.w3.org/ns/lemon/synsem#> .
@prefix vartrans: <http://www.w3.org/ns/lemon/vartrans#> .
@prefix wikt:     <https://fr.wiktionary.org/wiki/> .
@prefix xs:       <http://www.w3.org/2001/XMLSchema#> .

fra:magicienne__nom__1
        rdf:type                   ontolex:Word , ontolex:LexicalEntry ;
        rdfs:label                 "magicienne"@fr ;
        dbnary:approximateSynonym  fra:ensorceleuse , fra:sorcière ;
        dbnary:partOfSpeech        "-nom-" ;
        dbnary:synonym             <http://kaiko.getalp.org/dbnary/fra/🧙‍♀️> ;
        dcterms:language           lexvo:fra ;
        lexinfo:partOfSpeech       lexinfo:noun ;
        lime:language              "fr" ;
        ontolex:canonicalForm      fra:__cf_magicienne__nom__1 ;
        ontolex:sense              fra:__ws_1_magicienne__nom__1 , fra:__ws_2_magicienne__nom__1 , fra:__ws_3_magicienne__nom__1 .

And this parses fine without problems as n3 and ttl:

$ sed -n -e '1,20p' -e '1533041,1533053p' fr_dbnary_ontolex.ttl \
>   | pipx run --spec rdflib \
>   rdfpipe -i n3 -o ntriples - > /dev/null; echo $?
⚠️  rdfpipe is already on your PATH and installed at
    /home/iwana/.local/bin/rdfpipe. Downloading and running anyway.
/home/iwana/.local/pipx/.cache/f12773c15374377/lib64/python3.10/site-packages/rdflib/plugins/serializers/nt.py:36: UserWarning: NTSerializer always uses UTF-8 encoding. Given encoding was: None
  warnings.warn(
0
$ sed -n -e '1,20p' -e '1533041,1533053p' fr_dbnary_ontolex.ttl \
>   | pipx run --spec rdflib \
>   rdfpipe -i ttl -o ntriples - > /dev/null; echo $?
⚠️  rdfpipe is already on your PATH and installed at
    /home/iwana/.local/bin/rdfpipe. Downloading and running anyway.

/home/iwana/.local/pipx/.cache/f12773c15374377/lib64/python3.10/site-packages/rdflib/plugins/serializers/nt.py:36: UserWarning: NTSerializer always uses UTF-8 encoding. Given encoding was: None
  warnings.warn(
0

I am really looking for a way to get a problematic snippet from the original file, which has the following sha256sum:

$ sha256sum fr_dbnary_ontolex.ttl
44a78a798f17835626f5fc1e433d7b644a01546cc31e12679de5bce79c1e0015  fr_dbnary_ontolex.ttl

I ask because I suspect this is basically just debugging problems with data that differs slightly from what is in the original file.

aucampia commented 2 years ago

I copied the prefixes and stanza to a small file (fr_dbnary_ontolex_snippet.ttl) and rdfpipe reported BadSyntax as expected.

To be clear, this snippet does fail for me:

$ sha256sum fr_dbnary_ontolex_snippet.ttl 
57b3a42fb77a2b2fcdf467223d7f2e3e3cfa3bf118de10e7c0272cf02ffd812b  fr_dbnary_ontolex_snippet.ttl
$ pipx run --spec rdflib rdfpipe -i ttl -o ntriples fr_dbnary_ontolex_snippet.ttl > /dev/null; echo $?
⚠️  rdfpipe is already on your PATH and installed at
    /home/iwana/.local/bin/rdfpipe. Downloading and running anyway.
Traceback (most recent call last):
  File "/home/iwana/.local/pipx/.cache/f12773c15374377/bin/rdfpipe", line 8, in <module>
    sys.exit(main())
  File "/home/iwana/.local/pipx/.cache/f12773c15374377/lib64/python3.10/site-packages/rdflib/tools/rdfpipe.py", line 200, in main
    parse_and_serialize(
  File "/home/iwana/.local/pipx/.cache/f12773c15374377/lib64/python3.10/site-packages/rdflib/tools/rdfpipe.py", line 54, in parse_and_serialize
    graph.parse(fpath, format=use_format, **kws)
  File "/home/iwana/.local/pipx/.cache/f12773c15374377/lib64/python3.10/site-packages/rdflib/graph.py", line 1851, in parse
    context.parse(source, publicID=publicID, format=format, **args)
  File "/home/iwana/.local/pipx/.cache/f12773c15374377/lib64/python3.10/site-packages/rdflib/graph.py", line 1267, in parse
    raise se
  File "/home/iwana/.local/pipx/.cache/f12773c15374377/lib64/python3.10/site-packages/rdflib/graph.py", line 1258, in parse
    parser.parse(source, self, **args)  # type: ignore[call-arg]
  File "/home/iwana/.local/pipx/.cache/f12773c15374377/lib64/python3.10/site-packages/rdflib/plugins/parsers/notation3.py", line 1913, in parse
    p.loadStream(stream)
  File "/home/iwana/.local/pipx/.cache/f12773c15374377/lib64/python3.10/site-packages/rdflib/plugins/parsers/notation3.py", line 434, in loadStream
    return self.loadBuf(stream.read())  # Not ideal
  File "/home/iwana/.local/pipx/.cache/f12773c15374377/lib64/python3.10/site-packages/rdflib/plugins/parsers/notation3.py", line 440, in loadBuf
    self.feed(buf)
  File "/home/iwana/.local/pipx/.cache/f12773c15374377/lib64/python3.10/site-packages/rdflib/plugins/parsers/notation3.py", line 466, in feed
    i = self.directiveOrStatement(s, j)
  File "/home/iwana/.local/pipx/.cache/f12773c15374377/lib64/python3.10/site-packages/rdflib/plugins/parsers/notation3.py", line 488, in directiveOrStatement
    return self.checkDot(argstr, j)
  File "/home/iwana/.local/pipx/.cache/f12773c15374377/lib64/python3.10/site-packages/rdflib/plugins/parsers/notation3.py", line 1149, in checkDot
    self.BadSyntax(argstr, j, "expected '.' or '}' or ']' at end of statement")
  File "/home/iwana/.local/pipx/.cache/f12773c15374377/lib64/python3.10/site-packages/rdflib/plugins/parsers/notation3.py", line 1646, in BadSyntax
    raise BadSyntax(self._thisDoc, self.lines, argstr, i, msg)
rdflib.plugins.parsers.notation3.BadSyntax: at line 27 of <>:
Bad syntax (expected '.' or '}' or ']' at end of statement) at ^ in:
"...b'nonym             <http://kaiko.getalp.org/dbnary/fra/\xf0\x9f\xa7\x99<2'^b'00d>\xe2\x99\x80\xef\xb8\x8f> ;\n        dcterms:language           lexvo:fra ;'..."
1

But I am not sure how that snippet came from http://kaiko.getalp.org/static/ontolex/latest/fr_dbnary_ontolex.ttl.bz2.

aucampia commented 2 years ago

The whole file parses fine with RDFLib 6.1.1 also:

$ sha256sum fr_dbnary_ontolex.ttl 
44a78a798f17835626f5fc1e433d7b644a01546cc31e12679de5bce79c1e0015  fr_dbnary_ontolex.ttl
$ pipx run --spec rdflib rdfpipe --version
⚠️  rdfpipe is already on your PATH and installed at /home/iwana/.local/bin/rdfpipe. Downloading and running anyway.
rdfpipe (using rdflib 6.1.1)
$ time pipx run --spec rdflib rdfpipe -i ttl -o ntriples fr_dbnary_ontolex.ttl > /dev/null
⚠️  rdfpipe is already on your PATH and installed at
    /home/iwana/.local/bin/rdfpipe. Downloading and running anyway.
/home/iwana/.local/pipx/.cache/bc007ac94011e4d/lib64/python3.10/site-packages/rdflib/plugins/serializers/nt.py:36: UserWarning: NTSerializer always uses UTF-8 encoding. Given encoding was: None
  warnings.warn(

real    13m50.405s
user    13m38.356s
sys 0m9.360s
$ time pipx run --spec rdflib rdfpipe -i n3 -o ntriples fr_dbnary_ontolex.ttl > /dev/null
⚠️  rdfpipe is already on your PATH and installed at
    /home/iwana/.local/bin/rdfpipe. Downloading and running anyway.
/home/iwana/.local/pipx/.cache/bc007ac94011e4d/lib64/python3.10/site-packages/rdflib/plugins/serializers/nt.py:36: UserWarning: NTSerializer always uses UTF-8 encoding. Given encoding was: None
  warnings.warn(

real    14m23.740s
user    14m10.972s
sys 0m9.841s
ghost commented 2 years ago

I opened the file in vi, perhaps that's the issue.
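
If the snippet was copied from vi's display, this may explain the failure: the traceback above shows the literal text `<200d>` inside the IRI, where the downloaded file presumably has a real zero-width joiner (U+200D). The sketch below checks an IRI body against the characters that the Turtle grammar's IRIREF production excludes (control characters, space, and `<>"{}|^` plus backtick and backslash). `illegal_chars` is a hypothetical helper, not part of RDFLib.

```python
import re

# Characters the Turtle grammar forbids between the angle brackets of an IRI.
ILLEGAL_IN_IRIREF = re.compile(r'[\x00-\x20<>"{}|^`\\]')


def illegal_chars(iri_body):
    """Return every character in iri_body that IRIREF does not allow."""
    return ILLEGAL_IN_IRIREF.findall(iri_body)


base = "http://kaiko.getalp.org/dbnary/fra/"
# Emoji sequence joined by a real zero-width joiner (U+200D).
original = base + "\U0001F9D9\u200D\u2640\uFE0F"
# Same sequence with the joiner replaced by the six visible characters "<200d>".
as_copied = base + "\U0001F9D9<200d>\u2640\uFE0F"

print(illegal_chars(original))
print(illegal_chars(as_copied))
```

The first call reports nothing, while the second reports the angle brackets of the pasted `<200d>`, which is where the parser in the traceback gives up.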

ghost commented 2 years ago

I followed the procedure you described above but cannot replicate ...


time pipx run --spec rdflib rdfpipe -i ttl -o ntriples fr_dbnary_ontolex.ttl > /dev/null
⚠️  rdfpipe is already on your PATH and installed at
    /home/gjh/PyBench/rdflib-github-sqlitedb/.venv/bin/rdfpipe. Downloading
    and running anyway.
Killed

real    13m18.952s
user    10m21.459s
sys 0m32.021s
keloemma commented 2 years ago

I tried your solution but I get this:

rdfpipe -i turtle -o n3 fr_dbnary_ontolex.ttl > /dev/null
Segmentation fault (core dumped)

aucampia commented 2 years ago

I tried your solution but I get this : rdfpipe -i turtle -o n3 fr_dbnary_ontolex.ttl > /dev/null Segmentation fault (core dumped)

Can you share outputs of:

python --version
pip --version
pipx --version
rdfpipe --version

Also, please try:

python -m pip install --upgrade pipx

Then try again:

pipx --version
time pipx run --spec "rdflib==6.1.1" rdfpipe -i n3 -o ntriples fr_dbnary_ontolex.ttl

Also, do try upgrading to the latest version of RDFLib (6.1.1) and see whether that eliminates your problem.

keloemma commented 2 years ago

python --version: 3.9
pip --version: 22.03
pipx --version: 1.0.0
rdfpipe --version: 6.1.1

aucampia commented 2 years ago
time pipx run --spec "rdflib==6.1.1" rdfpipe -i n3 -o ntriples fr_dbnary_ontolex.ttl

And your original script is still failing?

keloemma commented 2 years ago
time pipx run --spec "rdflib==6.1.1" rdfpipe -i n3 -o ntriples fr_dbnary_ontolex.ttl

And your original script is still failing?

The command was killed because there was not enough memory:

time pipx run --spec "rdflib==6.1.1" rdfpipe -iry_ontolex.ttl
⚠️ rdfpipe is already on your PATH and installed at /linkhome/rech/genlig01/umg16uw/.conda/env Downloading and running anyway.
Killed

I will relaunch in another directory. By the way, what is /dev/null?

ghost commented 2 years ago

seeing the same thing ...

$ python --version
Python 3.8.10
$ pip --version
pip 20.0.2
$ pipx --version
1.0.0
$ rdfpipe --version
rdfpipe (using rdflib 6.2.0-alpha)
$ pipx --version
1.0.0
$ time pipx run --spec "rdflib==6.1.1" rdfpipe -i n3 -o ntriples fr_dbnary_ontolex.ttl
⚠️  rdfpipe is already on your PATH and installed at /home/gjh/PyBench/rdflib-github-sqlitedb/.venv/bin/rdfpipe. Downloading and running anyway.
Killed

real    11m32.402s
user    10m43.690s
sys 0m11.269s

$ free
              total        used        free      shared  buff/cache   available
Mem:       32509308     8138536    22261988      762080     2108784    23140196
Swap:       2097148     2096860         288
keloemma commented 2 years ago

I informed the publisher, and he said that the problematic lines are perfectly normal: they contain emoji, which are perfectly fine for the database.

U+1F9D9 -> MAGE
U+200D -> ZERO WIDTH JOINER
U+2640 -> FEMALE SIGN
U+FE0F -> VARIATION SELECTOR-16
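The four code points listed above can be inspected with Python's standard `unicodedata` module. This is only a minimal sketch to illustrate the publisher's point that the sequence is ordinary Unicode text that encodes cleanly as UTF-8; it does not by itself show how any particular parser treats it.

```python
import unicodedata

# The four code points from the list above, in the same order.
sequence = "\U0001F9D9\u200D\u2640\uFE0F"

# Look up the official Unicode name of each code point.
names = [unicodedata.name(char, "<unnamed>") for char in sequence]
for char, name in zip(sequence, names):
    print(f"U+{ord(char):04X}  {name}")

# Encoding succeeds, so the sequence is representable in a UTF-8 file.
encoded = sequence.encode("utf-8")
print(len(sequence), "code points,", len(encoded), "bytes in UTF-8")
```

Together the four code points form a single emoji joined with a zero-width joiner, which is why they appear on one line of the file.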

@aucampia @gjhiggins

Just to let you know, it finally works. I re-downloaded the file from kaiko, ran the different commands provided by aucampia, and then used my script. I updated the library to 6.1.1 and it parses. I was able to extract what I wanted. It took approximately 40 minutes to parse the file and extract the words I needed. Thank you again :)


ghost commented 2 years ago

I informed the publisher, and he said that the problematic lines are perfectly normal: they contain emoji, which are perfectly fine for the database.

He's correct: Fuseki did successfully import the file; there were only warnings about illegal characters.

idk why I'm experiencing problems locally, sorry for confusing the issue.

keloemma commented 2 years ago

I informed the publisher, and he said that the problematic lines are perfectly normal: they contain emoji, which are perfectly fine for the database.

He's correct: Fuseki did successfully import the file; there were only warnings about illegal characters.

idk why I'm experiencing problems locally, sorry for confusing the issue.

I ran it on my laboratory's server. Before, I was running it on my own computer, but I don't know exactly why it works now and did not before.

aucampia commented 2 years ago

I informed the publisher, and he said that the problematic lines are perfectly normal: they contain emoji, which are perfectly fine for the database.

I'm going to emphasize again: http://kaiko.getalp.org/static/ontolex/latest/fr_dbnary_ontolex.ttl.bz2 parses fine with RDFLib 6.1.1, both as n3 and as turtle. There is nothing wrong with it. All the supposedly problematic snippets from this file also parse fine if I construct them directly from the file using sed.

What I need here to be of more help is some way to programmatically generate a snippet from http://kaiko.getalp.org/static/ontolex/latest/fr_dbnary_ontolex.ttl.bz2 that fails to parse. Right now, as far as I can tell, no such snippet exists, so I suspect something went wrong with the file before you parsed it.
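One way such a snippet could be generated programmatically, as an alternative to sed, is sketched below. The helper name `extract_snippet` is hypothetical, and the sketch assumes that the file declares its namespaces with `@prefix` or `@base` lines and that the chosen line range starts and ends on statement boundaries.

```python
import bz2


def extract_snippet(path, first_line, last_line):
    """Return the directives plus lines first_line..last_line (1-indexed, inclusive).

    Hypothetical helper: it keeps every @prefix/@base line seen before the
    range so that prefixed names in the snippet still resolve.
    """
    kept = []
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if number > last_line:
                break  # Nothing after the range is needed.
            is_directive = line.lstrip().startswith(("@prefix", "@base"))
            if number >= first_line or is_directive:
                kept.append(line)
    return "".join(kept)
```

The returned string can then be handed to RDFLib with `Graph().parse(data=snippet, format="turtle")` to check whether that range alone reproduces the error.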

If your original problem persists, then try printing the SHA-256 checksum (sha256sum) of the file before parsing it, and share the output.
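The checksum suggestion above can also be followed from inside the script itself, using only the standard library. This is a minimal sketch; the function name `sha256_of_file` is an illustrative choice, and the file is read in chunks because it is too large to load comfortably in one go.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it chunk by chunk."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `print(sha256_of_file("fr_dbnary_ontolex.ttl"))` just before `g.parse(...)` should print the same value that `sha256sum` reports for that file, which makes it easy to compare two copies of the download.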

EDIT: I see now the original problem no longer persists, will close this Issue then if there is nothing further.

aucampia commented 2 years ago

Closing this, as the problem no longer occurs for the reporter.