AKSW / QuitDiff

Command line comparison tool for semantic web data; it can also be used as a git difftool for RDF data.
GNU General Public License v3.0

problems with escapes in nt files #9

Closed JJ-Author closed 6 years ago

JJ-Author commented 6 years ago
~/difftest/QuitDiff/bin/quit-diff --diffFormat=eccrev . /media/bigone/25TB/www/downloads.dbpedia.org/tmpdev/diffs/new/mappings/geo-coordinates-mappingbased/2018.10.16/geo-coordinates-mappingbased-2018.10.16_nlwiki.parsed-nt.deletes.nt  1 2 /media/bigone/25TB/www/downloads.dbpedia.org/tmpdev/diffs/new/mappings/geo-coordinates-mappingbased/2018.10.16/geo-coordinates-mappingbased-2018.10.16_nlwiki.parsed-nt.adds.nt 

http://nl.dbpedia.org/resource/Lijst_van_gemeentelijke_monumenten_in_Gouda_(centrum)_H_t/m_Q__Woonhuis_met_de_tekst:_"GIJ_KINDEREN_LOOFT_DEN_HEE__1 does not look like a valid URI, trying to serialize this will break.
http://nl.dbpedia.org/resource/Lijst_van_gemeentelijke_monumenten_in_Gouda_(centrum)_H_t/m_Q__Woonhuis_met_de_tekst:_"GIJ_KINDEREN_LOOFT_DEN_HEE__1 does not look like a valid URI, trying to serialize this will break.
http://nl.dbpedia.org/resource/Lijst_van_gemeentelijke_monumenten_in_Gouda_(centrum)_H_t/m_Q__Woonhuis_met_de_tekst:_"GIJ_KINDEREN_LOOFT_DEN_HEE__1 does not look like a valid URI, trying to serialize this will break.
http://nl.dbpedia.org/resource/Lijst_van_gemeentelijke_monumenten_in_Gouda_(centrum)_H_t/m_Q__Woonhuis_met_de_tekst:_"GIJ_KINDEREN_LOOFT_DEN_HEE__1 does not look like a valid URI, trying to serialize this will break.
http://nl.dbpedia.org/resource/Lijst_van_gemeentelijke_monumenten_in_Gouda_(centrum)_H_t/m_Q__Woonhuis_met_de_tekst:_"HIJ_HEEFT_HET_U_GEGEVEN".__1 does not look like a valid URI, trying to serialize this will break.
http://nl.dbpedia.org/resource/Lijst_van_gemeentelijke_monumenten_in_Gouda_(centrum)_H_t/m_Q__Woonhuis_met_de_tekst:_"HIJ_HEEFT_HET_U_GEGEVEN".__1 does not look like a valid URI, trying to serialize this will break.
http://nl.dbpedia.org/resource/Lijst_van_gemeentelijke_monumenten_in_Gouda_(centrum)_H_t/m_Q__Woonhuis_met_de_tekst:_"HIJ_HEEFT_HET_U_GEGEVEN".__1 does not look like a valid URI, trying to serialize this will break.
http://nl.dbpedia.org/resource/Lijst_van_gemeentelijke_monumenten_in_Gouda_(centrum)_H_t/m_Q__Woonhuis_met_de_tekst:_"HIJ_HEEFT_HET_U_GEGEVEN".__1 does not look like a valid URI, trying to serialize this will break.
Traceback (most recent call last):
  File "/home/johannes/difftest/QuitDiff/bin/quit-diff", line 6, in <module>
    quit_diff.main()
  File "/usr/local/lib/python3.4/dist-packages/quit_diff/__init__.py", line 25, in main
    quitdiff.diff(args.path, args.oldFile, args.newFile, diffFormat=args.diffFormat)
  File "/usr/local/lib/python3.4/dist-packages/quit_diff/QuitDiff.py", line 74, in diff
    self.difftool(oldFile, newFile, None, None, diffFormat=diffFormat)
  File "/usr/local/lib/python3.4/dist-packages/quit_diff/QuitDiff.py", line 113, in difftool
    print(diffSerializer.serialize(add, remove))
  File "/usr/local/lib/python3.4/dist-packages/quit_diff/serializer/EccrevDiff.py", line 46, in serialize
    return g.serialize(format="trig").decode("utf-8")
  File "/usr/local/lib/python3.4/dist-packages/rdflib/graph.py", line 942, in serialize
    serializer.serialize(stream, base=base, encoding=encoding, **args)
  File "/usr/local/lib/python3.4/dist-packages/rdflib/plugins/serializers/trig.py", line 85, in serialize
    if self.statement(subject) and not firstTime:
  File "/usr/local/lib/python3.4/dist-packages/rdflib/plugins/serializers/turtle.py", line 270, in statement
    return self.s_squared(subject) or self.s_default(subject)
  File "/usr/local/lib/python3.4/dist-packages/rdflib/plugins/serializers/turtle.py", line 274, in s_default
    self.path(subject, SUBJECT)
  File "/usr/local/lib/python3.4/dist-packages/rdflib/plugins/serializers/turtle.py", line 289, in path
    or self.p_default(node, position, newline)):
  File "/usr/local/lib/python3.4/dist-packages/rdflib/plugins/serializers/turtle.py", line 295, in p_default
    self.write(self.label(node, position))
  File "/usr/local/lib/python3.4/dist-packages/rdflib/plugins/serializers/turtle.py", line 311, in label
    return self.getQName(node, position == VERB) or node.n3()
  File "/usr/local/lib/python3.4/dist-packages/rdflib/term.py", line 230, in n3
    raise Exception('"%s" does not look like a valid URI, I cannot serialize this as N3/Turtle. Perhaps you wanted to urlencode it?'%self)

JJ-Author commented 6 years ago

You can download the files in question here: http://downloads.dbpedia.org/tmpdev/diffs/new//mappings/geo-coordinates-mappingbased/2018.10.16/geo-coordinates-mappingbased-2018.10.16_nlwiki.parsed-nt.deletes.nt

http://downloads.dbpedia.org/tmpdev/diffs/new//mappings/geo-coordinates-mappingbased/2018.10.16/geo-coordinates-mappingbased-2018.10.16_nlwiki.parsed-nt.adds.nt

white-gecko commented 6 years ago

The URIs/IRIs you show contain a double quote ("), which is not a valid character in URIs or IRIs (see https://github.com/RDFLib/rdflib/issues/703). You should pass valid RDF documents to QuitDiff. Maybe this discussion helps you: https://github.com/RDFLib/rdflib/issues/412
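
To make the point about invalid characters concrete, here is a minimal sketch in plain Python (standard library only). It is not QuitDiff's or rdflib's code: the character set `INVALID_IRI_CHARS` and both helper functions are assumptions for illustration, modelled on the kind of check that produces the "does not look like a valid URI" warning. The example IRI is a shortened form of the one in the log above.

```python
# Characters treated as invalid inside an IRI.
# Assumption: this set is illustrative and only approximates the check
# behind the "does not look like a valid URI" warning.
INVALID_IRI_CHARS = '<>" {}|\\^`'


def find_invalid_chars(iri):
    """Return the sorted list of distinct invalid characters found in iri."""
    return sorted({c for c in iri if c in INVALID_IRI_CHARS})


def percent_encode_invalid(iri):
    """Percent-encode only the invalid characters, leaving the rest untouched."""
    return "".join(
        "%{:02X}".format(ord(c)) if c in INVALID_IRI_CHARS else c
        for c in iri
    )


# Shortened version of the IRI from the warning output (hypothetical example).
example = 'http://nl.dbpedia.org/resource/Woonhuis_met_de_tekst:_"HIJ_HEEFT_HET_U_GEGEVEN".__1'

print(find_invalid_chars(example))
print(percent_encode_invalid(example))
```

Note that percent-encoding produces a different IRI string, so the same rewrite would have to be applied consistently to both files before diffing them.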

JJ-Author commented 6 years ago

Well, that is confusing, since the input files were validated and serialized by raptor beforehand. I cannot tell whether the behavior is correct or not. I can confirm that " is not a reserved character in an IRI, and there seems to be no production rule that derives a ". I can also confirm that Turtle 1.1 allows IRIs to contain quotes (escaped with a backslash-u sequence), so the input is syntactically correct. The online IRI validator http://sparql.org/validate/iri?iri=http%3A%2F%2Fnl.dbpedia.org%2Fresource%2FLijst_van_gemeentelijke_monumenten_in_Gouda_%28centrum%29_H_t%2Fm_Q__Woonhuis_met_de_tekst%3A_%22HIJ_HEEFT_HET_U_GEGEVEN%22.__1 adds to the confusion:

"4/UNWISE_CHARACTER in PATH: The character matches no grammar rules of URIs/IRIs. These characters are permitted in RDF URI References, XML system identifiers, and XML Schema anyURIs."
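
The mismatch described above can be sketched in a few lines of Python. The sketch assumes the input files write the quote as the numeric escape `\u0022` inside the angle brackets, which this thread does not show. The function `decode_uchar` and the `example.org` IRI are hypothetical. The escaped form contains no literal quote and so can pass a syntax check, while the decoded IRI that a parser hands to the serializer does contain one.

```python
import re

# Numeric escapes as used in N-Triples / Turtle 1.1 IRIs:
# \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
UCHAR = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_uchar(iriref_body):
    """Replace every \\uXXXX / \\UXXXXXXXX escape with the character it denotes."""
    def repl(match):
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))
    return UCHAR.sub(repl, iriref_body)


# Hypothetical IRI body as it might appear between < and > in an .nt file.
escaped = r"http://example.org/tekst:_\u0022HIJ\u0022"
decoded = decode_uchar(escaped)

print('"' in escaped)   # the escaped text itself has no literal quote
print('"' in decoded)   # the decoded IRI does
```

Under that assumption, both observations in the thread are consistent: the file is well-formed at the syntax level, yet the decoded IRI contains a character that the serializer rejects.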