Open desislava-hristova-ontotext opened 4 years ago
Hi @desislava-hristova-ontotext, the output is certainly incorrect at the semantic level. We are leaving this open for the community to fix this (minor) extraction bug.
However, we will start a discussion on whether, in future releases, triples like this should be filtered out of our parsed DBpedia release on the Databus into a separate file of erroneous triples. Your help and input would be valuable to us. Parsing and triple validation are currently performed with Jena.
Jena as of 3.14 does not report an error.
➜ bin curl https://downloads.dbpedia.org/repo/lts/generic/persondata/2019.08.30/persondata_lang%3den.ttl.bz2 | lbzcat | riot --validate
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 290 100 290 0 0 1218 0 --:--:-- --:--:-- --:--:-- 1223
➜ bin
So if you think this should be rejected, please open an issue with Jena so that they can fix the parser.
Moreover, is it possible for you to ignore the warnings in RDF4J and still load the file? I know Stardog had a flag to disable strict parsing. Something similar probably also exists for GraphDB?
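The filtering idea mentioned above (moving offending triples into a separate file) could look roughly like the following. This is only a sketch under assumptions: the name `split_triples` and the regular expression are illustrative and not part of the DBpedia tooling, and the input is assumed to be line-based N-Triples as in the persondata dumps. It relies on the fact that a language-tagged literal is serialized as "text"@lang and never with an explicit ^^ datatype, so any literal written with ^^rdf:langString has no language tag.

```python
import re
from typing import Iterable, List, Tuple

RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"

# Matches a line whose object literal ends in ^^<...#langString> followed by
# the terminating dot. Such a literal cannot carry a language tag.
UNTAGGED_LANGSTRING = re.compile(
    r'"\^\^<' + re.escape(RDF_LANGSTRING) + r'>\s*\.\s*$'
)


def split_triples(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Partition N-Triples lines into (valid, erroneous) lists.

    Hypothetical helper: a line is considered erroneous only if its object
    is typed rdf:langString without a language tag; nothing else is checked.
    """
    valid: List[str] = []
    erroneous: List[str] = []
    for line in lines:
        if UNTAGGED_LANGSTRING.search(line):
            erroneous.append(line)
        else:
            valid.append(line)
    return valid, erroneous
```

For a compressed dump, the lines could be read with `bz2.open(path, "rt", encoding="utf-8")` from the standard library and the two lists written to separate files. For very large files a streaming variant that writes each line as it is classified would be preferable to collecting lists in memory.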
The else-if condition is too weak; it should be something like if ( dt == rdflangString && language )
Finally, the missing language is handled incorrectly here: https://github.com/dbpedia/extraction-framework/blob/91577ca39df1bc4a6a6aab5fb88d0e0a069df816/core/src/main/scala/org/dbpedia/extraction/destinations/formatters/TerseBuilder.scala#L36 So I am still not sure where this is built.
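To make the stricter condition concrete, here is a small Python illustration of the decision logic. It is not the Scala code from TerseBuilder: `render_literal` is a hypothetical function, and falling back to a plain literal when the language is missing is just one possible policy (dropping or logging the triple would be another).

```python
from typing import Optional

RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


def render_literal(value: str, datatype: Optional[str], language: Optional[str]) -> str:
    """Serialize a literal in N-Triples style (hypothetical helper)."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    quoted = f'"{escaped}"'

    # Stricter check: emit a language tag only if the datatype is
    # rdf:langString AND a non-empty language is actually present.
    if datatype == RDF_LANGSTRING and language:
        return f"{quoted}@{language}"

    # Assumed fallback: rdf:langString without a language, a missing datatype,
    # or xsd:string are all written as a plain literal, never as
    # "..."^^rdf:langString.
    if datatype is None or datatype in (RDF_LANGSTRING, XSD_STRING):
        return quoted

    # Any other datatype is written explicitly.
    return f"{quoted}^^<{datatype}>"
```

With this logic, a value such as "Jim Pewter" with datatype rdf:langString and no language is written as a plain literal instead of the invalid typed form.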
➜ 20200220 git:(master) ✗ curl http://dbpedia-generic.tib.eu/release/generic/persondata/2019.07.01/persondata_lang\=en.ttl.bz2 | lbunzip2| grep langString | head
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 266 100 266 0 0 4030 0 --:--:-- --:--:-- --:--:-- 4030
➜ 20200220 git:(master) ✗ curl http://dbpedia-generic.tib.eu/release/generic/persondata/2019.08.01/persondata_lang\=en.ttl.bz2 | lbunzip2| grep langString | head
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 266 100 266 0 0 4666 0 --:--:-- --:--:-- --:--:-- 4666
➜ 20200220 git:(master) ✗ curl http://dbpedia-generic.tib.eu/release/generic/persondata/2019.08.30/persondata_lang\=en.ttl.bz2 | lbunzip2| grep langString | head
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 290 100 290 0 0 5087 0 --:--:-- --:--:-- --:--:-- 5087
<http://dbpedia.org/resource/Jim_Pewter> <http://xmlns.com/foaf/0.1/name> "Jim Pewter"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString> .
<http://dbpedia.org/resource/Jim_Pewter> <http://xmlns.com/foaf/0.1/surname> "Pewter"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString> .
<http://dbpedia.org/resource/Jim_Pewter> <http://xmlns.com/foaf/0.1/givenName> "Jim"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString> .
<http://dbpedia.org/resource/Jim_Pewter> <http://purl.org/dc/elements/1.1/description> "American radio personality"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString> .
➜ 20200220 git:(master) ✗ curl http://dbpedia-generic.tib.eu/release/generic/persondata/2019.10.01/persondata_lang\=en.ttl.bz2 | lbunzip2| grep langString | head
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 337 100 337 0 0 5106 0 --:--:-- --:--:-- --:--:-- 5106
<http://dbpedia.org/resource/Jim_Pewter> <http://xmlns.com/foaf/0.1/name> "Jim Pewter"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString> .
<http://dbpedia.org/resource/Jim_Pewter> <http://xmlns.com/foaf/0.1/surname> "Pewter"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString> .
<http://dbpedia.org/resource/Jim_Pewter> <http://xmlns.com/foaf/0.1/givenName> "Jim"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString> .
<http://dbpedia.org/resource/Jim_Pewter> <http://purl.org/dc/elements/1.1/description> "American radio personality"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString> .
The error has been produced since version 2019.08.30 (marvin extraction). Since that release, we have included two preprocessing steps:
../run ResolveTransitiveLinks $EXTRACTIONBASEDIR redirects redirects_transitive .ttl.bz2 @downloaded
../run MapObjectUris $EXTRACTIONBASEDIR redirects_transitive .ttl.bz2 disambiguations,infobox-properties,page-links,persondata,topical-concepts _redirected .ttl.bz2 @downloaded
https://git.informatik.uni-leipzig.de/dbpedia-assoc/marvin-config/blob/master/functions.sh#L67
@Vehnem Can we add a test for this type of error, or do we already have one?
The following DBpedia files (and probably more) contain invalid literals typed as rdf:langString but lacking a language tag:
https://downloads.dbpedia.org/repo/lts/generic/infobox-properties/2019.08.30/infobox-properties_lang%3den.ttl.bz2
https://downloads.dbpedia.org/repo/lts/generic/persondata/2019.08.30/persondata_lang%3den.ttl.bz2
See: https://www.w3.org/TR/rdf11-concepts/#dfn-language-tagged-string
None of these files can be loaded with RDF4J, which does not tolerate such literals and returns an error: "RDF Parse Error: datatype rdf:langString requires a language tag [line 1]"