dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

Some dbpedia files contain invalid literals with rdf:langString and empty language tag #603

Open desislava-hristova-ontotext opened 4 years ago

desislava-hristova-ontotext commented 4 years ago

The following DBpedia files (and probably more) contain invalid literals typed as rdf:langString but without a language tag:
https://downloads.dbpedia.org/repo/lts/generic/infobox-properties/2019.08.30/infobox-properties_lang%3den.ttl.bz2
https://downloads.dbpedia.org/repo/lts/generic/persondata/2019.08.30/persondata_lang%3den.ttl.bz2

See: https://www.w3.org/TR/rdf11-concepts/#dfn-language-tagged-string

Such files cannot be loaded with RDF4J, which does not tolerate this and returns an error: "RDF Parse Error: datatype rdf:langString requires a language tag [line 1]"
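
For illustration, here is a minimal RDF4J sketch (not from the issue itself) that reproduces the reported parse error on one of the offending triples; the triple is taken from the dump excerpts further down, and the Rio parser is assumed to run with its default strict settings.

import java.io.StringReader
import org.eclipse.rdf4j.rio.{RDFFormat, Rio}

object LangStringParseCheck {
  def main(args: Array[String]): Unit = {
    // rdf:langString literal without a language tag, as found in the dumps
    val badTriple =
      """<http://dbpedia.org/resource/Jim_Pewter> <http://xmlns.com/foaf/0.1/name> "Jim Pewter"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString> ."""
    // Rio rejects this with "datatype rdf:langString requires a language tag"
    Rio.parse(new StringReader(badTriple), "", RDFFormat.NTRIPLES)
  }
}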

JJ-Author commented 4 years ago

Hi @desislava-hristova-ontotext, the output is certainly not correct at the semantic level. We will leave this open for the community to fix this (minor) extraction bug.

However, we will start a discussion about whether such triples should be filtered out of our parsed DBpedia release on the Databus into a separate file of erroneous triples for future releases. Your help and input would be valuable to us. Parsing / triple validation is currently performed with Jena.

Jena as of 3.14 does not report an error.

➜ bin curl https://downloads.dbpedia.org/repo/lts/generic/persondata/2019.08.30/persondata_lang%3den.ttl.bz2 | lbzcat | riot --validate
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   290  100   290    0     0   1218      0 --:--:-- --:--:-- --:--:--  1223
➜  bin 

So if you think such triples should be rejected, please post an issue with Jena so that they can fix the parser.

Moreover, is it possible for you to ignore the warnings with RDF4J and still load the file? I know Stardog had a flag to disable strict parsing; probably something similar exists for GraphDB?

Vehnem commented 4 years ago

https://github.com/dbpedia/extraction-framework/blob/91577ca39df1bc4a6a6aab5fb88d0e0a069df816/core/src/main/scala/org/dbpedia/extraction/destinations/formatters/TripleFormatter.scala#L19

The else if condition is too weak; it should be something like if (dt == rdfLangString && language)
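
As a rough sketch of the suggested stricter check (hypothetical code; the actual field names and surrounding logic in TripleFormatter.scala may differ):

val RdfLangString = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"

// Hypothetical helper illustrating the stricter branching suggested above.
def formatLiteral(value: String, datatype: String, language: String): String = {
  if (datatype == RdfLangString && language.nonEmpty)
    "\"" + value + "\"@" + language            // language-tagged string, e.g. "Jim"@en
  else if (datatype.nonEmpty && datatype != RdfLangString)
    "\"" + value + "\"^^<" + datatype + ">"    // typed literal
  else
    "\"" + value + "\""                        // never emit ^^rdf:langString without a tag
}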

Vehnem commented 4 years ago

Finally, the missing language gets handled incorrectly here: https://github.com/dbpedia/extraction-framework/blob/91577ca39df1bc4a6a6aab5fb88d0e0a069df816/core/src/main/scala/org/dbpedia/extraction/destinations/formatters/TerseBuilder.scala#L36 So I am still not sure where this is built.

Vehnem commented 4 years ago
➜  20200220 git:(master) ✗ curl http://dbpedia-generic.tib.eu/release/generic/persondata/2019.07.01/persondata_lang\=en.ttl.bz2 | lbunzip2| grep langString | head
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   266  100   266    0     0   4030      0 --:--:-- --:--:-- --:--:--  4030
➜  20200220 git:(master) ✗ curl http://dbpedia-generic.tib.eu/release/generic/persondata/2019.08.01/persondata_lang\=en.ttl.bz2 | lbunzip2| grep langString | head
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   266  100   266    0     0   4666      0 --:--:-- --:--:-- --:--:--  4666
➜  20200220 git:(master) ✗ curl http://dbpedia-generic.tib.eu/release/generic/persondata/2019.08.30/persondata_lang\=en.ttl.bz2 | lbunzip2| grep langString | head
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   290  100   290    0     0   5087      0 --:--:-- --:--:-- --:--:--  5087
<http://dbpedia.org/resource/Jim_Pewter> <http://xmlns.com/foaf/0.1/name> "Jim Pewter"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString> .
<http://dbpedia.org/resource/Jim_Pewter> <http://xmlns.com/foaf/0.1/surname> "Pewter"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString> .
<http://dbpedia.org/resource/Jim_Pewter> <http://xmlns.com/foaf/0.1/givenName> "Jim"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString> .
<http://dbpedia.org/resource/Jim_Pewter> <http://purl.org/dc/elements/1.1/description> "American radio personality"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString> .
➜  20200220 git:(master) ✗ curl http://dbpedia-generic.tib.eu/release/generic/persondata/2019.10.01/persondata_lang\=en.ttl.bz2 | lbunzip2| grep langString | head
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   337  100   337    0     0   5106      0 --:--:-- --:--:-- --:--:--  5106
<http://dbpedia.org/resource/Jim_Pewter> <http://xmlns.com/foaf/0.1/name> "Jim Pewter"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString> .
<http://dbpedia.org/resource/Jim_Pewter> <http://xmlns.com/foaf/0.1/surname> "Pewter"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString> .
<http://dbpedia.org/resource/Jim_Pewter> <http://xmlns.com/foaf/0.1/givenName> "Jim"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString> .
<http://dbpedia.org/resource/Jim_Pewter> <http://purl.org/dc/elements/1.1/description> "American radio personality"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString> .

The error has been produced since version 08.30 (marvin extraction). Since then we have included two preprocessing steps:

../run ResolveTransitiveLinks $EXTRACTIONBASEDIR redirects redirects_transitive .ttl.bz2 @downloaded   
../run MapObjectUris $EXTRACTIONBASEDIR redirects_transitive .ttl.bz2 disambiguations,infobox-properties,page-links,persondata,topical-concepts _redirected .ttl.bz2 @downloaded

https://git.informatik.uni-leipzig.de/dbpedia-assoc/marvin-config/blob/master/functions.sh#L67

m1ci commented 4 years ago

@Vehnem Can we add a test for this type of error? Or do we already have such a test?
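
One possible shape for such a check, as a hypothetical sketch (not the project's existing test infrastructure): scan an extracted N-Triples dump and fail if any literal is explicitly typed rdf:langString, since a correct serialization would carry a language tag ("..."@en) instead of the datatype IRI.

import scala.io.Source

object LangStringDumpCheck {
  // Any explicit ^^rdf:langString in the output indicates a missing language tag.
  private val BadLangString = "^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString>"

  def main(args: Array[String]): Unit = {
    val offending = Source.fromFile(args(0)).getLines()
      .filter(_.contains(BadLangString))
      .take(10)
      .toList
    offending.foreach(println)
    if (offending.nonEmpty)
      sys.error("dump contains literals typed rdf:langString without a language tag")
  }
}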