dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

DBpedia returns invalid RDF #676

Open reckart opened 3 years ago

reckart commented 3 years ago

Issue still valid?

DBpedia updates frequently in this order: 1. DIEF software, 2. monthly dumps, 3. online services loaded from dumps. We update http://dief.tools.dbpedia.org/server/extraction/ on a daily basis from the git and it reflects the current state. Please verify your issue with this service, e.g. http://dief.tools.dbpedia.org/server/extraction/en/extract?title=United+States Please add the link you used for verification:

Not sure what you want me to validate here. You can validate the issue using the "execute query" link below.

Source

Where did you find the data issue? Pick one, remove the others.

Web / SPARQL

State the service (e.g. http://dbpedia.org/sparql) and the SPARQL query
give a link to the web / linked data pages (e.g. http://dbpedia.org/resource/Berlin)

Here is a link to reproduce the issue: execute query.

Error Description

Please state the nature of your technical emergency:

The query returns a literal tagged as a langString, but it does not include a language.

<sparql xmlns="http://www.w3.org/2005/sparql-results#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/sw/DataAccess/rf1/result2.xsd">
 <head>
  <variable name="lc"/>
  <variable name="subj"/>
 </head>
 <results distinct="false" ordered="true">
  <result>
   <binding name="lc"><literal xml:lang="de">Altin Lala</literal></binding>
   <binding name="subj"><uri>http://de.dbpedia.org/resource/Altin_Lala</uri></binding>
  </result>
  <result>
   <binding name="lc"><literal xml:lang="de">Hans Lala</literal></binding>
   <binding name="subj"><uri>http://de.dbpedia.org/resource/Hans_Lala</uri></binding>
  </result>
  <result>
   <binding name="lc"><literal xml:lang="de">Maharan Lala</literal></binding>
   <binding name="subj"><uri>http://de.dbpedia.org/resource/Maharan_Lala</uri></binding>
  </result>
  <result>
   <binding name="lc"><literal xml:lang="de">Jiri Lala</literal></binding>
   <binding name="subj"><uri>http://de.dbpedia.org/resource/Jiri_Lala</uri></binding>
  </result>
  <result>
   <binding name="lc"><literal xml:lang="de">Jan Lala</literal></binding>
   <binding name="subj"><uri>http://de.dbpedia.org/resource/Jan_Lala</uri></binding>
  </result>
  <result>
   <binding name="lc"><literal datatype="http://www.w3.org/1999/02/22-rdf-syntax-ns#langString">Hans Lala</literal></binding>
   <binding name="subj"><uri>http://de.dbpedia.org/resource/Hans_Lala</uri></binding>
  </result>
 </results>
</sparql>

This is invalid according to the RDF specs. (Ref: https://github.com/eclipse/rdf4j/issues/2815)

Just for info from the RDF 1.1 spec:

"if and only if the datatype IRI is http://www.w3.org/1999/02/22-rdf-syntax-ns#langString, a non-empty language tag as defined by [BCP47]. The language tag must be well-formed according to section 2.2.9 of [BCP47]."

https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal
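
For anyone who wants to check a result document for this problem, here is a minimal Python sketch, assuming the response uses the SPARQL Query Results XML Format shown above. The function name `find_invalid_langstrings` and the shortened `SAMPLE` document are illustrative only and are not part of DBpedia or RDF4J.

```python
import xml.etree.ElementTree as ET

SPARQL_NS = "http://www.w3.org/2005/sparql-results#"
# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"


def find_invalid_langstrings(xml_text):
    """Return (variable name, lexical value) for every literal that is typed
    as rdf:langString but carries no non-empty xml:lang attribute."""
    root = ET.fromstring(xml_text)
    invalid = []
    for binding in root.iter(f"{{{SPARQL_NS}}}binding"):
        literal = binding.find(f"{{{SPARQL_NS}}}literal")
        if literal is None:
            continue  # URI or blank-node binding
        if literal.get("datatype") == RDF_LANGSTRING and not literal.get(XML_LANG):
            invalid.append((binding.get("name"), literal.text))
    return invalid


# Shortened copy of the result above: one valid and one invalid literal.
SAMPLE = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
 <head><variable name="lc"/><variable name="subj"/></head>
 <results>
  <result>
   <binding name="lc"><literal xml:lang="de">Hans Lala</literal></binding>
  </result>
  <result>
   <binding name="lc"><literal datatype="http://www.w3.org/1999/02/22-rdf-syntax-ns#langString">Hans Lala</literal></binding>
  </result>
 </results>
</sparql>"""

print(find_invalid_langstrings(SAMPLE))  # prints [('lc', 'Hans Lala')]
```

Only the second literal is reported, because it declares the rdf:langString datatype without a language tag.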

Error specification

Pick the appropriate:

I would guess that omitting the datatype or changing it to string should probably work.

Additional context

kurzum commented 3 years ago

@reckart Hm, the description of "Issue still valid?" in the template doesn't seem to be clear. This issue has been fixed, as can be seen here:

Anyhow. de.dbpedia.org/sparql needs fresher data. Web Extraction and latest downloads are clean.

reckart commented 3 years ago

@kurzum thanks for the response. How old is the data on de.dbpedia.org? We already had the issue back in May 2019 if not earlier.

That said: at least de.dbpedia.org doesn't fail on certain SPARQL queries as dbpedia.org appears to do since the recent Virtuoso upgrade.

reckart commented 3 years ago

@kurzum do you know if this was a systematic bug in the code that builds DBpedia, or is it something that could bite users again on another triple (not Hans Lala)?

kurzum commented 3 years ago

@kurzum thanks for the response. How old is the data on de.dbpedia.org? We already had the issue back in May 2019 if not earlier.

I don't know how old it is exactly. We are switching to the new system, where all the files are produced monthly and versioned with the Databus. Then you would know exactly what files from which month are loaded.

That said: at least de.dbpedia.org doesn't fail on certain SPARQL queries as dbpedia.org appears to do since the recent Virtuoso upgrade.

Well you say this. Probably, if we update de.dbpedia.org we will get issues that the missing rdf:langString makes queries fail ;)

@kurzum do you know if this was a systematic bug in the code that builds DBpedia, or is it something that could bite users again on another triple (not Hans Lala)?

We build one of the biggest data test frameworks. Please read "The New DBpedia Release Cycle: Increasing Agility and Efficiency in Knowledge Extraction Workflows". So this one is covered. Not all tests run perfectly, but we set it up so that mvn test fails if a fixed issue reoccurs. The challenge is mammoth: a full release has 22 billion triples, and loading them into an application adds another layer of problems. It is much more complex and harder than fixing bugs in software alone. We are currently getting this on the road, in particular figure 1 of the paper. You saw the new templates for issues. The goal here is to bring the time needed to verify, locate, and fix an issue down to 30 minutes, which would melt away the thousand small problems everywhere.

reckart commented 3 years ago

The reason I am asking whether this issue was fixed systematically is that over at RDF4J, I am lobbying for making the SPARQL results parser a bit more robust/lenient in the face of this particular issue (i.e. langString without lang), so that it still parses the result but returns the value as a string instead of a langString.
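
The lenient behaviour proposed here could look roughly like the following Python sketch. It is not RDF4J code (RDF4J is a Java library); the names `Literal` and `make_literal` and the `lenient` flag are hypothetical and exist only to illustrate the fallback from rdf:langString to xsd:string.

```python
from typing import NamedTuple, Optional

RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


class Literal(NamedTuple):
    value: str
    datatype: str
    lang: Optional[str] = None


def make_literal(value, datatype=None, lang=None, lenient=True):
    """Build a literal from the parts found in a SPARQL result binding."""
    if lang:
        # A non-empty language tag implies rdf:langString (RDF 1.1).
        return Literal(value, RDF_LANGSTRING, lang)
    if datatype == RDF_LANGSTRING:
        # Invalid per RDF 1.1: langString without a language tag.
        if not lenient:
            raise ValueError(f"rdf:langString literal without language tag: {value!r}")
        # Lenient mode: fall back to a plain string instead of failing.
        return Literal(value, XSD_STRING)
    # No datatype given means a simple literal, i.e. xsd:string.
    return Literal(value, datatype or XSD_STRING)


print(make_literal("Hans Lala", datatype=RDF_LANGSTRING).datatype)
# prints http://www.w3.org/2001/XMLSchema#string
```

With `lenient=False` the same call raises a `ValueError`, which corresponds to the strict behaviour that makes the whole result unusable.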

What do you think? Is this kind of problem one that the data providers should have to fix or should query results parsers such as the one in RDF4J be able to gracefully handle such problems with the data?

reckart commented 3 years ago

That said: at least de.dbpedia.org doesn't fail on certain SPARQL queries as dbpedia.org appears to do since the recent Virtuoso upgrade. Well you say this. Probably, if we update de.dbpedia.org we will get issues that the missing rdf:langString makes queries fail ;)

I'm more referring to this particular issue here which I believe appears to be a bug in the Virtuoso query compiler: https://github.com/dbpedia/extraction-framework/issues/672

reckart commented 3 years ago

Funny - I just noticed this other report about langString/string issues just a bit down in the issue list: https://github.com/dbpedia/extraction-framework/issues/603

kurzum commented 3 years ago

What do you think? Is this kind of problem one that the data providers should have to fix or should query results parsers such as the one in RDF4J be able to gracefully handle such problems with the data?

Neither. I think we need better debugging tools, e.g. the framework we describe in the paper has these kinds of tests, and they transfer well to other data. I see the problem in:

  1. hardly any linters or syntax highlighting in IDEs, i.e. missing tooling
  2. most parsers stop at the first syntax error they find. Especially for N-Triples, they should recover. Otherwise I am with Jon Postel, although I have to admit that producing correct data is very hard and the tooling is bad.
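
The recovery idea in point 2 can be sketched as follows. This is a minimal Python illustration under loud assumptions: the regular expression covers only a simplified subset of the N-Triples grammar (no escape validation, no blank-node label checks), `parse_ntriples_lenient` is a hypothetical name, and the example.org triples are made up for the demonstration.

```python
import re

# Simplified N-Triples terms (assumption: NOT the full grammar).
IRI = r"<[^<>\s]*>"
BNODE = r"_:\S+"
LITERAL = r'"(?:[^"\\]|\\.)*"(?:@[A-Za-z]+(?:-[A-Za-z0-9]+)*|\^\^<[^<>\s]*>)?'
TRIPLE = re.compile(
    rf"^\s*({IRI}|{BNODE})\s+({IRI})\s+({IRI}|{BNODE}|{LITERAL})\s*\.\s*$"
)


def parse_ntriples_lenient(lines):
    """Parse line by line; record bad lines and continue instead of aborting."""
    triples, errors = [], []
    for lineno, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        match = TRIPLE.match(stripped)
        if match:
            triples.append(match.groups())
        else:
            errors.append((lineno, stripped))  # remember the error, keep going
    return triples, errors


data = [
    '<http://example.org/s> <http://example.org/p> "Hans Lala"@de .',
    '<http://example.org/s> <http://example.org/p> "broken literal .',
    '<http://example.org/s> <http://example.org/p> <http://example.org/o> .',
]
triples, errors = parse_ntriples_lenient(data)
print(len(triples), [lineno for lineno, _ in errors])  # prints 2 [2]
```

Because N-Triples is line-based, one malformed line does not have to invalidate the rest of the file; the error list can be reported afterwards.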

kurzum commented 3 years ago

I'm more referring to this particular issue here which I believe appears to be a bug in the Virtuoso query compiler: #672

which is already being fixed (not sure about priority)

reckart commented 3 years ago

Well, as a person "in the middle" who is neither producing the data nor developing the RDF libraries, having (slightly) invalid data and strict RDF libraries would essentially lock me out of using the semantic resources. Since perfect data is very hard to achieve and making RDF libraries more resilient is at least a realistic possibility, I think I'll continue to lobby for the latter. That doesn't mean that data and related tooling should not become better, but it means the data becomes more accessible while perfection is being worked towards ;)

kurzum commented 3 years ago

refiled under hosting