dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

Missing English comments and abstracts for multiple articles #714

Open pkleef opened 3 years ago

pkleef commented 3 years ago

Issue validity

Some explanation: the DBpedia Snapshot is produced every three months (see Release Frequency & Schedule) and is loaded into http://dbpedia.org/sparql. During these three months, Wikipedia changes and the DBpedia Information Extraction Framework also receives patches. At http://dief.tools.dbpedia.org/server/extraction/en/ we host a daily updated extraction web service that can extract one Wikipedia page at a time. To check whether your issue is still valid, please enter the article name, e.g. Berlin or Joe_Biden, at http://dief.tools.dbpedia.org/server/extraction/en/. If the issue persists, please post the link from your browser here:

https://dbpedia.org/resource/Eating_your_own_dog_food?lang=
https://dbpedia.org/resource/Paul_Erd%C5%91s?lang=

NOTE: http://dief.tools.dbpedia.org/server/extraction/en/Eating_your_own_dog_food returns an error at this time

Error Description

Please state the nature of your technical emergency:

I received several reports of articles with missing English (and possibly other language) triples for dbo:abstract and dbo:comment.

Pinpointing the source of the error

Where did you find the data issue? Non-exhaustive options are:

The error occurs on the current 2021-06 snapshot of the Databus dump that is loaded on http://dbpedia.org/sparql

jlareck commented 3 years ago

It is very interesting that this error also occurs with http://dief.tools.dbpedia.org/server/extraction/en/Eating_your_own_dog_food , because when I run the server locally on my machine, it extracts abstracts for English:

[Screenshot of the local extraction result, 2021-09-14]

Then I guess the error on the server could be related to the configuration used by dief.tools.dbpedia.org/server/ (maybe it uses an old version of the extraction framework).

JJ-Author commented 3 years ago

@jlareck can you patch the HTML so that it shows the commit (optionally also the branch) it is using, with a hyperlink to GitHub? I think sometimes the cronjob fails or the redeploy script is not mature yet. So yes, the service was out of date, but that is hard to recognize. Displaying this simple piece of information at http://dief.tools.dbpedia.org/server/extraction/en/ could really help.

jlareck commented 3 years ago

@JJ-Author it is not completely clear to me what I need to do. So I need to find the commit where this error occurs, am I right?

JJ-Author commented 3 years ago

No, just write a commit that prints out the current commit hash of the build on the DIEF extractor webpage. You could use something like this: https://github.com/git-commit-id/git-commit-id-maven-plugin.
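
For illustration, a minimal sketch of how the page could render this, assuming the plugin is used with its defaults (git.properties on the classpath, standard property names):

```scala
import java.util.Properties

// Sketch only: read the git.properties file that git-commit-id-maven-plugin
// writes at build time (default location and property names assumed) and build
// a link to the exact commit on GitHub for display on the server page.
object BuildInfo {
  def commitLine: String = {
    val in = getClass.getResourceAsStream("/git.properties")
    if (in == null) "build info unavailable"
    else {
      val props = new Properties()
      try props.load(in) finally in.close()
      val commit = props.getProperty("git.commit.id.abbrev", "unknown")
      val branch = props.getProperty("git.branch", "unknown")
      s"""Built from <a href="https://github.com/dbpedia/extraction-framework/commit/$commit">$commit</a> (branch $branch)"""
    }
  }
}
```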

JJ-Author commented 3 years ago

By the way, I have now updated the web service manually. But we don't know for sure, because we don't see which commit it is using.

jlareck commented 3 years ago

Oh, well, I also noticed another thing. Maybe the problem was incorrect usage of the API server call, because this URL works fine: http://dief.tools.dbpedia.org/server/extraction/en/extract?title=Eating+your+own+dog+food&revid=&format=trix&extractors=custom . @pkleef could you please check it and say whether this is the expected result? Or maybe I don't understand what the result on the server should be.
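
For reference, a sketch of how that call is composed (parameter names are taken from the URL above; the title is just the example article):

```scala
import java.net.URLEncoder

// Build the extract call in the same form as the working URL above.
object BuildExtractUrl extends App {
  val base  = "http://dief.tools.dbpedia.org/server/extraction/en/extract"
  val title = URLEncoder.encode("Eating your own dog food", "UTF-8") // spaces become '+'
  println(s"$base?title=$title&revid=&format=trix&extractors=custom")
}
```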

jlareck commented 3 years ago

@jlareck can you patch the HTML so that it shows the commit (optionally also the branch) it is using, with a hyperlink to GitHub?

No, just write a commit that prints out the current commit hash of the build on the DIEF extractor webpage.

@JJ-Author, as I understand it, I need to add the current commit information to the DIEF server page and link it to the commit on GitHub. Should I add it somewhere at the top of the page (for example, near the ontology) or in the footer?

[Screenshot of the DIEF server page, 2021-09-16]
pkleef commented 3 years ago

@jlareck I can confirm that your DIEF tools link for the article does show the triples I do not see when loading the 2021-06 Databus snapshot on the http://dbpedia.org/sparql endpoint.

See this result:

https://dbpedia.org/sparql?query=select+lang%28%3Fcomment%29++%3Fcomment+where+%7B%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FEating_your_own_dog_food%3E+rdfs%3Acomment+%3Fcomment%7D
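
Decoded, that URL just asks for the language tags of all rdfs:comment values of the resource. A small sketch that runs the same query (the format parameter is an assumption; the query relies on the endpoint's predefined rdfs prefix, like the original):

```scala
import java.net.{URL, URLEncoder}
import scala.io.Source

// Run the decoded query against the public endpoint and print the raw result.
object CheckCommentLangs extends App {
  val query =
    """select lang(?comment) ?comment
      |where { <http://dbpedia.org/resource/Eating_your_own_dog_food> rdfs:comment ?comment }""".stripMargin

  val url = new URL("https://dbpedia.org/sparql?format=text%2Fcsv&query=" +
    URLEncoder.encode(query, "UTF-8"))

  println(Source.fromInputStream(url.openStream(), "UTF-8").mkString)
}
```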

My main concern is that the Databus dump was apparently reported as successful, yet for a number of articles it did not dump the English abstracts and comments.

The DBpedia team needs to figure out why these comments were not dumped, as this could be an indication that extraction errors are not properly caught and reported.


As a side note for the DIEF tool, I see I used the wrong URL form:

http://dief.tools.dbpedia.org/server/extraction/en/Eating_your_own_dog_food

but I was not expecting a Java exception, com.sun.jersey.api.NotFoundException: null for uri.

Would it be possible to add some argument checking and produce a slightly more informative error page?
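
For illustration, a hedged sketch of what such a check could look like with Jersey, mapping the NotFoundException above to a short hint instead of a bare exception (class name and message text are made up, not the actual server code):

```scala
import javax.ws.rs.core.{MediaType, Response}
import javax.ws.rs.ext.{ExceptionMapper, Provider}
import com.sun.jersey.api.NotFoundException

// Illustrative only: turn Jersey's NotFoundException into a readable 404 page
// that points the caller at the expected extract endpoint.
@Provider
class FriendlyNotFoundMapper extends ExceptionMapper[NotFoundException] {
  override def toResponse(ex: NotFoundException): Response =
    Response.status(Response.Status.NOT_FOUND)
      .`type`(MediaType.TEXT_PLAIN)
      .entity(
        "No extraction resource found for this path.\n" +
        "Did you mean the extract endpoint? e.g.\n" +
        "/server/extraction/en/extract?title=Eating+your+own+dog+food&format=trix&extractors=custom")
      .build()
}
```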

JJ-Author commented 3 years ago

@jlareck can you patch the HTML so that it shows the commit (optionally also the branch) it is using, with a hyperlink to GitHub?

No, just write a commit that prints out the current commit hash of the build on the DIEF extractor webpage.

@JJ-Author, as I understand it, I need to add the current commit information to the DIEF server page and link it to the commit on GitHub. Should I add it somewhere at the top of the page (for example, near the ontology) or in the footer?

[Screenshot of the DIEF server page, 2021-09-16]

Yes, I think this would make sense, so that we always know whether we are using the latest code.

JJ-Author commented 3 years ago

@jlareck I can confirm that your DIEF tools link for the article does show the triples I do not see when loading the 2021-06 Databus snapshot on the http://dbpedia.org/sparql endpoint.

See this result:

https://dbpedia.org/sparql?query=select+lang%28%3Fcomment%29++%3Fcomment+where+%7B%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FEating_your_own_dog_food%3E+rdfs%3Acomment+%3Fcomment%7D

My main concern is that the Databus dump was apparently reported as successful, yet for a number of articles it did not dump the English abstracts and comments.

@Vehnem @kurzum Maybe it makes sense to have some metrics here, like the total number of abstracts and the number of abstracts compared to the total number of entities, so that we can track whether abstracts are getting more or fewer from release to release? And maybe track this for other artifacts as well? Maybe using the VoID mods?
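
A rough sketch of such a metric, simply counting dbo:abstract triples in a locally downloaded abstracts dump per release and comparing (file names are placeholders, not the actual Databus artifact names; assumes a line-based serialization with full predicate URIs):

```scala
import scala.io.Source

// Count abstract triples per release dump and compare the two releases.
object AbstractCounts extends App {
  def countAbstracts(path: String): Long =
    Source.fromFile(path, "UTF-8").getLines()
      .count(l => !l.startsWith("#") && l.contains("<http://dbpedia.org/ontology/abstract>"))

  val june   = countAbstracts("long-abstracts_lang=en_2021-06.ttl") // placeholder file name
  val august = countAbstracts("long-abstracts_lang=en_2021-08.ttl") // placeholder file name
  println(s"2021-06: $june   2021-08: $august   delta: ${august - june}")
}
```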

The DBpedia team needs to figure out why these comments were not dumped, as this could be an indication that extraction errors are not properly caught and reported.

@jlareck do you know whether exception statistics / a summary for the extraction are written in general? I know there is logging of exceptions.

@pkleef my best guess is that the commit used for the 2021-06 extraction did not have the fix yet. @Vehnem @jlareck is there a way to determine the commit hash for a MARVIN extraction now? But in general I assume there is a gap in terminology. Successful so far means nothing crashed or aborted (so ideally no missing files). But indeed, missing triples or the number of exceptions per extractor could be used as quality indicators to judge a "successful" release in the future.

jlareck commented 3 years ago

@JJ-Author I think exception statistics and a summary are written for each language wikidump separately. So, as I understand it, we can see how many pages were successfully extracted and how many failed, for example after the extraction of the English wikidump.

jlareck commented 3 years ago

Well, I found the reason why Eating_your_own_dog_food was not extracted. During the English extraction, there were too many requests to the Wikimedia API, and that is why the extraction of this page failed. Here is the error for this page:

Exception; en; Main Extraction at 46:38.508s for 4 datasets; Main Extraction failed for instance http://dbpedia.org/resource/Eating_your_own_dog_food: Server returned HTTP response code: 429 for URL: https://en.wikipedia.org/w/api.php 

This error occurred very often during the June extraction: for the English dump, there were 910000 Server returned HTTP response code: 429 exceptions during the extraction.
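
For reference, a count like this can be reproduced from the extraction log roughly as follows (the log file name is an assumption):

```scala
import scala.io.Source

// Count how many times the 429 error shows up in an extraction log.
object Count429 extends App {
  val hits = Source.fromFile("extraction.en.log", "UTF-8").getLines()
    .count(_.contains("Server returned HTTP response code: 429"))
  println(s"429 responses: $hits")
}
```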

jlareck commented 3 years ago

Marvin and I checked the logs one more time today, and this exception occurred not 910000 but 455000 times during the June extraction (I didn't calculate it correctly the first time). But anyway, this is still a huge number. The other interesting point is that during the August extraction this error occurred only 723 times for English (nothing related to requests in the Extraction Framework was changed during this period). Also, we compared the number of triples in each dataset (June and August): in June there were 5460872 extracted abstracts and in August 5952058. This is still very confusing and we are trying to investigate it further.

JJ-Author commented 3 years ago

The Wikipedia API is heavily used, so maybe there needs to be some kind of request control that does not fire too many requests per second. I can imagine that they have a load balancer, so when there is not much load on the system they are gracious. It is maybe also worth having a look here: https://www.mediawiki.org/wiki/API:Etiquette
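
A minimal sketch of such request control, spacing out calls to the MediaWiki API (rate and class name are illustrative, not framework code):

```scala
// Allow at most `maxPerSecond` calls; callers invoke acquire() before each request.
class RequestThrottle(maxPerSecond: Int) {
  private val minIntervalNanos = 1000000000L / maxPerSecond
  private var lastCall = 0L

  def acquire(): Unit = synchronized {
    val waitNanos = lastCall + minIntervalNanos - System.nanoTime()
    if (waitNanos > 0) Thread.sleep(waitNanos / 1000000L, (waitNanos % 1000000L).toInt)
    lastCall = System.nanoTime()
  }
}

// usage: throttle.acquire() before each call to https://en.wikipedia.org/w/api.php
```

The etiquette page also describes the maxlag parameter, which lets clients back off automatically when the Wikimedia servers are lagged.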

JJ-Author commented 3 years ago

As I said, the number of triples is more expressive when compared to the number of articles extracted. But to me it seems reasonable: fewer failed requests -> more triples.