dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

Images dataset contains wrong triples #720

Open jlareck opened 2 years ago

jlareck commented 2 years ago

Issue validity

Some explanation: DBpedia Snapshot is produced every three months, see Release Frequency & Schedule, which is loaded into http://dbpedia.org/sparql . During these three months, Wikipedia changes and also the DBpedia Information Extraction Framework receives patches. At http://dief.tools.dbpedia.org/server/extraction/en/ we host a daily updated extraction web service that can extract one Wikipedia page at a time. To check whether your issue is still valid, please enter the article name, e.g. Berlin or Joe_Biden here: http://dief.tools.dbpedia.org/server/extraction/en/ If the issue persists, please post the link from your browser here:

https://dbpedia.org/page/Borysthenia_goldfussiana
https://dbpedia.org/page/Ingoldiomyces

There are more triples in the DBpedia snapshot 2021-09 that contain this issue.

Error Description

Please state the nature of your technical emergency:

It looks like ImageExtractorNew produces triples for Wikipedia pages that don't contain any images. For example, https://en.wikipedia.org/wiki/Borysthenia_goldfussiana doesn't contain any image, but ImageExtractorNew produced a triple for it with the image http://commons.wikimedia.org/wiki/Special:FilePath/T-72_ATE_South_Africa.jpg . The same happens with https://en.wikipedia.org/wiki/Ingoldiomyces : the page doesn't contain any picture, but ImageExtractorNew still produced a triple with the image https://upload.wikimedia.org/wikipedia/commons/c/cf/B%26N_nook_Logo.svg .

Pinpointing the source of the error

Where did you find the data issue? Non-exhaustive options are:

This error occurs in https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/ImageExtractorNew.scala

Details

please post the details

Wrong triples RDF snippet


<http://dbpedia.org/resource/Borysthenia_goldfussiana> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/T-72_ATE_South_Africa.jpg> .

<http://dbpedia.org/resource/Ingoldiomyces> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/B&N_nook_Logo.svg> .
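A check for triples like these could be sketched in Python as follows. This is not the ImageExtractorNew logic. It only assumes that a depiction is plausible when the file name at the end of the Special:FilePath URL is mentioned somewhere in the article's wikitext. The function names, the `Example` resource, and the sample wikitext are hypothetical.

```python
from __future__ import annotations

import re
from urllib.parse import unquote

DEPICTION = "http://xmlns.com/foaf/0.1/depiction"
FILEPATH_PREFIX = "http://commons.wikimedia.org/wiki/Special:FilePath/"
# Matches one N-Triples line whose subject, predicate and object are all IRIs.
TRIPLE_RE = re.compile(r"<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.?\s*$")


def normalize(text: str) -> str:
    """Rough approximation of file-name matching: decode %-escapes,
    treat underscores as spaces, and ignore case."""
    return unquote(text).replace("_", " ").strip().lower()


def suspicious_depictions(
    ntriples: str, wikitext_by_resource: dict[str, str]
) -> list[tuple[str, str]]:
    """Return (resource, image URL) pairs whose file name does not
    appear in the wikitext supplied for that resource."""
    flagged = []
    for line in ntriples.splitlines():
        match = TRIPLE_RE.match(line.strip())
        if not match:
            continue  # skip blank or non-IRI lines
        subject, predicate, obj = match.groups()
        if predicate != DEPICTION or not obj.startswith(FILEPATH_PREFIX):
            continue
        file_name = normalize(obj[len(FILEPATH_PREFIX):])
        wikitext = normalize(wikitext_by_resource.get(subject, ""))
        if file_name not in wikitext:
            flagged.append((subject, obj))
    return flagged
```

Feeding it the Borysthenia_goldfussiana triple together with wikitext that has no `[[File:...]]` reference would flag that triple, while a resource whose wikitext mentions the depicted file would pass.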

Expected / corrected RDF outcome snippet

Triples of this kind should be removed.

Example DBpedia resource URL(s)

Other

jaygray0919 commented 2 years ago

I have an extensive sample set that we can use to test when this issue is resolved /jay gray

jlareck commented 2 years ago

@jaygray0919 Could you please send this sample set? It looks like I resolved the issue, but I am not sure it is completely fixed (at least the produced dataset no longer contains the <http://dbpedia.org/resource/Borysthenia_goldfussiana> and <http://dbpedia.org/resource/Ingoldiomyces> triples, but it would be good to check other wrong triples).

jaygray0919 commented 2 years ago

@jlareck try using this: https://afdsi.com/sparql-species/#/specierch/gold i can explain the app if you are interested /jay

jaygray0919 commented 2 years ago

@jlareck this also worked well 6 months ago, but is now very slow/unresponsive: https://afdsi.com/search-dbpedia-tv-shows/?#genre=&language=&country=& It seems/feels like the parser is 'in a twist'. Do you see any obvious reasons for its sluggishness? /jay

jaygray0919 commented 2 years ago

@jlareck anything we can do to help out here? if possible, we'd like to feature this and related DBpedia/SPARQL apps as part of a product launch in the near future so we're motivated to help restore the images previously served by the SPARQL queries /jay

jlareck commented 2 years ago

> try using this: https://afdsi.com/sparql-species/#/specierch/gold i can explain the app if you are interested /jay

Hi @jaygray0919, thank you for providing this link with examples! I checked some triples in the image dataset of the upcoming release. As far as I can see, some of the wrong images are no longer extracted, but there are still some triples that contain images not related to the wiki page. So the image extractor that produces the data is only partially fixed.

> if possible, we'd like to feature this and related DBpedia/SPARQL apps as part of a product launch in the near future so we're motivated to help restore the images previously served by the SPARQL queries

Could you please provide more details about what you want to do?

jaygray0919 commented 2 years ago

The URL "Species" is one of our DBpedia/SPARQL applications. To reprise the above: "the content ain't right". Previously, when it "was right", the images for the queries were 100% correct (we checked extensively over a year ago - zero errors).

Our request: restore the last good version. Now, we're not so naive as to think that's easy; since the last solid data set, many changes have been applied. But the bottom line: DBpedia content has been corrupted. While we can determine that item images are corrupt, there may be other errors that also crept in somewhere during an update. It's highly unlikely that only the image files are fubar - my guess is that there are problems with other item properties.

An indicator is the performance problems we see with another SPARQL application, TV Shows. A year ago, this app worked very well. It is now very slow and produces irregular results.

We're far more concerned with Species than TV Shows and are willing to "pitch in" and find the last good dataset (the version with uncorrupted image property values). Does that make sense? Anything short-term we can do to restore an uncorrupted dataset?

jlareck commented 2 years ago

Hi @jaygray0919, sorry, but it looks like we cannot restore an uncorrupted dataset at the moment. The image dataset should have better quality in the upcoming release, but it still contains some wrong triples. I am tracking down those triples now, and we will try to fix the image extraction before the next release.

jaygray0919 commented 2 years ago

Got it. Then we'll be happy to work with you to incrementally identify misaligned images in the next release. Then you can use that list to correct a subsequent release. ITMT, the link we shared above will display - for biologics - misaligned images. It's a one-at-a-time process, but it might help you identify patterns that we cannot easily see (e.g. a consistent pairing of biologics/non-biologics). For example, there is a high concentration of military weapons in our biologic queries.

jlareck commented 2 years ago

Hi @jaygray0919, could you please check more images on your website to see whether any are still incorrect? It seems to me that I have fixed the image extraction and all images should now be correct. Thank you

jaygray0919 commented 2 years ago

Hello @jlareck - will do; will report back today/tomorrow. Thank you for doing this work.

jaygray0919 commented 2 years ago

Previous errors that have been corrected:
https://afdsi.com/sparql-species/#/specierch/gold
https://afdsi.com/sparql-species/#/specierch/green
https://afdsi.com/sparql-species/#/specierch/taurus

Small problems:
https://afdsi.com/sparql-species/#/specierch/red
Feredayia graminosa

I'll look for other errors later today

jaygray0919 commented 2 years ago

foaf:depiction

https://dbpedia.org/page/Pseudocharopa_whiteleggei https://commons.wikimedia.org/wiki/Special:Redirect/file/Lord_Howe_Island.png

https://dbpedia.org/page/Chiasmia_goldiei https://commons.wikimedia.org/wiki/Special:Redirect/file/Chiasmia_goldiei.jpg

https://dbpedia.org/page/Golden_volute https://commons.wikimedia.org/wiki/Special:Redirect/file/Iredalina_mirabilis.jpg

https://dbpedia.org/page/Pictured_rove_beetle https://commons.wikimedia.org/wiki/Special:Redirect/file/thinopinus_pictus.jpg

https://dbpedia.org/page/Tenthredo_amoena https://commons.wikimedia.org/wiki/Special:Redirect/file/Tenthredinidae_-_Tenthredo_amoena.jpg

https://dbpedia.org/page/Tenthredo_crassa https://commons.wikimedia.org/wiki/Special:Redirect/file/Tenthredinidae_-_Tenthredo_crassa-001.jpg

jlareck commented 2 years ago

> Small problems: https://afdsi.com/sparql-species/#/specierch/red Feredayia graminosa

Actually, this is the correct image. Check the page https://en.wikipedia.org/wiki/Feredayia_graminosa : this article contains 3 images. Since the current version of the image extraction extracts all pictures from a wiki page and produces multiple triples with foaf:depiction, you can show not just one picture but all of those pictures on your website. Otherwise, if you want to show only the first picture from the wiki page, you can try to use dbo:thumbnail instead of foaf:depiction .
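On the client side, handling several foaf:depiction values per resource could look roughly like the Python sketch below. It assumes the query results have already been fetched as (resource, image URL) rows. The rows and the function name are hypothetical, and the order of rows returned by a SPARQL endpoint is not guaranteed to match the order of images on the wiki page.

```python
from collections import defaultdict


def group_depictions(rows):
    """Collect all image URLs per resource, keeping the order in which
    they were returned and dropping exact duplicates."""
    grouped = defaultdict(list)
    for resource, image in rows:
        if image not in grouped[resource]:
            grouped[resource].append(image)
    return dict(grouped)


# Hypothetical result rows for a page with several images.
rows = [
    ("dbr:Feredayia_graminosa", "Special:FilePath/Image_A.jpg"),
    ("dbr:Feredayia_graminosa", "Special:FilePath/Image_B.jpg"),
    ("dbr:Feredayia_graminosa", "Special:FilePath/Image_A.jpg"),
    ("dbr:Chiasmia_goldiei", "Special:FilePath/Image_C.jpg"),
]

all_images = group_depictions(rows)
# An app that displays a single picture can take the first entry per resource.
first_image = {resource: images[0] for resource, images in all_images.items()}
```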

> foaf:depiction
>
> https://dbpedia.org/page/Pseudocharopa_whiteleggei https://commons.wikimedia.org/wiki/Special:Redirect/file/Lord_Howe_Island.png
>
> https://dbpedia.org/page/Chiasmia_goldiei https://commons.wikimedia.org/wiki/Special:Redirect/file/Chiasmia_goldiei.jpg

Regarding this, I think it is one more issue in the image extraction that I didn't notice before. This time it is related to creating incorrect links to Wikimedia images.

jaygray0919 commented 2 years ago

Unfortunately, your (sensible) exception handling is difficult to implement. We 'grab' the first instance and do not iterate on subsequent instances. And dimensions for dbo:thumbnail do not look good on desktop (they are passable on mobile, but we need to keep it simple).

Returning to the big picture, your corrections seem to handle the glaring issues (biologics like Russian tanks; aircraft; etc.) If you can correct the null values, that will further improve the display. Bottom line: queries are dramatically improved; thank you for that. /jay

JJ-Author commented 2 years ago

@jlareck good first milestone :-). But can you please write the documentation for the images dataset https://databus.dbpedia.org/dbpedia/generic/images/ and explain what to expect there? I think this is important knowledge for users to understand the difference between foaf:depiction, dbo:thumbnail and foaf:thumbnail. For me it is confusing: I had to look in the code to get an impression, and that is not good...

@jaygray0919 thanks for testing and finding issues. But I do not understand your issue with multiple images; there seems to be no complexity in that, right? Just write the SPARQL query so that only one image is returned, or use the thumbnail and cut off the size parameter at the end?
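Both suggestions can be sketched briefly in Python. The query text is only an illustration: it assumes the resources of interest are typed dbo:Species and uses the standard SPARQL 1.1 SAMPLE aggregate, which returns an arbitrary image per resource, not necessarily the first one on the page. The helper assumes the size is carried as a query string such as `?width=300` on the thumbnail URL. The names `ONE_IMAGE_QUERY` and `strip_size_parameter` are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Illustrative query: one (arbitrary) depiction per resource.
ONE_IMAGE_QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbo:  <http://dbpedia.org/ontology/>

SELECT ?species (SAMPLE(?image) AS ?oneImage)
WHERE {
  ?species a dbo:Species ;
           foaf:depiction ?image .
}
GROUP BY ?species
LIMIT 20
"""


def strip_size_parameter(thumbnail_url: str) -> str:
    """Drop the query string (assumed to hold the size, e.g. ?width=300)
    and any fragment, keeping scheme, host and path unchanged."""
    parts = urlsplit(thumbnail_url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```

For example, `strip_size_parameter("http://commons.wikimedia.org/wiki/Special:FilePath/Example.jpg?width=300")` yields the same URL without `?width=300`.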

jaygray0919 commented 2 years ago

@JJ-Author I'll revisit the SPARQL query, which has some age to it. When doing the original engineering, we did not see or foresee the need to test for more than one image; our single select on foaf:depiction worked 100% of the time. However, it will be much more difficult to read multiple properties and test for multiple images. Based on @jlareck's corrections, we're at ~90% of our previous results, which is acceptable. I'm reluctant to make an isolated change to a large program at this time. When we do reopen the beast, we'd like to add new features like autosuggest to limit the scope of the query. The current version hits DBpedia fairly hard, and we'd like to implement a more refined query. We'd also like to introduce a "You also may be interested in" feature using a reasoner (which, of course, adds back complexity). Bottom line: we'd like to help improve data quality through testing, but postpone changes to the app until we have a new plan.

jlareck commented 2 years ago

@JJ-Author I made a pull request with the documentation for the image dataset: https://github.com/dbpedia/marvin-config/pull/4 . Could you please check it?