Open jlareck opened 2 years ago
I have an extensive sample set that we can use to test when this issue is resolved /jay gray
@jaygray0919 Could you please send this sample set? It looks like I resolved the issue, but I am not sure it is completely fixed (at least the produced dataset doesn't contain the <http://dbpedia.org/resource/Borysthenia_goldfussiana> and <http://dbpedia.org/resource/Ingoldiomyces> triples, but it would be good to check for other wrong triples).
@jlareck try using this: https://afdsi.com/sparql-species/#/specierch/gold i can explain the app if you are interested /jay
@jlareck this also worked well 6 months ago, but is now very slow/unresponsive: https://afdsi.com/search-dbpedia-tv-shows/?#genre=&language=&country=& it seems/feels-like the parser is 'in a twist' do you see any obvious reasons for its sluggishness? /jay
@jlareck anything we can do to help out here? if possible, we'd like to feature this and related DBpedia/SPARQL apps as part of a product launch in the near future so we're motivated to help restore the images previously served by the SPARQL queries /jay
Hi @jaygray0919, thank you for providing this link with examples! I checked some triples in the image dataset of the upcoming release and, as far as I can see, some wrong images are no longer extracted, but there are still some triples that contain images not related to the wikipage. So the image extractor that produces the data is only partially fixed.
if possible, we'd like to feature this and related DBpedia/SPARQL apps as part of a product launch in the near future so we're motivated to help restore the images previously served by the SPARQL queries
Could you please provide more details about what you want to do?
The URL Species is one of our DBpedia/SPARQL applications. To reprise the above: "the content ain't right"
Previously, when it "was right" the images for the queries were 100% correct (we checked extensively over a year ago - zero errors).
Our request: restore the last good version.
Now, we're not so naive to think that's easy; since the last solid data set, many changes have been applied.
But the bottom line: DBpedia content has been corrupted.
While we can determine that item images are corrupt, there may be other errors that also crept in somewhere during an update. It's highly unlikely that only the image files are fubar - my guess is that there are problems with other item properties.
An indicator is the performance problems we see with another SPARQL application - TV Shows
A year ago, this app worked very well. It is now very slow and produces irregular results.
We're far more concerned with Species than TV Shows and are willing to "pitch in" and find the last good dataset (the version with uncorrupted image property values).
Does that make sense? Anything short-term we can do to restore an uncorrupted dataset?
Hi @jaygray0919, sorry, but it looks like we cannot restore an uncorrupted dataset at the moment. The image dataset should have better quality in the upcoming release, but it still contains some wrong triples. I am tracking down those triples now, and we will try to fix image extraction by the next release.
Got it. Then we'll be happy to work with you to incrementally identify misaligned images in the next release. Then you can use that list to correct a subsequent release. ITMT, the link we shared above will display - for biologics - misaligned images. It's a one-at-a-time process, but it might help you identify patterns that we cannot easily see (e.g. a consistent pairing of biologics/non-biologics). For example, there is a high concentration of military weapons in our biologic queries.
Hi @jaygray0919, could you please check more images on your website to see whether any are still incorrect? It seems to me that I fixed the image extraction and all images should be correct now. Thank you
Hello @jlareck - will do; will report back today/tomorrow Thank you for doing this work.
Previous errors that have been corrected: https://afdsi.com/sparql-species/#/specierch/gold https://afdsi.com/sparql-species/#/specierch/green https://afdsi.com/sparql-species/#/specierch/taurus
Small problems: https://afdsi.com/sparql-species/#/specierch/red Feredayia graminosa
I'll look for other errors later today
foaf:depiction
https://dbpedia.org/page/Pseudocharopa_whiteleggei https://commons.wikimedia.org/wiki/Special:Redirect/file/Lord_Howe_Island.png
https://dbpedia.org/page/Chiasmia_goldiei https://commons.wikimedia.org/wiki/Special:Redirect/file/Chiasmia_goldiei.jpg
https://dbpedia.org/page/Golden_volute https://commons.wikimedia.org/wiki/Special:Redirect/file/Iredalina_mirabilis.jpg
https://dbpedia.org/page/Pictured_rove_beetle https://commons.wikimedia.org/wiki/Special:Redirect/file/thinopinus_pictus.jpg
https://dbpedia.org/page/Tenthredo_amoena https://commons.wikimedia.org/wiki/Special:Redirect/file/Tenthredinidae_-_Tenthredo_amoena.jpg
https://dbpedia.org/page/Tenthredo_crassa https://commons.wikimedia.org/wiki/Special:Redirect/file/Tenthredinidae_-_Tenthredo_crassa-001.jpg
Small problems: https://afdsi.com/sparql-species/#/specierch/red Feredayia graminosa
Actually, this is the correct image. Check the page https://en.wikipedia.org/wiki/Feredayia_graminosa : this article contains 3 images. I think that, as the current version of image extraction extracts all pictures from wikipages and produces multiple triples with foaf:depiction, you can show not only one picture but all of those pictures on your website. Otherwise, if you want to show only the first picture from the wikipage, you can try to use dbo:thumbnail instead of foaf:depiction.
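A minimal sketch of the "show all pictures" option, assuming the app receives query results in the standard SPARQL 1.1 JSON results format and that the query binds variables named `species` and `img` (both names are hypothetical, not taken from the app). The sample data is hand-written with placeholder URLs, not real DBpedia output.

```python
from collections import defaultdict


def group_depictions(results, subject_var="species", image_var="img"):
    """Collect every image URL per resource, keeping result order.

    `results` is a dict in SPARQL 1.1 JSON results shape.
    `subject_var` and `image_var` are assumed variable names.
    """
    grouped = defaultdict(list)
    for row in results["results"]["bindings"]:
        images = grouped[row[subject_var]["value"]]
        image = row.get(image_var, {}).get("value")
        # Skip unbound images and repeated rows for the same image.
        if image and image not in images:
            images.append(image)
    return dict(grouped)


# Illustrative input only: one resource with two depictions, one with none.
sample = {
    "results": {
        "bindings": [
            {"species": {"value": "http://example.org/A"},
             "img": {"value": "http://example.org/a1.jpg"}},
            {"species": {"value": "http://example.org/A"},
             "img": {"value": "http://example.org/a2.jpg"}},
            {"species": {"value": "http://example.org/B"}},
        ]
    }
}

grouped_sample = group_depictions(sample)
```

With this shape the front end can still take the first entry of each list when it only wants one picture, so the single-image display keeps working while the other depictions stay available.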
foaf:depiction
https://dbpedia.org/page/Pseudocharopa_whiteleggei https://commons.wikimedia.org/wiki/Special:Redirect/file/Lord_Howe_Island.png
https://dbpedia.org/page/Chiasmia_goldiei https://commons.wikimedia.org/wiki/Special:Redirect/file/Chiasmia_goldiei.jpg
Regarding this, I think it is one more issue in image extraction that I didn't notice before; this one is related to creating incorrect links to Wikimedia images.
Unfortunately, your (sensible) exception handling is difficult to implement.
We 'grab' the first instance and do not iterate on subsequent instances.
And dimensions for dbo:thumbnail do not look good on desktop (they are passable on mobile, but we need to keep it simple).
Returning to the big picture, your corrections seem to handle the glaring issues (biologics like Russian tanks; aircraft; etc.) If you can correct the null values, that will further improve the display. Bottom line: queries are dramatically improved; thank you for that. /jay
@jlareck good first milestone :-). But can you please write the documentation for the images dataset https://databus.dbpedia.org/dbpedia/generic/images/ and explain what to expect there? I think this is important knowledge for users to understand the difference between foaf:depiction, dbo:thumbnail and foaf:thumbnail. For me it is confusing; I had to look in the code to get an impression, which is not good...
@jaygray0919 thanks for testing and finding issues. But I do not understand your issue with multiple images; there seems to be no complexity in that, right? Just write the SPARQL query so that only one image is returned? Or use the thumbnail and cut off the size parameter at the end?
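Both suggestions can be sketched briefly. This is only an illustration under stated assumptions: the query selects resources typed `dbo:Species` (the app's real query is not shown in this thread), `SAMPLE` returns an arbitrary depiction rather than necessarily the first one on the page, and the thumbnail URL is assumed to carry its size as a query string such as `?width=300`. The helper name `strip_size_parameter` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Option 1: let the endpoint return a single image per resource.
# SAMPLE picks one (arbitrary) value out of each group.
ONE_IMAGE_QUERY = """
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?species (SAMPLE(?img) AS ?image)
WHERE {
  ?species a dbo:Species ;
           foaf:depiction ?img .
}
GROUP BY ?species
LIMIT 20
"""


def strip_size_parameter(thumbnail_url):
    """Option 2: drop the query string (assumed to hold the size) from a URL."""
    parts = urlsplit(thumbnail_url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))


# Placeholder file name, used only to show the transformation.
full_size = strip_size_parameter(
    "http://commons.wikimedia.org/wiki/Special:FilePath/Example.jpg?width=300"
)
```

Option 1 keeps the client unchanged apart from the query text; option 2 keeps the query on dbo:thumbnail and fixes the dimensions on the client side.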
@JJ-Author I'll revisit the SPARQL query, which has some age to it.
When doing the original engineering, we did not see or foresee the need to test for more than one image; our single select on foaf:depiction worked 100% of the time.
However, it will be much more difficult to read multiple properties and test for multiple images.
Based on @jlareck's corrections, we're at ~90% of our previous results, which is acceptable.
I'm reluctant to make an isolated change to a large program at this time.
When we do reopen the beast, we'd like to add new features like autosuggest to limit the scope of the query.
The current version hits DBpedia fairly hard, and we'd like to implement a more refined query.
We'd also like to introduce a "You also may be interested in" using a reasoner (which, of course, adds back complexity).
Bottom line: we'd like to help improve data quality thru testing, but postpone changes to the app until we have a new plan.
@JJ-Author I made a pull request with the documentation for the image dataset: https://github.com/dbpedia/marvin-config/pull/4 . Could you please check it?
Issue validity
https://dbpedia.org/page/Borysthenia_goldfussiana https://dbpedia.org/page/Ingoldiomyces There are more triples in the DBpedia snapshot 2021-09 that contain this issue
Error Description
It looks like ImageExtractorNew produces triples from Wikipedia pages that don't contain images. For example, https://en.wikipedia.org/wiki/Borysthenia_goldfussiana doesn't contain any image, but ImageExtractorNew produced a triple with the image http://commons.wikimedia.org/wiki/Special:FilePath/T-72_ATE_South_Africa.jpg from it. The same issue occurs with the page https://en.wikipedia.org/wiki/Ingoldiomyces : it doesn't contain any picture, but ImageExtractorNew also produced a triple with the image https://upload.wikimedia.org/wikipedia/commons/c/cf/B%26N_nook_Logo.svg
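One way to hunt for more triples of this kind is a rough sanity check: flag every foaf:depiction triple whose image file name never appears in the source page's wikitext. The sketch below is a heuristic under stated assumptions, not part of the extraction framework: the function names are hypothetical, the wikitext is assumed to be fetched separately and passed in as a dict, and images that legitimately come from elsewhere (for example via templates) would be flagged as false positives.

```python
import re
from urllib.parse import unquote

# Matches a simple N-Triples line whose object is an IRI.
TRIPLE = re.compile(r"<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.")
DEPICTION = "http://xmlns.com/foaf/0.1/depiction"


def image_file_name(image_url):
    """Return the decoded file name at the end of an image URL."""
    last_segment = image_url.rsplit("/", 1)[-1].split("?", 1)[0]
    return unquote(last_segment).replace("_", " ")


def suspicious_depictions(ntriples_lines, wikitext_by_resource):
    """List (resource, image) pairs whose image is not named in the wikitext."""
    flagged = []
    for line in ntriples_lines:
        match = TRIPLE.match(line.strip())
        if not match or match.group(2) != DEPICTION:
            continue
        subject, _, image = match.groups()
        wikitext = wikitext_by_resource.get(subject, "").replace("_", " ")
        if image_file_name(image).lower() not in wikitext.lower():
            flagged.append((subject, image))
    return flagged


# The triple reported in this issue, checked against placeholder wikitext.
lines = [
    "<http://dbpedia.org/resource/Ingoldiomyces> "
    "<http://xmlns.com/foaf/0.1/depiction> "
    "<http://commons.wikimedia.org/wiki/Special:FilePath/B&N_nook_Logo.svg> ."
]
pages = {"http://dbpedia.org/resource/Ingoldiomyces": "Placeholder text, no image."}
flagged = suspicious_depictions(lines, pages)
```

Anything the check flags still needs a manual look, but it narrows a large dump down to a short candidate list.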
Pinpointing the source of the error
This error occurs in https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/ImageExtractorNew.scala
Details
<http://dbpedia.org/resource/Ingoldiomyces> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/B&N_nook_Logo.svg> .