VladimirAlexiev opened this issue 7 years ago
Hope to have a look at this before the next extraction.
I was told to have a look at the ImageExtractor for this issue. Surprisingly, the problems are not caused by the ImageExtractor, because it extracts images to `foaf:depiction` instead of `dbp:image`. But since it is still in use anyway, I ran a few tests and reworked the code to find and extract more images.
In the case of the mentioned Berlin page, the old extractor only extracted one image.
The reworked extractor now extracts a total of 94 images from the Berlin page. The links are generated in the way you described and should be working as intended.
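For illustration, here is a minimal sketch of how an extracted file name can be turned into such a link; the helper name and the exact normalization/encoding are my assumptions, not the actual framework code:

```scala
// Hypothetical helper: build an "actionable" Commons link for an extracted file name.
// The real ImageExtractor's normalization and encoding rules may differ.
object ImageLinkSketch {
  private val CommonsFilePath = "http://commons.wikimedia.org/wiki/Special:FilePath/"

  /** Turn a wikitext file reference like "File:Brandenburger Tor abends.jpg"
    * into a Special:FilePath URL that resolves to the media file itself. */
  def toFilePathUrl(rawName: String): String = {
    val name = rawName.trim
      .stripPrefix("File:")
      .stripPrefix("Image:")
      .replace(' ', '_') // MediaWiki titles use underscores instead of spaces
    CommonsFilePath + java.net.URLEncoder.encode(name, "UTF-8")
  }
}

// ImageLinkSketch.toFilePathUrl("File:Brandenburger Tor abends.jpg")
//   -> http://commons.wikimedia.org/wiki/Special:FilePath/Brandenburger_Tor_abends.jpg
```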
Used on the complete German article dump it extracts 22 million triples, which is about 3.5 times as many images as the old version.
The missing picture on the Lindsay Anderson page should be filtered due to its non-free copyright license, so that is not a bug.
@Termilion Could I take a look at the images for Berlin? Hundreds of images are not necessarily a good thing.
> non-free copyright license

You're right. Checked:
@VladimirAlexiev I definitely see your point. At first my goal was to get as many images as possible without thinking about the importance of the images.
Currently my code traverses the page tree recursively, limited by the configurable max depth. My first idea was to simply reduce the recursion depth (since more important images should appear less deeply embedded), but that didn't have as much of an effect as I predicted. Without recursion we still have 78 images. If we want to narrow it down any further, I would need to implement a check for specific patterns in which useful images appear. Let me know your thoughts about this, and I'll have a look at possible solutions.
Here is the list of images extracted from the Berlin page:
You've done a lot more than just images from infoboxes!
@VladimirAlexiev Is it only supposed to get images from the infoboxes? That would be quite a big misunderstanding on my part, but would explain some strange design choices in the old code.
The ImageExtractor is an old extractor that works on the article dumps. What I called "page tree" is the extractor's internal representation of a WikiPage: a so-called PageNode with children that can be Text-/Link-/Table-/... nodes, which may have children of their own. Images are afaik only in link or text nodes; for every other type of node I call the method again to check its children for these node types, and that's the recursion I was talking about. (This way I'll e.g. get every image that might be contained in a table or something like that.) I just improved the way the ImageExtractor uses this structure and finds images; I didn't want to change the base concept of it too dramatically.
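A minimal sketch of that traversal, with simplified stand-in classes (not the framework's real Node types) and a hypothetical max-depth check:

```scala
// Simplified stand-ins for the extractor's internal page tree; the framework's
// real Node classes (TextNode, InternalLinkNode, TableNode, ...) are richer.
sealed trait Node { def children: Seq[Node] }
case class TextNode(text: String, children: Seq[Node] = Nil)   extends Node
case class LinkNode(target: String, children: Seq[Node] = Nil) extends Node
case class TableNode(children: Seq[Node])                      extends Node
case class PageNode(children: Seq[Node])                       extends Node

object ImageWalkSketch {
  // Matches File:/Image: references embedded in plain text nodes.
  private val FileRef = """(?i)(?:File|Image):([^|\]\n]+)""".r

  /** Collect image file names, descending at most `maxDepth` levels into the tree. */
  def collectImages(node: Node, maxDepth: Int): Seq[String] = node match {
    case _ if maxDepth < 0 =>
      Nil
    case LinkNode(target, _) if target.matches("(?i)(File|Image):.*") =>
      Seq(target.replaceFirst("(?i)(File|Image):", ""))
    case TextNode(text, _) =>
      FileRef.findAllMatchIn(text).map(_.group(1).trim).toSeq
    case other =>
      // Tables, page nodes, non-image links, ...: recurse into their children.
      other.children.flatMap(collectImages(_, maxDepth - 1))
  }
}
```

Raising the configurable max depth then simply lets this walk go deeper into nested tables and similar structures.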
The images only found with the recursion:
EDIT 1: I ran a test on Barack Obama. Everything seems to be working fine for EN.
- [x] Yes, both photomontage images and multi image images are found. I ran a quick test on enwiki:Berlin and found every image I was looking for.
- [x] Ok, if we need to get every image, I'll set the standard "max depth" a bit higher, which won't be a problem since I already ran tests with very high values and the extraction still finished in a reasonable amount of time.
- [x] I just implemented some file name regex checks for the special images, and I am using the first image as the main image. This will be the first image from the infobox in most cases. Not a perfect solution, but it works for now (a rough sketch follows below).
  E.g. the resulting main images for enwiki:Berlin, enwiki:New_York_City and enwiki:Barack_Obama:
- [x] Now only the triple generation for the special images / main image needs to be implemented.
Do we exclude special images from the normal image list? (I.e. not using the map image in `dbo:image` anymore, because it'll be in `something:map`?)
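Since the checklist above mentions file-name regex checks and a first-image heuristic, here is a rough sketch of what they could look like; the keyword patterns, names and categories below are my own assumptions, not the code from this PR:

```scala
// Hypothetical file-name checks for "special" images; the real patterns and the
// target properties used for triple generation may differ.
object SpecialImageSketch {
  private val MapImage       = """(?i).*(?:location[ _]?map|karte).*""".r
  private val FlagImage      = """(?i).*flag.*""".r
  private val CoatOfArms     = """(?i).*(?:coat[ _]of[ _]arms|wappen).*""".r
  private val SignatureImage = """(?i).*signature.*""".r

  /** Classify an image by its file name; None means it is a "normal" image. */
  def classify(fileName: String): Option[String] = fileName match {
    case CoatOfArms()     => Some("coatOfArms")
    case FlagImage()      => Some("flag")
    case MapImage()       => Some("map")
    case SignatureImage() => Some("signature")
    case _                => None
  }

  /** Main image heuristic: simply the first extracted image,
    * which in most cases is the first image of the infobox. */
  def mainImage(imagesInPageOrder: Seq[String]): Option[String] =
    imagesInPageOrder.headOption
}
```

Whether an image classified like this should also stay in the normal `dbo:image` list is exactly the open question above.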
Excellent work @Termilion, and worth presenting at the Semantics 2017 DBpedia Day if you're going there.
I wonder how this harvest from the page compares to Commons lists:
This is extracted as RDF:
> Do we exclude special images from the normal image list?

I'd say keep them.
@chile12 and @jimkont, how to approach https://github.com/dbpedia/ontology-tracker/issues/19? Maybe you can add it as an item for the meeting? (I won't be there).
Thanks @VladimirAlexiev, but sadly I won't be able to go to the Semantics this year.
This should now be ready to be merged after the properties are updated.
@VladimirAlexiev recently we have introduced a testing methodology, see our submission for SEMANTiCS: https://svn.aksw.org/papers/2020/semantics_marvin/public.pdf
So most of the issues can be captured there. My question is: is there something from this thread that we can define as a test? /cc @Vehnem
This bug is about extracting more useful images from Wikipedia. This PR is related: https://github.com/dbpedia/extraction-framework/pull/470.
The Berlin page, as it was extracted in DBpedia, has a number of images on Wikipedia:
These are extracted as follows (http://dbpedia.org/page/Berlin):
Bugs:
- The `multiple image` template extractor should extract filenames fully (not stop at digits) and turn them into actionable links `wiki-commons:Special:FilePath/*`.
- `image_photo={{Photomontage|...`: extract the `photo[0-9]+[a-z]*=` parameters fully as well.
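To make the second bug concrete, here is a rough sketch of matching those parameters so that the full file name (digits included) is captured; the regex is my assumption, only the parameter naming follows the templates:

```scala
// Sketch: pull complete file names out of {{Photomontage|photo1a=...|photo2b=...}}
// and {{multiple image|image1=...|image2=...}} parameters, without truncating at digits.
object MontageParamSketch {
  private val FileParam = """(?i)(?:photo[0-9]+[a-z]*|image[0-9]*)\s*=\s*([^|\n}]+)""".r

  def fileNames(templateWikitext: String): Seq[String] =
    FileParam.findAllMatchIn(templateWikitext).map(_.group(1).trim).toSeq
}

// MontageParamSketch.fileNames("{{Photomontage|photo1a=Example skyline 2010.jpg|photo2b=Example tower.jpg}}")
//   -> Seq("Example skyline 2010.jpg", "Example tower.jpg"), each of which can then
//      be turned into a wiki-commons:Special:FilePath/* link.
```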
The Lindsay Anderson page, as it was extracted in DBpedia, has the following image-related info on Wikipedia:
http://dbpedia.org/page/Lindsay_Anderson extracts only `dbp:imagesize`. Bug: extract `dbp:image` and turn it into an actionable link `wiki-commons:Special:FilePath/*`. Note: #133 discusses images in the `en` namespace vs the `commons` namespace, so "actionable link" may not always mean "prepend a `commons` namespace"; someone needs to research this.