VladimirAlexiev opened this issue 7 years ago
Hope to have a look at this before the next extraction.
I was told to have a look at the ImageExtractor for this issue. Surprisingly, the problems are not caused by the ImageExtractor, because it extracts images to `foaf:depiction` instead of `dbp:image`. But since it is still in use anyway, I ran a few tests and reworked the code to find and extract more images.
In the case of the mentioned Berlin page, the old extractor only extracted one image.
The reworked extractor now extracts a total of 94 images from the Berlin page. The links are generated in the way you described and should be working as intended.
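For illustration, here is a minimal sketch of how an extracted file name can be turned into such a link; the helper name and the exact normalization/encoding are my assumptions, not the actual framework code:

```scala
// Hypothetical helper: build an "actionable" Commons link for an extracted file name.
// The real ImageExtractor's normalization and encoding rules may differ.
object ImageLinkSketch {
  private val CommonsFilePath = "http://commons.wikimedia.org/wiki/Special:FilePath/"

  /** Turn a wikitext file reference like "File:Brandenburger Tor abends.jpg"
    * into a Special:FilePath URL that resolves to the media file itself. */
  def toFilePathUrl(rawName: String): String = {
    val name = rawName.trim
      .stripPrefix("File:")
      .stripPrefix("Image:")
      .replace(' ', '_') // MediaWiki titles use underscores instead of spaces
    CommonsFilePath + java.net.URLEncoder.encode(name, "UTF-8")
  }
}

// ImageLinkSketch.toFilePathUrl("File:Brandenburger Tor abends.jpg")
//   -> http://commons.wikimedia.org/wiki/Special:FilePath/Brandenburger_Tor_abends.jpg
```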
Used on the complete German article dump it extracts 22 million triples, which is about 3.5 times as many images as the old version.
The missing picture on the Lindsay Anderson page should be filtered due to its non-free copyright license, so that is not a bug.
@Termilion Could I take a look at the images for Berlin? Hundreds of images are not necessarily a good thing.
> non-free copyright license

You're right. Checked:
@VladimirAlexiev I definitely see your point. At first my goal was to get as many images as possible without thinking about the importance of the images.
Currently my code traverses the page tree recursively, limited by the configurable max depth. My first idea was to simply reduce the recursion depth (since more important images should appear less deeply embedded), but that didn't have as much of an effect as I predicted. Without recursion we still have 78 images. If we want to narrow it down any further, I would need to implement a check for specific patterns in which useful images appear. Let me know your thoughts about this, and I'll have a look at possible solutions.
Here is the list of images extracted from the Berlin page:
You've done a lot more than just images from infoboxes!
@VladimirAlexiev Is it only supposed to get images from the infoboxes? That would be quite a big misunderstanding on my part, but would explain some strange design choices in the old code.
The ImageExtractor is an old extractor that works on the article dumps. What I called "page tree" is the extractor's internal representation of a WikiPage: a so-called PageNode with children that can be Text-/Link-/Table-/... nodes, which may have children of their own. Images are afaik only in link or text nodes; for every other type of node I call the method again to check its children for these node types, and that's the recursion I was talking about. (This way I'll e.g. get every image that might be contained in a table or something like that.) I just improved the way the ImageExtractor uses this structure and finds images; I didn't want to change the base concept of it too dramatically.
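A minimal sketch of that traversal, with simplified stand-in classes (not the framework's real Node types) and a hypothetical max-depth check:

```scala
// Simplified stand-ins for the extractor's internal page tree; the framework's
// real Node classes (TextNode, InternalLinkNode, TableNode, ...) are richer.
sealed trait Node { def children: Seq[Node] }
case class TextNode(text: String, children: Seq[Node] = Nil)   extends Node
case class LinkNode(target: String, children: Seq[Node] = Nil) extends Node
case class TableNode(children: Seq[Node])                      extends Node
case class PageNode(children: Seq[Node])                       extends Node

object ImageWalkSketch {
  // Matches File:/Image: references embedded in plain text nodes.
  private val FileRef = """(?i)(?:File|Image):([^|\]\n]+)""".r

  /** Collect image file names, descending at most `maxDepth` levels into the tree. */
  def collectImages(node: Node, maxDepth: Int): Seq[String] = node match {
    case _ if maxDepth < 0 =>
      Nil
    case LinkNode(target, _) if target.matches("(?i)(File|Image):.*") =>
      Seq(target.replaceFirst("(?i)(File|Image):", ""))
    case TextNode(text, _) =>
      FileRef.findAllMatchIn(text).map(_.group(1).trim).toSeq
    case other =>
      // Tables, page nodes, non-image links, ...: recurse into their children.
      other.children.flatMap(collectImages(_, maxDepth - 1))
  }
}
```

Raising the configurable max depth then simply lets this walk go deeper into nested tables and similar structures.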
The images only found with the recursion:
EDIT 1: I ran a test on Barack Obama. Everything seems to be working fine for EN.
- [x] Yes, both photomontage images and multi image images are found. I ran a quick test on enwiki:Berlin and found every image I was looking for.
- [x] Ok, if we need to get every image, I'll set the standard "max depth" a bit higher, which won't be a problem since I already ran tests with very high values and the extraction still finished in a reasonable amount of time.
- [x] I just implemented some file name regex checks for the special images, and I am using the first image as the main image. This will be the first image from the infobox in most cases. Not a perfect solution, but it works for now (a rough sketch follows below).
  E.g. the resulting main images for enwiki:Berlin, enwiki:New_York_City and enwiki:Barack_Obama:
- [x] Now only the triple generation for the special images / main image needs to be implemented.
Do we exclude special images from the normal image list? (I.e. not using the map image in `dbo:image` anymore, because it'll be in `something:map`?)
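Since the checklist above mentions file-name regex checks and a first-image heuristic, here is a rough sketch of what they could look like; the keyword patterns, names and categories below are my own assumptions, not the code from this PR:

```scala
// Hypothetical file-name checks for "special" images; the real patterns and the
// target properties used for triple generation may differ.
object SpecialImageSketch {
  private val MapImage       = """(?i).*(?:location[ _]?map|karte).*""".r
  private val FlagImage      = """(?i).*flag.*""".r
  private val CoatOfArms     = """(?i).*(?:coat[ _]of[ _]arms|wappen).*""".r
  private val SignatureImage = """(?i).*signature.*""".r

  /** Classify an image by its file name; None means it is a "normal" image. */
  def classify(fileName: String): Option[String] = fileName match {
    case CoatOfArms()     => Some("coatOfArms")
    case FlagImage()      => Some("flag")
    case MapImage()       => Some("map")
    case SignatureImage() => Some("signature")
    case _                => None
  }

  /** Main image heuristic: simply the first extracted image,
    * which in most cases is the first image of the infobox. */
  def mainImage(imagesInPageOrder: Seq[String]): Option[String] =
    imagesInPageOrder.headOption
}
```

Whether an image classified like this should also stay in the normal `dbo:image` list is exactly the open question above.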
Excellent work @Termilion, and worth presenting at the Semantics 2017 DBpedia Day if you're going there.
I wonder how this harvest from the page compares to Commons lists:
This is extracted as RDF:
> Do we exclude special images from the normal image list?

I'd say keep them.
@chile12 and @jimkont, how to approach https://github.com/dbpedia/ontology-tracker/issues/19? Maybe you can add it as an item for the meeting? (I won't be there).
Thanks @VladimirAlexiev, but sadly I won't be able to go to the Semantics this year.
This should now be ready to be merged after the properties are updated.
@VladimirAlexiev recently we have introduced a testing methodology, see our submission for SEMANTiCS: https://svn.aksw.org/papers/2020/semantics_marvin/public.pdf
So most of the issues can be captured there. My question is: is there something from this thread that we can define as a test? /cc @Vehnem
This bug is about extracting more useful images from Wikipedia. This PR is related: https://github.com/dbpedia/extraction-framework/pull/470.
The Berlin page, as it was extracted in DBpedia, has a number of images on Wikipedia:
These are extracted as follows (http://dbpedia.org/page/Berlin):
Bugs:
- The `multiple image` template extractor should extract filenames fully (not stop at digits) and turn them into actionable links `wiki-commons:Special:FilePath/*`.
- `image_photo={{Photomontage|...`: extract the `photo[0-9]+[a-z]*=` parameters fully as well.
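To make the second bug concrete, here is a rough sketch of matching those parameters so that the full file name (digits included) is captured; the regex is my assumption, only the parameter naming follows the templates:

```scala
// Sketch: pull complete file names out of {{Photomontage|photo1a=...|photo2b=...}}
// and {{multiple image|image1=...|image2=...}} parameters, without truncating at digits.
object MontageParamSketch {
  private val FileParam = """(?i)(?:photo[0-9]+[a-z]*|image[0-9]*)\s*=\s*([^|\n}]+)""".r

  def fileNames(templateWikitext: String): Seq[String] =
    FileParam.findAllMatchIn(templateWikitext).map(_.group(1).trim).toSeq
}

// MontageParamSketch.fileNames("{{Photomontage|photo1a=Example skyline 2010.jpg|photo2b=Example tower.jpg}}")
//   -> Seq("Example skyline 2010.jpg", "Example tower.jpg"), each of which can then
//      be turned into a wiki-commons:Special:FilePath/* link.
```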
The Lindsay Anderson page, as it was extracted in DBpedia, has the following image-related info on Wikipedia:
http://dbpedia.org/page/Lindsay_Anderson extracts only `dbp:imagesize`. Bug: extract `dbp:image` and turn it into an actionable link `wiki-commons:Special:FilePath/*`. Note: #133 discusses images in the `en` namespace vs the `commons` namespace, so "actionable link" may not always mean "prepend a `commons` namespace"; someone needs to research this.