OneZoom / tree-build

Scripts for assembling the tree, metadata and downstream data products such as popularity and popular images

Harvesting vernacular names and images from wikidata #38


hyanwong commented 9 months ago

The images and vernacular names on OneZoom come from EoL. The harvester that pings the EoL API needs a re-write, but we may also want to get images and vernacular names from wikimedia commons. In https://github.com/OneZoom/tree-build/issues/24#issuecomment-1824455531 we decided that this could be done by looking at images selected for a taxon on wikidata. Many taxa on wikidata have an "exemplar" image, e.g.

https://www.wikidata.org/wiki/Q140 (the lion) has "Lion waiting in Namibia.jpg"

We could probably use the trimmed-down WD json dump to grab the image URL, and save that image (or a trimmed down version) under the name of the wikidata Q-id, e.g. for the lion we would harvest that image under static/FinalOutputs/img/20/140/140.jpg (20 being the numerical ID I arbitrarily chose to represent images from wiki sites; see here). If we use the Q-id for the image name, it means that we can match it up to an OTT later. Note that this does mean that we would need to match up URLs to work out when the image on wikidata has been changed.

We decided that we probably want to take the original image, shrink it to a JPEG of something like 150k max size (or maybe so that the vertical or horizontal dimension is a max of, say, 500px), and we could embed the copyright holder, licence, and original URL in the EXIF or IPTC metadata. We can then decide in a separate step how to crop down the image to something more suitable for thumbnails.
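
Purely as an illustration, a minimal sketch of that shrink-and-tag step, assuming Pillow; the helper name is a placeholder and only the EXIF route is shown (0x8298 is the standard Copyright tag, 0x010E is ImageDescription):

    from PIL import Image

    def shrink_and_tag(src_path, dest_path, copyright_holder, licence, original_url, max_dim=500):
        # Hypothetical helper: resize so the longest side is <= max_dim, then
        # embed attribution details in the EXIF before saving as a JPEG.
        img = Image.open(src_path)
        img.thumbnail((max_dim, max_dim))  # preserves aspect ratio
        exif = Image.Exif()
        exif[0x8298] = f"{copyright_holder} ({licence})"  # EXIF Copyright tag
        exif[0x010E] = f"Original: {original_url}"        # EXIF ImageDescription tag
        img.convert("RGB").save(dest_path, "JPEG", quality=85, exif=exif.tobytes())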

davidebbo commented 9 months ago

Looks good. Quick thought: we might want to somewhat decouple the vernacular-name harvesting from the image harvesting, since images are language-agnostic, while vernacular names are not.

hyanwong commented 9 months ago

Good point, although the vernacular names are all stored in the same JSON file anyway. I linked the two together previously because both were returned by the same EoL API call. If we are using a Wikidata download dump, that linkage no longer matters.

hyanwong commented 9 months ago

Also, I just realised that I might have opened this issue in the wrong repo! But I guess we could template it here and move the harvesting code over, if necessary?

davidebbo commented 9 months ago

I can start here, if only to do initial investigation, and we can then decide where it belongs.

davidebbo commented 9 months ago

Looking at the wikidata dump, one challenge is that I'm not seeing the image licence info in there. The reason might be that the licence info is on wikimedia.org rather than wikidata (e.g. on https://commons.wikimedia.org/wiki/File:Lion_waiting_in_Namibia.jpg). If that's the case, we would not be able to embed the licence in the EXIF based solely on the wikidata dump.

Here is the relevant chunk that has the primary Lion image:

        "P18": [
            {
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P18",
                    "datavalue": {
                        "value": "Lion waiting in Namibia.jpg",
                        "type": "string"
                    },
                    "datatype": "commonsMedia"
                },
                "type": "statement",
                "qualifiers": {
                    "P21": [
                        {
                            "snaktype": "value",
                            "property": "P21",
                            "hash": "0576a008261e5b2544d1ff3328c94bd529379536",
                            "datavalue": {
                                "value": {
                                    "entity-type": "item",
                                    "numeric-id": 44148,
                                    "id": "Q44148"
                                },
                                "type": "wikibase-entityid"
                            },
                            "datatype": "wikibase-item"
                        }
                    ],
                    "P276": [
                        {
                            "snaktype": "value",
                            "property": "P276",
                            "hash": "443b9d8de4c1e34c63279ab2adf551ccbe2f2898",
                            "datavalue": {
                                "value": {
                                    "entity-type": "item",
                                    "numeric-id": 1030,
                                    "id": "Q1030"
                                },
                                "type": "wikibase-entityid"
                            },
                            "datatype": "wikibase-item"
                        }
                    ]
                },
                "qualifiers-order": [
                    "P21",
                    "P276"
                ],
                "id": "q140$5903FDF3-DBBD-4527-A738-450EAEAA45CB",
                "rank": "normal"
            }
        ]
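
For what it's worth, pulling that filename out of a parsed entity record could look something like this minimal sketch (assuming each record is a dict in the usual dump layout, with P18 under "claims"):

    def exemplar_image_filename(entity):
        # Return the P18 (image) filename for a wikidata entity dict, or None.
        for statement in entity.get("claims", {}).get("P18", []):
            snak = statement.get("mainsnak", {})
            if snak.get("snaktype") == "value":
                return snak["datavalue"]["value"]  # e.g. "Lion waiting in Namibia.jpg"
        return None
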
hyanwong commented 9 months ago

The reason might be that the licence info is on wikimedia.org rather than wikidata

Ah, yes indeed. We would need to harvest this information too. But we could probably do that using an API query, as there won't be huge numbers of these images, I think (maybe 50k?). Are the hashes in there hashes of the image (in which case we would know when it changed), or something else?

Even if not, we could probably rely on the fact that the names are "unique enough" so that if the filename changes we re-download the image info.

davidebbo commented 9 months ago

Are the hashes in there hashes of the image (in which case we would know when it changed), or something else?

I don't think they are hashes of the image. Instead, they are part of the 'qualifiers' of the image. P21 is the sex or gender (in this case set to "male organism"), while P276 is the location (in this case Namibia). I don't know exactly what the hashes are, but they somehow relate to these qualifiers, and are likely of no use to us.

Instead, the standard way to get resource hashes is via ETags. E.g. if you request https://upload.wikimedia.org/wikipedia/commons/7/73/Lion_waiting_in_Namibia.jpg, you will see this response header:

Etag: 8f0c085eacb04f598d19798e3622e179

If the resource were to change, so would the etag.
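
A quick way to check that without downloading the whole file is a HEAD request, e.g. this minimal sketch assuming the `requests` library (the User-Agent string is just a placeholder):

    import requests

    url = "https://upload.wikimedia.org/wikipedia/commons/7/73/Lion_waiting_in_Namibia.jpg"
    resp = requests.head(url, headers={"User-Agent": "OneZoom-image-check/0.1"})
    print(resp.headers.get("ETag"))  # changes whenever the underlying file changes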

we could probably do that using an API query, as there won't be huge numbers of these images, I think (maybe 50k?)

Yes, let's explore this approach for the licence.
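
As a starting point, here is a sketch of fetching the licence and original URL for one file via the Commons imageinfo/extmetadata API (assuming `requests`; which extmetadata fields we actually keep, e.g. LicenseShortName and Artist, would need checking against real responses):

    import requests

    def commons_image_info(filename):
        # Query the Commons imageinfo API for one file's URL and licence metadata.
        params = {
            "action": "query",
            "titles": f"File:{filename}",
            "prop": "imageinfo",
            "iiprop": "url|extmetadata",
            "format": "json",
        }
        resp = requests.get("https://commons.wikimedia.org/w/api.php", params=params,
                            headers={"User-Agent": "OneZoom-harvester-sketch/0.1"})
        page = next(iter(resp.json()["query"]["pages"].values()))
        info = page["imageinfo"][0]
        meta = info.get("extmetadata", {})
        return {
            "url": info["url"],
            "licence": meta.get("LicenseShortName", {}).get("value"),
            "artist": meta.get("Artist", {}).get("value"),
        }

    # e.g. commons_image_info("Lion waiting in Namibia.jpg")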

We should be able to use async I/O to avoid blocking on every web request, which will speed things up (basically you request many at once). This goes both for downloading the images themselves and for any API calls. I've done this in C#, but Python now supports it too.
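
A minimal sketch of the idea in Python, assuming asyncio plus the third-party aiohttp package (the real harvester might use a different HTTP library):

    import asyncio
    import aiohttp

    async def fetch(session, url):
        # Download one resource; many of these run concurrently via gather() below.
        async with session.get(url) as resp:
            return url, await resp.read()

    async def fetch_all(urls):
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(*(fetch(session, u) for u in urls))

    # results = asyncio.run(fetch_all(image_urls))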

davidebbo commented 9 months ago

Looking at this some more, I think what we really need is a flavor of EoLQueryPicsNames.py that works with Wikimedia instead of EoL, i.e. it needs to not only download the images, but also update the DB (images_by_ott).

The general flow to get images looks like this (with Lion as an example):

  1. Use our DB to get the wiki ID from the OTT.
  2. From the wiki ID, get the images info. This can be done in 2 ways: via the API, or from the wikidata dump.
  3. Download the actual image and save it, e.g. under static/FinalOutputs/img/20/140/140.jpg (though we might need a better scheme to avoid having one folder per species).

We can also get all images within a category, e.g. https://commons.wikimedia.org/w/api.php?action=query&list=categorymembers&cmtype=file&cmtitle=Category:Panthera%20leo&cmlimit=max&format=json
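
For reference, the same category query via `requests` (a sketch; the parameters mirror the URL above):

    import requests

    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtype": "file",
        "cmtitle": "Category:Panthera leo",
        "cmlimit": "max",
        "format": "json",
    }
    resp = requests.get("https://commons.wikimedia.org/w/api.php", params=params,
                        headers={"User-Agent": "OneZoom-harvester-sketch/0.1"})
    for member in resp.json()["query"]["categorymembers"]:
        print(member["title"])  # prints each "File:..." title in the category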

hyanwong commented 4 months ago

Use our DB to get the wiki ID from the OTT.

Just to note that the API/otts2identifiers endpoint does this, if we don't want to plug directly into the DB:

https://www.onezoom.org/API/otts2identifiers?key=0&otts=770315,244265,542509

However, as we are updating the DB once we harvest, I guess there's no advantage to going via the API like this.

hyanwong commented 4 months ago

From the wiki ID, get the images info. This can be done in 2 ways:

I see what you mean about using the wikipedia API. However, I can think of two reasons to use the dump:

  1. We won't know beforehand if a Wikidata item has an image, so we are potentially checking every species with a Wikidata number in the OneZoom DB. This could run into the millions, which is a lot of API requests to Wikimedia (we could get banned).
  2. We want a semi-continuous algorithm that checks for new images as they arrive. Obviously we can't check all the X million species every day via the API, but we can check against all species whenever we download a new JSON dump file.

There are eol_inspected and eol_updated fields which we could repurpose to indicate when we last checked a species on either EoL or wikipedia. So we could also take an indirect route, and run a script against the WD JSON dump which places a flag in one of those fields to indicate that the species needs updating via a check to the Wikidata API?
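
The "run a script against the dump" part could be something like this minimal sketch (assuming the standard one-entity-per-line bz2 dump layout; the flagging/DB side is left out):

    import bz2
    import json

    def iter_entities(dump_path):
        # Stream entities one at a time rather than loading the whole dump into memory.
        with bz2.open(dump_path, "rt", encoding="utf-8") as f:
            for line in f:
                line = line.rstrip().rstrip(",")
                if line in ("[", "]", ""):
                    continue
                yield json.loads(line)

    # for entity in iter_entities("latest-all.json.bz2"):
    #     ...  # check P18, flag the species for re-checking if needed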

I think this harvester is an important thing to discuss at an online meeting?

davidebbo commented 4 months ago

Yep, I think that makes sense. So if we go off of the dump, then in theory there is no need to run a continuous updater, right? i.e. new dump, new batch of image updates (most will of course be unchanged) and we stay with them till the next dump. That's assuming we can fully rely on wikidata and not have to also use EOL.

Note that we'll still need to call the wikimedia API to get the licence info and image URL. But you make a great point that we can avoid any API calls for wikidata entries that don't have any images (presumably a large percentage).

Sounds good to discuss further in a meeting.

hyanwong commented 4 months ago

if we go off of the dump, then in theory there is no need to run a continuous updater, right?

Exactly.

davidebbo commented 4 months ago

Some stats using latest wiki dump:

So that's about 8%, and at most we need to make ~300K wikimedia API calls to get images for one dump.

In practice, we can probably assume that if we already have an image by the same name as what's in the dump, we can just stay with it and not make any API calls. e.g. images have names like "002 The lion king Snyggve in the Serengeti National Park Photo by Giles Laurent.jpg". I suppose that in theory, the image could be updated while preserving its name, but I suspect that's rare.

With that simplifying assumption, the number of entries that need to be processed when we get a new dump should be quite small: mostly entries whose image filename has changed, plus entries that have gained an image since the last dump.

@hyanwong how many entries have images today based on EoL?

hyanwong commented 4 months ago

Great, really useful stats, thanks @davidebbo. I agree that we can probably assume that if the name is the same, it hasn't changed (I guess we could have a flag to force an individual re-load if necessary).

how many entries have images today based on EoL?

We have 105461 images for 85828 OTTs. So it looks like we could do pretty well if we took the wikipedia images.

Three slight issues:

  1. the wikipedia images don't have a quality rating, but I suspect they are mostly hand-picked, and we can assume the rating is pretty good. We might want to allow the old EoL ones with a rating of e.g. above 45000 (max is 50000, default is 25000) to take priority. We can also change the wiki image score by hand if e.g. we want to percolate specific images up the tree, as representative ones.
  2. The wikipedia images will need to be cropped square. Sometimes this cuts stuff off strangely. I wonder if we want to be able to specify a crop region (e.g. in the database). This would presumably have a default, but we could override it.
  3. We flag the images into 3 overlapping categories: "any", "verified" and "public domain" (a good PD image of the correct species would have all 3 flags set). I guess we can probably treat all the wikipedia images as "verified" and "any", assuming they have the right licence (e.g. we usually only take CC licences, and not e.g. ones that are solely licensed under GPL). We would only flag them as "pd" if they had a PD licence (sketched below). It's worth keeping the old EoL images in the DB, because there will probably be many species that have a PD image from EoL, but where the wiki image is e.g. CC-BY-SA.
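
A rough sketch of that mapping, just to make the rule concrete (the helper and the exact string matching are hypothetical; the real rules would need to follow whatever the DB currently means by these flags):

    def licence_flags(licence_short_name):
        # Hypothetical mapping from a Commons licence string to our image flags.
        licence = (licence_short_name or "").lower()
        flags = set()
        acceptable = licence.startswith("cc") or "public domain" in licence or licence == "pd"
        if acceptable:
            flags.update({"any", "verified"})
        if "public domain" in licence or licence in ("pd", "cc0", "cc0 1.0"):
            flags.add("pd")
        return flags

    # licence_flags("CC BY-SA 4.0") -> {"any", "verified"}
    # licence_flags("CC0")          -> {"any", "verified", "pd"}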

If we do manage to write a harvesting script, it would be helpful to browse the original images and compare to the cropped ones, and also assess the general quality.

davidebbo commented 4 months ago

Yes, the cropping part is quite a challenge, which hit me with extinct creatures (sauropods had very long necks! 😆). I don't think we want to be in the business of manually choosing a crop region for 300K images, so we'll at least want a very good default. And that would require some smart AI tool, as it's a hard problem. I have not researched this, and maybe there is something out there.

How are we doing this with EoL today? Are they offering squared images to begin with?

davidebbo commented 4 months ago

Looking at the OneZoom tree, the current cropping is not always great. E.g. the Clouded Leopard is missing an eye. The Pantanal cat is missing its head entirely. So maybe we're just doing center cropping and it does whatever it does?

I don't know what the sources looked like, because they're no longer available (e.g. http://media.eol.org/content/2014/10/20/03/35031_orig.jpg).

hyanwong commented 4 months ago

Looking at the OneZoom tree, the current cropping is not always great. E.g. the Clouded Leopard is missing an eye. The Pantanal cat is missing its head entirely. So maybe we're just doing center cropping and it does whatever it does?

Yes, there used to be the ability to specify a crop area by hand on EoL, and I did this for ~1000 images where the default crop was bad. They removed this ability in EoL v2, but kept the old crop positioning in the DB for backwards compatibility. The images you identified as poorly cropped are likely to be among the majority of images that I missed.

In the long term, this is definitely an AI job, although e.g. the differences between plants and animals will probably make this pretty hard to generalise.

hyanwong commented 4 months ago

We simply tried to tackle the odd problem by hand as it occurred, but I haven't been keeping up with this. As well as the two that you pointed out, the lion is pretty poor, and we should probably fix this by hand, since it's such an iconic species.

As you say, it's impossible to do this by hand for 300k images, but it should be possible using AI routines, although I don't know of any. AI routines might help with the rating too. If we do find a workable solution, especially if it is open source, I'm sure that EoL would be interested, and might even provide funding to someone to implement it across their site (I helped write the Wikimedia harvesting code for EoL at one point, purely voluntarily).

davidebbo commented 4 months ago

If we do find a workable solution, especially if it is open source, I'm sure that EoL would be interested, and might even provide funding to someone to implement it across their site

Yes, although I feel that somewhat goes against the direction of this discussion. With only finite time, if we decide to invest in getting images from wikimedia (with the assumption that they are better), then it may not make sense to also invest in EoL.

In a perfect world, EoL themselves would rebase their image story on wikimedia (or join them in some way), to avoid having to solve the same problems twice.

hyanwong commented 4 months ago

Yes, sorry, I didn't make that clear. I'm not saying that we should be involved in implementing it for EoL, but if we did find a workable solution for cropping (and rating) wikimedia images, I'm sure the developers at EoL would want to know. What they then did with that would be up to them. It's worth giving them some tips as to what works, so we are acting in a collaborative manner with the major players in this space.

hyanwong commented 4 months ago

I agree, by the way, that a re-write of the harvester should be focussed on getting images from wikimedia commons rather than EoL.

davidebbo commented 4 months ago

Got it, thanks for clarifying.

hyanwong commented 4 months ago

Just to note here that we decided on Slack that cropping images to 300x300px would be better than the current default of 150x150px. @jrosindell is also keen to store the original (uncropped) image if possible. I suppose we could do this and simply not serve that image on the normal OneZoom page (we have a reasonable amount of disk space on our server: about 7.6T available!).

We also think that getting decent crops is probably a matter of waiting for the AI to catch up, but we might as well store the crop positions in the database before then, so that e.g. we can adjust them by hand.
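
A sketch of what applying a stored crop position might look like, assuming Pillow and assuming the region is stored as left/top/width/height pixels of the original image (falls back to a centred square crop when no region is stored):

    from PIL import Image

    def square_thumbnail(src_path, dest_path, crop=None, size=300):
        img = Image.open(src_path)
        if crop is None:
            # Default: centred square crop of the largest possible size.
            side = min(img.size)
            crop = ((img.width - side) // 2, (img.height - side) // 2, side, side)
        left, top, width, height = crop
        img = img.crop((left, top, left + width, top + height))
        img = img.resize((size, size))
        img.convert("RGB").save(dest_path, "JPEG", quality=85)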

davidebbo commented 4 months ago

I just played around with the newest beta version of the Azure Image Analysis service, and it actually did a rather decent job with square cropping. It's not perfect, but it's the best I've found yet. It did well on mammals, but did poorly on a shark image I tried.

If you want me to try it on a specific image, just send me a link!

davidebbo commented 4 months ago

One nice thing about this Cloud API is that you don't need to upload your pictures, nor download the results. Instead, you just point it at the image URL and it returns the suggested crop region as coordinates.

hyanwong commented 4 months ago

Just a follow-up on image naming conventions:

  • Download the actual image and save it, e.g. under static/FinalOutputs/img/20/140/140.jpg (though we might need a better scheme to avoid having one folder per species).

The folder within which an image such as 140.jpg lives is simply based on the last 3 digits of the name. So if we used the QID for the image name, then e.g. the blue whale, Balaenoptera musculus (Q42196), would be saved under static/FinalOutputs/img/20/196/42196.jpg. Since most QIDs are more than 3 digits, it means we don't create one folder per species.

The rationale for the subfolders was to make sure that we didn't have a single folder with 100,000 images in it, which slows down filesystem access. I picked the last 3 digits as these are more likely to be random, and so I hoped that we would have roughly the same number of images in each of the 1000 three-digit subfolders. I think it's working OK so far, so I don't see a compelling reason to change this system.
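
That scheme, written out as a small sketch (the helper name is just illustrative):

    import os

    def image_path(base_dir, src, image_id):
        # e.g. image_path("static/FinalOutputs/img", 20, 42196)
        #      -> "static/FinalOutputs/img/20/196/42196.jpg"
        subfolder = str(image_id)[-3:]  # last 3 digits (or fewer, e.g. "140" for 140)
        return os.path.join(base_dir, str(src), subfolder, str(image_id) + ".jpg")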

It's helpful to name the images using an integer from -2,147,483,648 to 2,147,483,647 (a 4-byte INT in MySQL). There are currently 110 million QIDs, so if we use the QID as the identifier, we can deal with ~20 times more wikidata items before needing to change to an SQL BIGINT in the database.

Alternatively, we could simply allocate a new (auto-incremented) number for the image name every time we harvested a new image. This would mean we could keep the older images available, rather than overwriting the blue whale image every time it changes on wikidata. I think the ability to tie the image number to the WD item is quite valuable, however.

For manually harvested images, we could conceivably have two images for a single wikidata item (a public domain and a non-public domain image), so perhaps we would want to use the auto-incremented number for those, or e.g. use a negative number for the equivalent PD images (urgh!)

davidebbo commented 4 months ago

Yes, now I recall that with my early 'extinct species' experiment, I struggled with the fact that wikimedia images don't have IDs per se. They just have their file names (e.g. "Bluewhale877.jpg"), which are unique within wikimedia. But those file names don't translate well into our integer source ID column. So I ended up hashing them, which can then run into clashes.

I like the simplicity of using the wikidata ID as the image id, with the obvious quirk that there can be multiple images for one WD ID, and that they can change over time. In my 'extinct species' case, I was actually directly using wikipedia and not wikidata, so it's not clear that I could always associate a WD ID. And side note: wikipedia pages also have an ID, completely distinct from the WD ID (e.g. see https://en.wikipedia.org/w/index.php?title=Blue_whale&action=info).

And if we end up using generated IDs as well in some scenarios, we'll need to allocate a different 'src' for them (e.g. 20 vs 21), since the field semantics will be different.