OneZoom / tree-build

Scripts for assembling the tree, metadata and downstream data products such as popularity and popular images
MIT License

Popularity is broken when the wikidata entry points to a wikipedia redirect #59

Open davidebbo opened 4 months ago

davidebbo commented 4 months ago

[This is a spin-off issue from #49]

Another weird case: Leopardus pajeros is the Pampas cat.

The problem is that the Wikidata entry's English sitelink points to Leopardus_pajeros (which is a redirect) instead of the main Pampas cat page: "enwiki": { "title": "Leopardus pajeros" }. So we end up looking up Leopardus pajeros in the Page Count file and finding nothing, because all the hits are recorded under the Pampas cat entry.
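To make the failure concrete, here is a minimal sketch of the lookup, assuming page views are held in a dict keyed by article title. The dict, the view count, and the function name are invented for illustration; this is not the actual CSV_base_table_creator code.

```python
# Hypothetical page-view totals keyed by article title (the number is invented).
pageviews = {"Pampas cat": 12345}

# Sitelink as it appears in the Wikidata entry for this taxon.
sitelinks = {"enwiki": {"title": "Leopardus pajeros"}}


def lookup_pageviews(sitelinks, pageviews, wiki="enwiki"):
    """Return the view count for the sitelinked title, or None if absent."""
    title = sitelinks.get(wiki, {}).get("title")
    return pageviews.get(title)


# The views are recorded under "Pampas cat", so the redirect title finds nothing.
print(lookup_pageviews(sitelinks, pageviews))
```

The taxon ends up with no popularity signal even though the article it redirects to has plenty of traffic.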

davidebbo commented 4 months ago

[Copied from @hyanwong's comment in #49]

has the english link linking to Leopardus_pajeros (which is a redirect)

So we can either change the link in Wikidata and/or somehow keep redirects in the wikipedia dump, and follow those?

I seem to remember that I did do a bit of redirect-following in my original wikimedia parser, but I think I didn't think about that case on wikipedia.

Yan

davidebbo commented 4 months ago

change the link in Wikidata

Interestingly, if you look at the links on https://www.wikidata.org/wiki/Q311417, next to 'en', there is a little arrow symbol. If you hover over it, it says "intentional sitelink to redirect". I'm not sure what the reasoning is, as in most cases, the link is to the main page with no redirects.

somehow keep redirects in the wikipedia dump, and follow those

The wikiDATA dump doesn't know about redirects afaik. It just has wikiPEDIA links, which sometimes happen to redirect. So I'm not sure if we can get the redirect information, other than by actually making http requests to it, which is too painful/slow.

davidebbo commented 4 months ago

[Copied from @hyanwong's comment in #49]

The wikiDATA dump doesn't know about redirects afaik. It just has wikiPEDIA links, which sometimes happen to redirect. So I'm not sure if we can get the redirect information, other than by actually making http requests to it, which is too painful/slow.

Indeed, but perhaps the enwiki-latest-page.sql.gz file contains information on redirects?

davidebbo commented 4 months ago

Indeed, but perhaps the enwiki-latest-page.sql.gz file contains information on redirects?

Yes, it has a page_is_redirect Boolean field (https://www.mediawiki.org/wiki/Manual:Page_table#page_is_redirect). I don't think it gives the redirection target, but at least if we had to make a request, that would greatly reduce the number of cases where it's needed.

hyanwong commented 4 months ago

that would greatly reduce the number of cases where it's needed.

Good point. This makes the logic a bit more convoluted, doesn't it, but I think it is probably worth doing. As a half-way house we could check in the SQL dump but rather than locate the proper pagename via the wikipedia API, we could simply emit a warning that the name is a redirect, and won't be used for popularity.
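A rough sketch of that half-way house, assuming the titles flagged with page_is_redirect = 1 have already been collected into a set. The set contents, the logger name, and the function name are hypothetical.

```python
import logging

logger = logging.getLogger("popularity_sketch")


def title_for_popularity(title, redirect_titles):
    """Return the title to use for popularity, or None if it is a redirect."""
    if title in redirect_titles:
        # No attempt is made to resolve the target; the entry is just skipped.
        logger.warning("'%s' is a redirect; skipping it for popularity", title)
        return None
    return title


# Hypothetical set of titles whose page_is_redirect flag is 1.
redirect_titles = {"Leopardus pajeros"}
```

Calling title_for_popularity("Leopardus pajeros", redirect_titles) logs a warning and returns None, while non-redirect titles pass through unchanged.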

davidebbo commented 4 months ago

This makes the logic a bit more convoluted, doesn't it, but I think it is probably worth doing.

Yes, TBH, I don't like the idea of making http requests during processing. Today, everything happens offline, which is nice.

What we could do is have a separate process which looks for all such entries and figures out the redirects, then saves the results in a mapping file that we commit. Presumably, it wouldn't change that often. Then CSV_base_table_creator can just rely on this file to go to the correct entry when processing both page views and page sizes.
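A minimal sketch of how such a mapping file might be consumed, assuming a two-column CSV of redirect title and target title. The file format, its contents, and both function names are assumptions made for illustration; nothing here has been decided.

```python
import csv
import io


def load_redirect_map(lines):
    """Read 'redirect_title,target_title' rows into a dict."""
    return {row[0]: row[1] for row in csv.reader(lines) if len(row) >= 2}


def resolve_title(title, redirect_map):
    """Follow a single redirect hop if the title is in the mapping."""
    return redirect_map.get(title, title)


# Stand-in for the committed mapping file (contents are illustrative).
mapping_file = io.StringIO("Leopardus pajeros,Pampas cat\n")
redirect_map = load_redirect_map(mapping_file)

print(resolve_title("Leopardus pajeros", redirect_map))
```

Because the lookup is a plain dict access, the main processing stays offline; only the separate process that builds the file would need network access.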

As a half-way house we could check in the SQL dump but rather than locate the proper pagename via the wikipedia API, we could simply emit a warning that the name is a redirect, and won't be used for popularity.

Yes, we could start there. It would be interesting to see how common an issue it is today.

hyanwong commented 4 months ago

Today, everything happens offline, which is nice.

Yes, I agree this is much nicer.

davidebbo commented 4 months ago

I made a change to include the page_is_redirect column in the filtered SQL dump.

There are 10334 entries in our filtered dump that have this set to 1. But note that this covers all entries in our filtered wikidata dump, which has all taxa & vernaculars. So in practice, it's likely a much smaller set of redirects that actually affects us. We can get better stats once we work on this in the popularity logic in CSV_base_table_creator.
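As a sketch of how the narrower figure could be obtained, assume the filtered dump has been reduced to (title, page_is_redirect) pairs and that the set of titles actually used by the tree is available. Both the row shape and the sample data are hypothetical.

```python
def count_redirects(rows, titles_in_tree=None):
    """Count rows flagged as redirects, optionally restricted to given titles."""
    return sum(
        1
        for title, is_redirect in rows
        if is_redirect and (titles_in_tree is None or title in titles_in_tree)
    )


# Illustrative rows: (title, page_is_redirect).
rows = [
    ("Leopardus pajeros", 1),
    ("Pampas cat", 0),
    ("Some other redirect", 1),
]

# Titles that the popularity logic actually looks up (illustrative).
titles_in_tree = {"Leopardus pajeros", "Pampas cat"}

print(count_redirects(rows))
print(count_redirects(rows, titles_in_tree))
```

The first call corresponds to the raw count over the whole filtered dump; the second counts only the redirects that would affect popularity.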

hyanwong commented 4 months ago

Great, thanks @davidebbo. 10334 sounds quite a lot, but as you say, only a subset will be relevant to us, fortunately.