Open davidebbo opened 4 months ago
[Copied from @hyanwong's comment in #49]
has the english link linking to Leopardus_pajeros (which is a redirect)
So we can either change the link in Wikidata and/or somehow keep redirects in the wikipedia dump, and follow those?
I seem to remember that I did do a bit of redirect-following in my original wikimedia parser, but I think I didn't think about that case on wikipedia.
Yan
change the link in Wikidata
Interestingly, if you look at the links on https://www.wikidata.org/wiki/Q311417, next to 'en', there is a little arrow symbol. If you hover over it, it says "intentional sitelink to redirect". I'm not sure what the reasoning is, as in most cases, the link is to the main page with no redirects.
somehow keep redirects in the wikipedia dump, and follow those
The wikiDATA dump doesn't know about redirects afaik. It just has wikiPEDIA links, which sometimes happen to redirect. So I'm not sure if we can get the redirect information, other than by actually making http requests to it, which is too painful/slow.
[Copied from @hyanwong's comment in #49]
The wikiDATA dump doesn't know about redirects afaik. It just has wikiPEDIA links, which sometimes happen to redirect. So I'm not sure if we can get the redirect information, other than by actually making http requests to it, which is too painful/slow.
Indeed, but perhaps the enwiki-latest-page.sql.gz file contains information on redirects?
Indeed, but perhaps the enwiki-latest-page.sql.gz file contains information on redirects?
Yes, it has a page_is_redirect Boolean field (https://www.mediawiki.org/wiki/Manual:Page_table#page_is_redirect). I don't think it gives the redirection target, but at least if we had to make a request, that would greatly reduce the number of cases where it's needed.
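For illustration, here is a minimal sketch of how the page_is_redirect flag could be used to collect redirect titles. The PageRow shape and the redirect_titles helper are hypothetical, not the project's actual code; the real page table has more columns, and this assumes the rows have already been parsed out of the SQL dump.

```python
from typing import Iterable, NamedTuple, Set


class PageRow(NamedTuple):
    # Hypothetical, simplified row: only the two columns this sketch needs.
    page_title: str        # stored with underscores instead of spaces
    page_is_redirect: int  # 1 if the page is a redirect, otherwise 0


def redirect_titles(rows: Iterable[PageRow]) -> Set[str]:
    """Return the titles of all rows flagged as redirects."""
    return {row.page_title for row in rows if row.page_is_redirect == 1}


# Example data based on the case discussed in this issue.
rows = [
    PageRow("Leopardus_pajeros", 1),
    PageRow("Pampas_cat", 0),
]
redirects = redirect_titles(rows)
```

This only tells us that a title is a redirect, not where it points, which matches the limitation mentioned above.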
that would greatly reduce the number of cases where it's needed.
Good point. This makes the logic a bit more convoluted, doesn't it, but I think it is probably worth doing. As a half-way house we could check in the SQL dump but rather than locate the proper pagename via the wikipedia API, we could simply emit a warning that the name is a redirect, and won't be used for popularity.
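A rough sketch of that half-way house, assuming we already have a set of redirect titles from the SQL dump and a dict of page view counts. The function name, logger name, and data shapes are made up for illustration and are not taken from CSV_base_table_creator.

```python
import logging
from typing import Dict, Optional, Set

logger = logging.getLogger("popularity")  # hypothetical logger name


def pageviews_for(
    title: str, redirects: Set[str], page_counts: Dict[str, int]
) -> Optional[int]:
    """Look up page views for a sitelink title, skipping redirects.

    Assumes both `redirects` and `page_counts` are keyed by the
    underscore form of the title.
    """
    key = title.replace(" ", "_")
    if key in redirects:
        # No API call: just warn and leave the entry out of popularity.
        logger.warning("%s is a redirect and won't be used for popularity", title)
        return None
    return page_counts.get(key)
```

Everything stays offline here; the cost is that redirected entries simply get no popularity value until the target is resolved some other way.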
This makes the logic a bit more convoluted, doesn't it, but I think it is probably worth doing.
Yes, TBH, I don't like the idea of making http requests during processing. Today, everything happens offline, which is nice.
What we could do is have a separate process which looks for all such entries and figures out the redirects, then saves the results in a mapping file that we commit. Presumably, it wouldn't change that often. Then CSV_base_table_creator can just rely on this file to go to the correct entry when processing both page views and page sizes.
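To make the mapping-file idea concrete, here is a small sketch. It assumes the committed file is a JSON object mapping a redirecting title to its target title; that format, and both function names, are assumptions for illustration only.

```python
import json
from typing import Dict


def parse_redirect_map(text: str) -> Dict[str, str]:
    """Parse the (hypothetical) committed mapping file contents."""
    mapping = json.loads(text)
    if not isinstance(mapping, dict):
        raise ValueError("redirect map must be a JSON object")
    return mapping


def resolve_title(title: str, redirect_map: Dict[str, str]) -> str:
    """Return the redirect target if one is known, else the title unchanged."""
    return redirect_map.get(title, title)


# Example contents of the mapping file, using the case from this issue.
example_file = '{"Leopardus pajeros": "Pampas cat"}'
redirect_map = parse_redirect_map(example_file)
resolved = resolve_title("Leopardus pajeros", redirect_map)
```

The same resolved title could then be used for both the page view and page size lookups, so the two stay consistent.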
As a half-way house we could check in the SQL dump but rather than locate the proper pagename via the wikipedia API, we could simply emit a warning that the name is a redirect, and won't be used for popularity.
Yes, we could start there. Would be interesting to see how common of an issue it is today.
Today, everything happens offline, which is nice.
Yes, I agree this is much nicer.
I made a change to include the page_is_redirect column in the filtered SQL dump.
There are 10334 entries in our filtered dump that have this set to 1. But note that this includes all entries in our filtered wikidata dump, which has all taxons & vernaculars. So in practice, it's likely a much smaller set of redirects that actually affect us. We can get better stats once we work on this in the popularity logic in CSV_base_table_creator.
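As a sketch of how those better stats might be gathered: intersect the sitelink titles we actually use with the redirect titles from the filtered dump. The helper name and inputs are hypothetical; it assumes sitelink titles use spaces while dump titles use underscores.

```python
from typing import Iterable, Set


def relevant_redirects(sitelink_titles: Iterable[str], redirects: Set[str]) -> Set[str]:
    """Return the sitelink titles that point at a redirect page."""
    return {t for t in sitelink_titles if t.replace(" ", "_") in redirects}


# Only redirects that match a sitelink we use are counted.
used_titles = ["Leopardus pajeros", "Pampas cat"]
all_redirects = {"Leopardus_pajeros", "Some_other_redirect"}
affected = relevant_redirects(used_titles, all_redirects)
```

The size of the resulting set would give the number of redirects that actually matter for popularity, as opposed to the full count in the dump.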
Great, thanks @davidebbo. 10334 sounds quite a lot, but as you say, only a subset will be relevant to us, fortunately.
[This is a spin-off issue from #49]
Another weird case: Leopardus pajeros is the Pampas cat.
Problem is that the wikidata entry has the english link linking to Leopardus_pajeros (which is a redirect), instead of to the main Pampas cat page:
"enwiki": { "title": "Leopardus pajeros" }
So we end up looking up Leopardus pajeros in the Page Count file, and not finding anything, because all the hits are with the Pampas cat entry.