brawer / wikidata-qrank

Ranking signals for Wikidata
https://qrank.wmcloud.org
MIT License

Speed up linktarget processing #45

Open brawer opened 3 months ago

brawer commented 3 months ago

For some large wikis, including enwiki, commonswiki and wikidatawiki, it currently takes several days to process the linktarget table. This is a problem: if the daily pipeline job hasn’t finished within 24 hours, Toolforge will kill and restart the process. (That is a good thing; we do want a watchdog in place in case the pipeline gets stuck.)

To make this go faster, we should use a different join order. This will save time because pages without Wikidata IDs will get dropped earlier than they are now. Also, we're currently re-sorting the contents of linktarget, even though the SQL dump is already sorted by primary key.
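As a rough illustration of the join-order idea, here is a minimal Go sketch. It is not the pipeline's actual code: the `PageLink` type, the `LinkedTitles` function and the in-memory maps are hypothetical stand-ins for the real tables, which are too large to hold in memory this way. The point is only that links from pages without a Wikidata ID are discarded before the linktarget lookup, so the more expensive step sees fewer rows.

```go
package main

// PageLink is a hypothetical row type for this sketch: one link from a
// source page to a row of the linktarget table.
type PageLink struct {
	FromPage   int64 // page ID of the linking page
	LinkTarget int64 // primary key of the row in linktarget
}

// LinkedTitles returns, for each Wikidata ID, the titles that its page links to.
// pageItems maps page IDs to Wikidata IDs; targets maps linktarget IDs to titles.
// Both maps are assumptions made for illustration only.
func LinkedTitles(links []PageLink, pageItems map[int64]string, targets map[int64]string) map[string][]string {
	out := make(map[string][]string)
	for _, l := range links {
		// Step 1: drop links from pages that have no Wikidata ID.
		// Doing this first keeps such rows out of the linktarget join.
		item, ok := pageItems[l.FromPage]
		if !ok {
			continue
		}
		// Step 2: only the surviving links are joined against linktarget.
		title, ok := targets[l.LinkTarget]
		if !ok {
			continue
		}
		out[item] = append(out[item], title)
	}
	return out
}
```

With three links of which one comes from a page without a Wikidata ID, only two of them would reach the linktarget lookup.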

brawer commented 3 months ago

> the SQL dump is already sorted by primary key

Actually, that may not always be the case. The SQL dumps are mostly sorted, but sometimes there are exceptions.
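One possible way to handle mostly-sorted input is to check the order in a single pass and fall back to sorting only when an out-of-order key shows up. The following Go sketch assumes the keys fit in memory as a slice of `int64`; `EnsureSorted` is a hypothetical helper name, not something that exists in the pipeline today.

```go
package main

import "sort"

// EnsureSorted returns ids in ascending order and reports whether a sort
// was actually needed. Input that is already sorted is returned unchanged
// after one linear scan; otherwise a sorted copy is returned.
func EnsureSorted(ids []int64) ([]int64, bool) {
	for i := 1; i < len(ids); i++ {
		if ids[i-1] > ids[i] {
			// Found an exception to the expected order: sort a copy
			// so that the caller's slice is left untouched.
			sorted := append([]int64(nil), ids...)
			sort.Slice(sorted, func(a, b int) bool { return sorted[a] < sorted[b] })
			return sorted, true
		}
	}
	return ids, false
}
```

For dumps that really are sorted, this costs one comparison per row; the full sort is only paid for the dumps with exceptions.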