brawer / wikidata-qrank

Ranking signals for Wikidata
https://qrank.wmcloud.org
MIT License

Compute PageRank #10

Open brawer opened 1 year ago

brawer commented 1 year ago

It would be nice if QRank were to run the PageRank algorithm on the link graph in Wikipedia and sister projects. Something like https://github.com/athalhammer/danker but less resource-hungry, so it can work on all language editions together, be put in production, and run (weekly or at least bi-weekly) in the Wikimedia cloud. The results should be made available for public download, similar to the existing QRank signal.
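For readers unfamiliar with the algorithm, here is a minimal sketch of the textbook power-iteration form of PageRank. It is not QRank's or danker's implementation. The function name `pagerank`, the toy graph, the damping factor of 0.85, and the fixed iteration count are all illustrative assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {node: [outgoing targets]}."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for source, targets in links.items():
            if targets:
                # Each page passes an equal share of its rank to every target.
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy graph keyed by Wikidata-style IDs (not real link data).
toy_links = {
    "Q1": ["Q2", "Q3"],
    "Q2": ["Q3"],
    "Q3": ["Q1"],
    "Q4": ["Q3"],
}
scores = pagerank(toy_links)
```

This version holds the whole graph in memory, which is fine for a toy example but is exactly what a production run over all language editions would need to avoid.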

Personally, I’m actually not super convinced that PageRank provides a better ranking signal than the existing QRank. Basically, PageRank is a mathematical model that tries to predict hypothetical user behavior (by analyzing the structure of the link graph), whereas QRank measures what real users have done in practice (by analyzing Wikimedia logs). However, the choice of ranking algorithm should be left to users: if someone wants PageRank, they should be able to get it, easily and reliably. And there’s definitely value in being able to combine multiple ranking signals.

athalhammer commented 1 year ago

Hi @brawer, seeing this just now. Hm, I designed danker to run with 8 GB of main memory on a 2-4 core CPU. The problem is disk space: until the link graph is established, it uses between 300 and 400 GB (later, the unzipped graph is around 100 GB). That is the only resource-intensive part, but I could run this on a Raspberry Pi with 8 GB of RAM and a connected USB disk.
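For illustration only (this is not danker's actual code): one way a PageRank computation can stay within a small memory budget is to keep only the per-page scores in RAM and re-read the edge list from disk on every iteration. The sketch below assumes a tab-separated edge file that is already sorted by source page. The file name, the function names, and the toy edges are hypothetical.

```python
import itertools
import os
import tempfile


def read_edges(path):
    """Yield (source, target) pairs from a tab-separated edge file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            source, target = line.rstrip("\n").split("\t")
            yield source, target


def streaming_pagerank(path, damping=0.85, iterations=50):
    """PageRank that re-reads the edge file on every iteration.

    Assumes the file is sorted by source, so that each page's
    out-links sit on consecutive lines.
    """
    # First pass: collect node IDs only; the edges stay on disk.
    nodes = set()
    for source, target in read_edges(path):
        nodes.add(source)
        nodes.add(target)
    n = len(nodes)
    rank = dict.fromkeys(nodes, 1.0 / n)

    for _ in range(iterations):
        new_rank = dict.fromkeys(nodes, 0.0)
        linked_mass = 0.0  # total rank of pages that have out-links
        by_source = itertools.groupby(read_edges(path), key=lambda e: e[0])
        for source, group in by_source:
            targets = [target for _, target in group]
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
            linked_mass += rank[source]
        # Teleport term plus the rank of dangling pages, spread evenly.
        base = (1.0 - damping) / n + damping * (1.0 - linked_mass) / n
        rank = {node: value + base for node, value in new_rank.items()}
    return rank


# Hypothetical toy edge list, already sorted by source.
toy_edges = [("Q1", "Q2"), ("Q1", "Q3"), ("Q2", "Q3"), ("Q3", "Q1"), ("Q4", "Q3")]
with tempfile.TemporaryDirectory() as tmp:
    edge_path = os.path.join(tmp, "links.tsv")
    with open(edge_path, "w", encoding="utf-8") as f:
        for source, target in toy_edges:
            f.write(f"{source}\t{target}\n")
    disk_scores = streaming_pagerank(edge_path)
```

With this layout, memory use grows with the number of pages rather than the number of links, at the price of one sequential pass over the file per iteration.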

Danker can, in fact, run on all language editions together (see here for the link graph), and I'm also running the script at irregular intervals and offering the scores here in multiple formats. The HDT format is particularly neat: after downloading just some 200 MB, you can run it in docker-compose and issue federated queries against the Wikidata endpoint.

Update 2023-NOV: I bought a Raspberry Pi 4B (8 GB) to save on cloud costs. So far, computation runs smoothly.

brawer commented 3 weeks ago

Next step: https://github.com/brawer/wikidata-qrank/issues/43

brawer commented 2 weeks ago

While #43 seems to compute the correct results, the approach taken turned out to be too slow, especially for large wikis such as enwiki, dewiki, commonswiki and wikidatawiki. See https://github.com/brawer/wikidata-qrank/issues/45