brawer / wikidata-qrank

Ranking signals for Wikidata
https://qrank.wmcloud.org
MIT License

Missing page_entities for wikidatawiki #35

Closed brawer closed 6 months ago

brawer commented 6 months ago

The file wikidatawiki-20240501-page_entities.zst currently has 6448 entries; expected 109 million.

Our page-id → wikidata-id mappings are computed by parsing the page_props dumps. However, unlike for the other wiki projects, wikidatawiki.page_props only contains the Wikidata IDs of internal maintenance pages such as Q5649951, not of content pages such as Q72.

To produce correct output for wikidatawiki, we should go over its pages table, match each page title against a regular expression to check whether it looks like a Wikidata ID, and then inject the resulting mapping into our data.
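A rough sketch of the title-matching step, in Python for brevity. This is not the code in the repository. The pattern `Q[1-9][0-9]*`, the function names, and the `(page_id, title)` input shape are all assumptions made for illustration, and the page IDs in the usage example are made up.

```python
import re

# Assumption: an item ID is "Q" followed by a positive integer with no
# leading zeros. Properties, lexemes and other ID types are ignored here.
_ENTITY_ID = re.compile(r"Q[1-9][0-9]*")


def entity_id_from_title(title):
    """Return the title if the whole string looks like an item ID, else None."""
    return title if _ENTITY_ID.fullmatch(title) else None


def page_entities(pages):
    """Yield (page_id, entity_id) for each page whose title looks like an ID.

    `pages` is any iterable of (page_id, title) pairs, for example rows
    read from a dump of the pages table (hypothetical input shape).
    """
    for page_id, title in pages:
        entity = entity_id_from_title(title)
        if entity is not None:
            yield page_id, entity


# Usage with made-up page IDs: only the row titled like an item ID survives.
rows = [(101, "Q72"), (102, "Main_Page"), (103, "Q072")]
mappings = list(page_entities(rows))
```

Matching the full title rather than searching for a substring keeps pages like "Q72/Archive" out of the mapping. A real implementation would probably also want to restrict the scan to the namespace that holds items.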

brawer commented 6 months ago

Re-generated wikidatawiki-20240501-page_entities.zst with a pipeline built from the current head revision, 30c0b278d509963fe38cf179e20515fa5bf05172. After the recent code changes, the page_entities file for wikidatawiki now contains 112.6 million entries. From spot-checking a handful of entries, the file contents look good. The compressed file size is 272 MB (uncompressed, it would be 2.15 GB). The build took 15 minutes and 49 seconds, using about 1.3 CPU cores and 400 MB RAM.

$ s3cmd rm s3://qrank/page_entities/wikidatawiki-20240501-page_entities.zst
$ ssh login.toolforge.org
$ become qrank
$ toolforge build start https://github.com/brawer/wikidata-qrank
$ toolforge jobs run --command qrank-builder --image tool-qrank/tool-qrank:latest --mount=all --mem=3Gi --cpu=3 qrank-builder-test