Closed by brawer 6 months ago
Re-generated wikidatawiki-20240501-page_entities.zst with a pipeline built from current head revision at 30c0b278d509963fe38cf179e20515fa5bf05172. After the recent code changes, the page_entities file for wikidatawiki now contains 112.6 million entries. From spot-checking a handful of entries, the file contents look good. The compressed file size is 272 MB (uncompressed, it would be 2.15 GB). The build took 15 minutes and 49 seconds, using about 1.3 CPU cores and 400 MB RAM.
$ s3cmd rm s3://qrank/page_entities/wikidatawiki-20240501-page_entities.zst
$ ssh login.toolforge.org
$ become qrank
$ toolforge build start https://github.com/brawer/wikidata-qrank
$ toolforge jobs run --command qrank-builder --image tool-qrank/tool-qrank:latest --mount=all --mem=3Gi --cpu=3 qrank-builder-test
The file wikidatawiki-20240501-page_entities.zst currently has 6448 entries; expected 109 million.

Our page-id → wikidata-id mappings are computed by parsing the page_props dumps. However, unlike the other wiki projects, wikidatawiki.page_props only contains the Wikidata IDs for internal maintenance pages such as Q5649951, not for content pages such as Q72.

To produce correct output for wikidatawiki, we should go over its pages table, match all page titles against a regular expression to check whether each title looks like a Wikidata ID, and then inject the mapping into our data.