Closed brawer closed 4 months ago
Not sure why zgrep
would report wrong numbers, but it does. The actual truth is in the database, so let's ask MariaDB directly:
$ ssh login.toolforge.org
sascha@tools-bastion-13:~$ sql rmwiki
MariaDB [rmwiki_p]> SELECT COUNT(*) FROM page_props WHERE pp_propname = 'wikibase_item';
+----------+
| COUNT(*) |
+----------+
| 5221 |
+----------+
1 row in set (0.005 sec)
On May 7, 2024, the reported counts were 9650538 for enwiki
, 3517155 for dewiki
, and 5221 for rmwiki
. When going over Mediawiki dumps from May 1, 2024, we should get numbers in roughly that area.
After the code changes in merge request #31, the numbers for the dumps of May 1, 2024, are:
Wikimedia dump | MariaDB (May 7) | SQLReader (May 1) | Delta |
---|---|---|---|
enwiki-20240501-page_props.sql.gz | 9650538 | 9647060 | 3478 |
dewiki-20240501-page_props.sql.gz | 3517155 | 3515356 | 1799 |
rmwiki-20240501-page_props.sql.gz | 5221 | 5221 | 0 |
The deltas seem reasonable. The dumps are almost a week older than the live database, so a small amount of drift is to be expected.
For the record, I’ve deleted all cached page_entities
files with the following command. The next pipeline run will recompute the data files. We’re not using these files yet for computing QRank, but that way we won’t have polluted data when we do the switch-over to the new pipeline.
$ s3cmd rm s3://qrank/page_entities/*
The current SQLReader, which is intended for https://github.com/brawer/wikidata-qrank/issues/23, appears to skip entries.
For example, when processing the dump file
enwiki-20240501-page_props.sql.gz
, it currently finds 4530 pages whosepp_propname
iswikibase_item
. That number should be much higher:enwiki-20240501-page_props.sql.gz
9579,Q342573
9580,Q166546
9581,Q8889
dewiki-20240501-page_props.sql.gz
15705,Q194070
15707,Q101322
15708,Q11169
rmwiki-20240501-page_props.sql.gz
With
enwiki
anddewiki
, the Wikimedia dump contains multipleINSERT
statements. Our current SQLReader implementation does not handle this, which explains the delta.With
rmwiki
, our implementation returns more entries than zgrep; the first three added items are14571,Q1534701
,14572,Q22593703
and14573,Q22699664
. This might be due to unescaped zero bytes in the Wikimedia dump, which would be a bug on the Wikimedia side, but we need to double-check this.However, for our unit tests the implementation seems to work just fine, returning all expected data.