brawer / wikidata-qrank

Ranking signals for Wikidata
https://qrank.wmcloud.org
MIT License
60 stars 5 forks source link

SQLReader seems to skip data #30

Closed brawer closed 4 months ago

brawer commented 4 months ago

The current SQLReader, which is intended for https://github.com/brawer/wikidata-qrank/issues/23, appears to skip entries.

For example, when processing the dump file enwiki-20240501-page_props.sql.gz, it currently finds 4530 pages whose pp_propname is wikibase_item. That number should be much higher:

$ wget https://dumps.wikimedia.org/enwiki/20240501/enwiki-20240501-page_props.sql.gz
$ zgrep -e "[0-9]*,'wikibase_item','Q[0-9]*'" -o enwiki-20240501-page_props.sql.gz | wc -l
  782730
Wikimedia dump Expected Observed First delta (page-id, wikidata-id)
enwiki-20240501-page_props.sql.gz 782730 4530 9579,Q342573 9580,Q166546 9581,Q8889
dewiki-20240501-page_props.sql.gz 378340 9026 15705,Q194070 15707,Q101322 15708,Q11169
rmwiki-20240501-page_props.sql.gz 5021 5221

With enwiki and dewiki, the Wikimedia dump contains multiple INSERT statements. Our current SQLReader implementation does not handle this, which explains the delta.

With rmwiki, our implementation returns more entries than zgrep; the first three added items are 14571,Q1534701, 14572,Q22593703 and 14573,Q22699664. This might be due to unescaped zero bytes in the Wikimedia dump, which would be a bug on the Wikimedia side, but we need to double-check this.

However, for our unit tests the implementation seems to work just fine, returning all expected data.

brawer commented 4 months ago

Not sure why zgrep would report wrong numbers, but it does. The actual truth is in the database, so let's ask MariaDB directly:

$ ssh login.toolforge.org
sascha@tools-bastion-13:~$ sql rmwiki
MariaDB [rmwiki_p]> SELECT COUNT(*) FROM page_props WHERE pp_propname = 'wikibase_item';
+----------+
| COUNT(*) |
+----------+
|     5221 |
+----------+
1 row in set (0.005 sec)

On May 7, 2024, the reported counts were 9650538 for enwiki, 3517155 for dewiki, and 5221 for rmwiki. When going over Mediawiki dumps from May 1, 2024, we should get numbers in roughly that area.

brawer commented 4 months ago

After the code changes in merge request #31, the numbers for the dumps of May 1, 2024, are:

Wikimedia dump MariaDB (May 7) SQLReader (May 1) Delta
enwiki-20240501-page_props.sql.gz 9650538 9647060 3478
dewiki-20240501-page_props.sql.gz 3517155 3515356 1799
rmwiki-20240501-page_props.sql.gz 5221 5221 0

The deltas seem reasonable. The dumps are almost a week older than the live database, so a small amount of drift is to be expected.

brawer commented 4 months ago

For the record, I’ve deleted all cached page_entities files with the following command. The next pipeline run will recompute the data files. We’re not using these files yet for computing QRank, but that way we won’t have polluted data when we do the switch-over to the new pipeline.

$ s3cmd rm s3://qrank/page_entities/*