I think this is a long-standing bug. In the dump (e.g. enwiki-latest-page.sql.gz), titles that contain a single quote have it escaped with a backslash, e.g. 'Pallas\'s_cat'. But our code to read it has:
csv.reader(file, quotechar="'", doublequote=True)
It does not pass an escapechar, which means that csv assumes that single quotes are escaped by doubling them, e.g. 'Pallas''s_cat'.
Bottom line is that we fail to read these titles correctly and end up ignoring their page size, which in turn messes up the popularity calculation for those items.
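To make the mismatch concrete, here is a minimal sketch that parses one backslash-escaped row with the current settings and with an escapechar. The sample row is made up and simplified (the real page table has more columns), parse_row is a hypothetical helper rather than anything in our code, and swapping doublequote for escapechar="\\" is only one possible fix that I have not run against a full dump.

```python
import csv
import io

# Hypothetical, simplified row in the style of a MySQL dump value list.
# The real page table has more columns, and these numbers are made up.
SAMPLE_ROW = r"10,0,'Pallas\'s_cat',0,12345"


def parse_row(line, **reader_kwargs):
    """Parse one comma-separated line with csv.reader and return its fields."""
    return next(csv.reader(io.StringIO(line), quotechar="'", **reader_kwargs))


# Current settings: a quote only counts as escaped when it is doubled,
# so the backslash is kept as ordinary text and the quote after it is
# treated as the end of the quoted section.
current = parse_row(SAMPLE_ROW, doublequote=True)

# Possible fix (an assumption, not tested on a real dump): treat the
# backslash as the escape character instead of relying on doubled quotes.
fixed = parse_row(SAMPLE_ROW, doublequote=False, escapechar="\\")

print("current:", current[2])
print("fixed:  ", fixed[2])
```

With the escapechar set, the title field comes back as Pallas's_cat, while the current settings give a mangled title that would not match the real page name.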
/cc @hyanwong