OneZoom / tree-build

Scripts for assembling the tree, metadata and downstream data products such as popularity and popular images
MIT License

Logic to parse wikipedia SQL dump file handles quotes incorrectly #50

Closed davidebbo closed 1 month ago

davidebbo commented 2 months ago

I think this is a long-standing bug. In the dump (e.g. enwiki-latest-page.sql.gz), entries look like this:

(273706,0,'Pallas\'s_cat',0,0,0.96134856108639,'20230601081506','20230601081534',1152795453,59673,'wikitext',NULL)

Note how the quote in Pallas's_cat is escaped with a backslash. But our code to read it has:

    csv.reader(file, quotechar="'", doublequote=True)

It does not pass an escapechar, which means that csv assumes that single quotes are escaped by doubling them, e.g. 'Pallas''s_cat'.

Bottom line is that we fail to read these correctly, and end up ignoring their page size, which in turn messes up the popularity calculation for those items.
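
A minimal sketch of the difference, assuming the row has already been stripped of its surrounding parentheses. The names `SAMPLE` and `read_title` are made up for illustration and are not taken from the repository; the second call shows one possible way to configure the reader, not necessarily the fix that should go in.

```python
import csv
import io

# The row from the dump quoted above, without the enclosing parentheses.
# A raw string keeps the backslash before the quote exactly as in the dump.
SAMPLE = (
    r"273706,0,'Pallas\'s_cat',0,0,0.96134856108639,"
    r"'20230601081506','20230601081534',1152795453,59673,'wikitext',NULL"
)


def read_title(line, **csv_kwargs):
    """Parse one row with csv.reader and return the page title (third column).

    Hypothetical helper for illustration only.
    """
    reader = csv.reader(io.StringIO(line), quotechar="'", **csv_kwargs)
    return next(reader)[2]


# Current settings: csv expects quotes to be escaped by doubling (''),
# so the backslash-escaped quote is misread and the title comes out garbled.
broken = read_title(SAMPLE, doublequote=True)

# Possible alternative: tell csv that a backslash escapes the next character.
fixed = read_title(SAMPLE, doublequote=False, escapechar="\\")

print(repr(broken))
print(repr(fixed))
```

With the second configuration the title should come back as Pallas's_cat, so a lookup by title would match again.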

/cc @hyanwong

hyanwong commented 2 months ago

Ah, really well spotted. Thanks @davidebbo. This "quote escaping" can be quite tricky and I am not surprised that we have a bug there.