brawer / wikidata-qrank

Ranking signals for Wikidata
https://qrank.wmcloud.org
MIT License

SQL parser cannot handle all page_props #26

Closed: brawer closed this issue 4 months ago

brawer commented 4 months ago

The current SQL parser (introduced for https://github.com/brawer/wikidata-qrank/issues/23) fails on one of the *-page_props.sql.gz dumps. The dumps are processed concurrently in a pool of runtime.NumCPU() workers and the error message does not name the file, so we first need to figure out exactly which file the parser cannot handle. Logs:

2024/05/06 19:22:12 main.go:119: found wikimedia dumps for 984 sites
2024/05/06 19:22:12 pageentities.go:71: building page_entities/wikimania2013wiki-20240501-page_entities.zst
2024/05/06 19:22:12 pageentities.go:71: building page_entities/aswiktionary-20240501-page_entities.zst
2024/05/06 19:22:12 pageentities.go:71: building page_entities/towiktionary-20240501-page_entities.zst
2024/05/06 19:22:12 pageentities.go:71: building page_entities/tswiki-20240501-page_entities.zst
2024/05/06 19:22:12 pageentities.go:71: building page_entities/sawikiquote-20240501-page_entities.zst
2024/05/06 19:22:12 pageentities.go:71: building page_entities/mtwiki-20240501-page_entities.zst
2024/05/06 19:22:12 pageentities.go:71: building page_entities/hewikiquote-20240501-page_entities.zst
2024/05/06 19:22:12 pageentities.go:71: building page_entities/svwiktionary-20240501-page_entities.zst
2024/05/06 19:22:13 pageentities.go:71: building page_entities/skrwiki-20240501-page_entities.zst
2024/05/06 19:22:13 main.go:63: ComputeQRank failed: sql parse error

brawer commented 4 months ago

There’s a \' sequence (a backslash-escaped single quote inside a string literal) in hewikiquote-20240501-page_props.sql.gz that our SQL lexer does not seem to handle.

INSERT INTO `page_props` VALUES (18,'defaultsort','\327\244\327\250\327\240\327\247\327\234\327\231\327\237, \327\221\327\240\327\222\'\327\236\327\231\327\237',NULL);
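
For illustration, here is a rough sketch of how a lexer could read a single-quoted MySQL string literal while honoring backslash escapes and doubled quotes. This is not the project's actual lexer; `unquoteSQL` and `unescape` are hypothetical names, and the sketch assumes the default MySQL escape rules (it does not give `\%` or `\_` any special treatment).

```go
package main

import (
	"errors"
	"strings"
)

// unquoteSQL (hypothetical) decodes the single-quoted string literal at
// the start of s. It returns the decoded value and the number of input
// bytes consumed, including both quotes.
func unquoteSQL(s string) (string, int, error) {
	if len(s) == 0 || s[0] != '\'' {
		return "", 0, errors.New("expected opening quote")
	}
	var sb strings.Builder
	for i := 1; i < len(s); i++ {
		c := s[i]
		switch {
		case c == '\\':
			// Backslash escape: the next byte decides the meaning.
			i++
			if i >= len(s) {
				return "", 0, errors.New("unterminated escape sequence")
			}
			sb.WriteByte(unescape(s[i]))
		case c == '\'':
			// Two quotes in a row stand for one literal quote.
			if i+1 < len(s) && s[i+1] == '\'' {
				sb.WriteByte('\'')
				i++
				continue
			}
			// A lone quote ends the literal.
			return sb.String(), i + 1, nil
		default:
			sb.WriteByte(c)
		}
	}
	return "", 0, errors.New("unterminated string literal")
}

// unescape maps the byte following a backslash to the byte it denotes.
func unescape(c byte) byte {
	switch c {
	case '0':
		return 0
	case 'b':
		return '\b'
	case 'n':
		return '\n'
	case 'r':
		return '\r'
	case 't':
		return '\t'
	case 'Z':
		return 26 // Ctrl-Z
	default:
		// Covers \', \" and \\; other bytes stand for themselves.
		return c
	}
}
```

The key point is that a quote preceded by a backslash must not terminate the literal; only an unescaped, undoubled quote does. Working on bytes rather than runes keeps multi-byte UTF-8 values, such as the Hebrew sort key above, intact.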