brawer / wikidata-qrank

Ranking signals for Wikidata
https://qrank.wmcloud.org
MIT License

ComputeQRank fails building monthly pageviews for April 2021 #3

Closed: brawer closed this issue 3 years ago

brawer commented 3 years ago

Somewhere in the pageview_complete dumps of April 2021, there’s a line that makes the parser fail. Change the parser to be more resilient.

2021/05/17 14:39:10 pageviews.go:55: building monthly pageviews for 2021-04
2021/05/17 16:31:38 main.go:24: ComputeQRank failed: strconv.ParseInt: parsing "n_minh_huy": invalid syntax

brawer commented 3 years ago

The following string is getting passed to combineCounts() in pageviews.go. It contains the UTF-8 sequence C2 85 for codepoint U+0085 NEXT LINE (NEL). We use strings.Fields() to split the string, and that function treats U+0085 as a field delimiter because it is a whitespace character according to unicode.IsSpace(). The string therefore splits into three fields instead of two, with "n_minh_huy" in the second position, which matches the strconv.ParseInt error above. To become more robust, we should change pageviews.go to split only at U+0020, and also ignore any count entries that don’t have exactly two columns. (In retrospect, perhaps we should have used a less fragile encoding than space-separated text, but oh well.)

00000000  66 6f 75 6e 64 61 74 69  6f 6e 2e 77 69 6b 69 6d  |foundation.wikim|
00000010  65 64 69 61 2f 75 73 65  72 3a 74 72 c3 a1 c2 ba  |edia/user:tr....|
00000020  c2 a7 6e 5f 6e 67 75 79  c3 a1 c2 bb c2 85 6e 5f  |..n_nguy......n_|
00000030  6d 69 6e 68 5f 68 75 79  20 31                    |minh_huy 1|