ComputeQRank fails building monthly pageviews for April 2021

The following string is passed to combineCounts() in pageviews.go. It contains the UTF-8 sequence C2 85, which encodes codepoint U+0085 NEXT LINE (NEL). We use strings.Fields() to split the string, and that function treats U+0085 as a field delimiter because it is a whitespace character according to unicode.IsSpace(). As a result, the line splits into three fields instead of the expected two. To become more robust, we should change pageviews.go to split only at U+0020, and also ignore any count entries that don’t have exactly two columns. (In retrospect, perhaps we should have used a less fragile encoding than space-separated text.)

00000000  66 6f 75 6e 64 61 74 69  6f 6e 2e 77 69 6b 69 6d  |foundation.wikim|
00000010  65 64 69 61 2f 75 73 65  72 3a 74 72 c3 a1 c2 ba  |edia/user:tr....|
00000020  c2 a7 6e 5f 6e 67 75 79  c3 a1 c2 bb c2 85 6e 5f  |..n_nguy......n_|
00000030  6d 69 6e 68 5f 68 75 79  20 31                    |minh_huy 1|
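
A minimal sketch of the proposed fix, assuming each count entry has the form "key count". The helper parseCount and the constant sampleLine are hypothetical names used for illustration only; they are not the actual code in pageviews.go. The sketch splits only at U+0020 and rejects any entry that does not have exactly two columns.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sampleLine reproduces the bytes from the hex dump above.
// "\u0085" is encoded in UTF-8 as C2 85.
const sampleLine = "foundation.wikimedia/user:tr\u00e1\u00ba\u00a7n_nguy\u00e1\u00bb\u0085n_minh_huy 1"

// parseCount is a hypothetical helper (not taken from pageviews.go).
// It splits a line only at U+0020 and reports ok=false unless the line
// has exactly two columns: a non-empty key and a non-negative count.
func parseCount(line string) (key string, count int64, ok bool) {
	cols := strings.Split(line, " ")
	if len(cols) != 2 || cols[0] == "" {
		return "", 0, false
	}
	n, err := strconv.ParseInt(cols[1], 10, 64)
	if err != nil || n < 0 {
		return "", 0, false
	}
	return cols[0], n, true
}

func main() {
	// strings.Fields splits at every unicode.IsSpace character,
	// including U+0085, so the sample line yields three fields.
	fmt.Println("fields:", len(strings.Fields(sampleLine)))

	// Splitting only at U+0020 keeps the key intact.
	key, count, ok := parseCount(sampleLine)
	fmt.Printf("key=%q count=%d ok=%v\n", key, count, ok)

	// Entries without exactly two columns are ignored.
	_, _, ok = parseCount("too many columns 1")
	fmt.Println("malformed ok:", ok)
}
```

With this approach, a key containing U+0085 (or any other non-ASCII whitespace such as U+00A0) stays in one piece, and a line with a stray U+0020 inside the key is skipped rather than misparsed.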
