The following string is getting passed to combineCounts() in pageviews.go. It contains the UTF-8 sequence C2 85 for codepoint U+0085 NEXT LINE (NEL). We use strings.Fields() to split the string, and that function recognizes U+0085 as a field delimiter because it is a whitespace character according to unicode.IsSpace(). To become more robust, we should change pageviews.go to split only at U+0020, and also ignore any count entries that don’t have exactly two columns. (In retrospect, perhaps we should have used a less fragile encoding than space-separated text, but well.)
00000000 66 6f 75 6e 64 61 74 69 6f 6e 2e 77 69 6b 69 6d |foundation.wikim|
00000010 65 64 69 61 2f 75 73 65 72 3a 74 72 c3 a1 c2 ba |edia/user:tr....|
00000020 c2 a7 6e 5f 6e 67 75 79 c3 a1 c2 bb c2 85 6e 5f |..n_nguy......n_|
00000030 6d 69 6e 68 5f 68 75 79 20 31 |minh_huy 1|
Somewhere in the pageview_complete dumps of April 2021, there’s a line that makes the parser fail. Change the parser to be more resilient.