Closed brawer closed 6 months ago
Current logs:
2024/05/15 14:22:22 main.go:47: qrank-builder starting up
2024/05/15 14:22:27 build.go:23: found wikimedia dumps for 984 sites
2024/05/15 14:22:27 itemsignals.go:155: building public/item_signals-20240512.csv.zst
2024/05/15 14:22:27 pagesignals.go:385: PageSignalsScanner.Scan(): opening page_signals/aawikibooks-20240501-page_signals.zst
2024/05/15 14:22:32 itemsignals.go:192: BuildItemSignals(): merging signals from 53 files; #0=PageSignalsScanner; rest=pageviews
2024/05/15 14:22:32 pagesignals.go:385: PageSignalsScanner.Scan(): opening page_signals/aawiki-20240501-page_signals.zst
2024/05/15 14:22:33 pagesignals.go:385: PageSignalsScanner.Scan(): opening page_signals/aawiktionary-20240501-page_signals.zst
2024/05/15 14:22:34 pagesignals.go:385: PageSignalsScanner.Scan(): opening page_signals/abwiki-20240501-page_signals.zst
[...]
2024/05/15 14:52:22 pagesignals.go:385: PageSignalsScanner.Scan(): opening page_signals/zh_min_nanwiktionary-20240501-page_signals.zst
2024/05/15 14:52:22 pagesignals.go:385: PageSignalsScanner.Scan(): opening page_signals/zh_yuewiki-20240501-page_signals.zst
2024/05/15 14:52:23 linemerger.go:66: LineMerger: scanner #38 failed, err=unexpected EOF
2024/05/15 14:52:23 itemsignals.go:206: ItemSignalsJoiner: read 3366859624 lines
2024/05/15 14:52:23 itemsignals.go:208: LineMerger failed: unexpected EOF
2024/05/15 14:52:23 itemsignals.go:215: BuildItemSignals(): start sorting
2024/05/15 14:52:23 main.go:63: ComputeQRank failed: unexpected EOF
File 38 is a pageviews file.
File 38 is a pageviews file.
Maybe, or maybe not. Actually the logged number is the current position in the min-heap that LineMerger uses for merging its input streams. It is not the index in the array of scanners passed to NewLineMerger()
. To figure out what’s going on, I changed LineMerger
to take names for its scanners; now it logs the name of the failing scanner in case of an error.
WIth verbose logging in place, the error gets reported for pageviews/pageviews-2024-W16.zst
:
[...]
2024/05/16 09:39:11 pagesignals.go:385: PageSignalsScanner.Scan(): opening page_signals/zhwikinews-20240501-page_signals.zst
2024/05/16 09:39:12 pagesignals.go:385: PageSignalsScanner.Scan(): opening page_signals/zhwiki-20240501-page_signals.zst
2024/05/16 09:39:15 linemerger.go:73: LineMerger: scanner "pageviews/pageviews-2024-W16.zst" failed, err=unexpected EOF
2024/05/16 09:39:15 itemsignals.go:209: ItemSignalsJoiner: read 3378429445 lines
2024/05/16 09:39:15 itemsignals.go:211: LineMerger failed: unexpected EOF
2024/05/16 09:39:15 itemsignals.go:218: BuildItemSignals(): start sorting
2024/05/16 09:39:15 main.go:63: ComputeQRank failed: unexpected EOF
However, that file looks completely normal:
$ s3cmd get s3://qrank/pageviews/pageviews-2024-W16.zst
$ zstdcat pageviews-2024-W16.zst >pageviews-2024-W16.txt
$ tail -4 pageviews-2024-W16.txt
zu.wiktionary,4708,1
zu.wiktionary,4710,6
zu.wiktionary,4711,3
zu.wiktionary,769,6
It also ends with a newline character, as expected:
$ hexdump -C pageviews-2024-W16.txt | tail -4
697423d0 34 37 31 30 2c 36 0a 7a 75 2e 77 69 6b 74 69 6f |4710,6.zu.wiktio|
697423e0 6e 61 72 79 2c 34 37 31 31 2c 33 0a 7a 75 2e 77 |nary,4711,3.zu.w|
697423f0 69 6b 74 69 6f 6e 61 72 79 2c 37 36 39 2c 36 0a |iktionary,769,6.|
69742400
The best bugs are non-deterministic.
First pipeline run (already reported earlier on this bug):
2024/05/16 09:39:12 pagesignals.go:385: PageSignalsScanner.Scan(): opening page_signals/zhwiki-20240501-page_signals.zst
2024/05/16 09:39:15 linemerger.go:73: LineMerger: scanner "pageviews/pageviews-2024-W16.zst" failed, err=unexpected EOF
2024/05/16 09:39:15 itemsignals.go:209: ItemSignalsJoiner: read 3378429445 lines
2024/05/16 09:39:15 itemsignals.go:211: LineMerger failed: unexpected EOF
2024/05/16 09:39:15 itemsignals.go:218: BuildItemSignals(): start sorting
2024/05/16 09:39:15 main.go:63: ComputeQRank failed: unexpected EOF
Second run, same code and same data:
2024/05/16 11:52:15 pagesignals.go:385: PageSignalsScanner.Scan(): opening page_signals/zhwiki-20240501-page_signals.zst
2024/05/16 11:52:17 linemerger.go:73: LineMerger: scanner "pageviews/pageviews-2023-W26.zst" failed, err=unexpected EOF
2024/05/16 11:52:17 itemsignals.go:209: ItemSignalsJoiner: read 3375560041 lines
2024/05/16 11:52:17 itemsignals.go:211: LineMerger failed: unexpected EOF
2024/05/16 11:52:17 itemsignals.go:218: BuildItemSignals(): start sorting
2024/05/16 11:52:17 main.go:63: ComputeQRank failed: unexpected EOF
Third run, again with same code and same data:
2024/05/16 12:32:34 pagesignals.go:385: PageSignalsScanner.Scan(): opening page_signals/zhwiki-20240501-page_signals.zst
2024/05/16 12:32:36 linemerger.go:73: LineMerger: scanner "pageviews/pageviews-2023-W44.zst" failed, err=unexpected EOF
2024/05/16 12:32:36 itemsignals.go:209: ItemSignalsJoiner: read 3373902078 lines
2024/05/16 12:32:36 itemsignals.go:211: LineMerger failed: unexpected EOF
2024/05/16 12:32:36 itemsignals.go:218: BuildItemSignals(): start sorting
2024/05/16 12:32:36 main.go:63: ComputeQRank failed: unexpected EOF
Logs from running the pipeline at commit b5537d75d6bffc95fe8ff5b6d387fc32249fcc70, which made logging even more verbose:
2024/05/16 13:23:39 pagesignals.go:391: PageSignalsScanner.Scan(): opening page_signals/zh_yuewiki-20240501-page_signals.zst
2024/05/16 13:23:42 pagesignals.go:391: PageSignalsScanner.Scan(): opening page_signals/zhwikibooks-20240501-page_signals.zst
2024/05/16 13:23:43 pagesignals.go:391: PageSignalsScanner.Scan(): opening page_signals/zhwikinews-20240501-page_signals.zst
2024/05/16 13:23:43 pagesignals.go:391: PageSignalsScanner.Scan(): opening page_signals/zhwiki-20240501-page_signals.zst
2024/05/16 13:23:44 linemerger.go:73: LineMerger: scanner "pageviews/pageviews-2024-W03.zst" failed, err=unexpected EOF
2024/05/16 13:23:44 itemsignals.go:210: ItemSignalsJoiner: read 3373055244 lines
2024/05/16 13:23:44 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/16 13:23:44 itemsignals.go:219: BuildItemSignals(): start sorting
2024/05/16 13:23:44 itemsignals.go:224: BuildItemSignals(): sorting canceled, groupCtx.Err()=context canceled
2024/05/16 13:23:44 itemsignals.go:243: BuildItemSignals(): group.Wait() failed, err==unexpected EOF
2024/05/16 13:23:44 main.go:63: ComputeQRank failed: unexpected EOF
Some observations from staring at the logs:
BuildItemSignals()
fails in the consuming goroutine while input data is still being read. While the external sorting and the output generation do an early exit, that’s only because the errgroup’s context is canceled due to the read error. So, the bug seems to be localized within the reading code.zhwiki-20240501-page_signals.zst
is the last file being opened. It’s actually not the last wiki project. The following files come after; however, note that we (need to) sort the processing by abbreviated Wikimedia domain, not Wikimedia project ID, which may or may not explain this.2024-05-11 12:58 1350153 s3://qrank/page_signals/zh_yuewiki-20240501-page_signals.zst
2024-05-11 13:19 16340759 s3://qrank/page_signals/zhwiki-20240501-page_signals.zst
2024-05-11 13:16 25961 s3://qrank/page_signals/zhwikibooks-20240501-page_signals.zst
2024-05-11 13:10 171748 s3://qrank/page_signals/zhwikinews-20240501-page_signals.zst
2024-05-11 13:18 63711 s3://qrank/page_signals/zhwikiquote-20240501-page_signals.zst
2024-05-11 13:11 2419930 s3://qrank/page_signals/zhwikisource-20240501-page_signals.zst
2024-05-11 13:21 27910 s3://qrank/page_signals/zhwikiversity-20240501-page_signals.zst
2024-05-11 13:03 59625 s3://qrank/page_signals/zhwikivoyage-20240501-page_signals.zst
2024-05-11 12:59 973478 s3://qrank/page_signals/zhwiktionary-20240501-page_signals.zst
2024-05-11 12:59 95906 s3://qrank/page_signals/zuwiki-20240501-page_signals.zst
2024-05-11 12:57 103 s3://qrank/page_signals/zuwikibooks-20240501-page_signals.zst
2024-05-11 13:18 523 s3://qrank/page_signals/zuwiktionary-20240501-page_signals.zst
In our processing order, zh.wikipedia
is the 975th project whose page signals are being read from S3 storage. Afterwards, there would still be a few more: zh.wikiquote
, zh.wikisource
, zh.wikiversity
, zh.wikivoyage
, zh.wiktionary
, zu.wikibooks
, zu.wikipedia
and zu.wiktionary
.
However, there seems nothing special about s3://qrank/page_signals/zhwiki-20240501-page_signals.zst
. The lines in that file have the same structure as in all other files, confirmed by egrep:
$ s3cmd get s3://qrank/page_signals/zhwiki-20240501-page_signals.zst
$ zstdcat zhwiki-20240501-page_signals.zst | egrep -v "\d+,Q\d+,\d*" | wc -l
0
Some thoughts:
zh.wikipedia
?The pipeline also fails after a hack that restricts the input to just one single pageviews file (in addition to page signals):
2024/05/16 22:49:12 linemerger.go:73: LineMerger: scanner "pageviews/pageviews-2024-W19.zst" failed, err=unexpected EOF
2024/05/16 22:49:12 itemsignals.go:217: ItemSignalsJoiner: read 271843468 lines
2024/05/16 22:49:12 itemsignals.go:219: LineMerger failed: unexpected EOF
2024/05/16 22:49:12 itemsignals.go:226: BuildItemSignals(): start sorting
2024/05/16 22:49:12 itemsignals.go:231: BuildItemSignals(): sorting canceled, groupCtx.Err()=context canceled
2024/05/16 22:49:12 itemsignals.go:250: BuildItemSignals(): group.Wait() failed, err==unexpected EOF
2024/05/16 22:49:12 main.go:63: ComputeQRank failed: unexpected EOF
Interestingly, there’s no hiccups if the pipeline ignores everything but zh.wikipedia.org
and zh.wikiquote.org
as in commit 6abd16124aedfa50da4cefcbe858b894c52dd86b.
2024/05/17 12:10:03 main.go:47: qrank-builder starting up
2024/05/17 12:10:09 build.go:23: found wikimedia dumps for 984 sites
2024/05/17 12:10:09 itemsignals.go:155: building public/item_signals-20240512.csv.zst
2024/05/17 12:10:09 pagesignals.go:356: NewPageSignalsScanner: paths[0]="page_signals/zhwiki-20240501-page_signals.zst", domains[0]="zh.wikipedia"
2024/05/17 12:10:09 pagesignals.go:356: NewPageSignalsScanner: paths[1]="page_signals/zhwikiquote-20240501-page_signals.zst", domains[1]="zh.wikiquote"
2024/05/17 12:10:09 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/zhwiki-20240501-page_signals.zst, curDomain=0, domain="zh.wikipedia"
2024/05/17 12:10:10 itemsignals.go:212: BuildItemSignals(): merging signals from 2 files; #0=PageSignalsScanner; rest=pageviews
2024/05/17 12:10:34 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/zhwikiquote-20240501-page_signals.zst, curDomain=1, domain="zh.wikiquote"
2024/05/17 12:10:34 pagesignals.go:394: PageSignalsScanner.Scan(): finished last domain
2024/05/17 12:10:34 pagesignals.go:423: PageSignalsScanner.Scan(): cleaning up
2024/05/17 12:10:34 itemsignals.go:227: ItemSignalsJoiner: read 74194066 lines
2024/05/17 12:10:34 itemsignals.go:236: BuildItemSignals(): start sorting
2024/05/17 12:10:42 itemsignals.go:267: BuildItemSignals(): finished sorting
2024/05/17 12:10:42 main.go:68: qrank-builder exiting
The current pipeline at bd246b2fa8cec8a22a81a1db3bf56b80a91cf2b5 now stress-tests LineMerger without creating any output. It seems to run fine. But to be really sure, I’ve scheduled hourly pipeline runs, and will just watch them for a couple days. Once we have logs from 100 runs or so, and they’ve all passed, we can look further. Another option might be to skip zh.wikipedia.org
in production builds, but it would be really nice to avoid such denylists and instead process all the data.
2024/05/17 15:02:27 main.go:47: qrank-builder starting up
2024/05/17 15:02:33 build.go:23: found wikimedia dumps for 984 sites
2024/05/17 15:02:33 itemsignals.go:155: building public/item_signals-20240512.csv.zst
2024/05/17 15:02:33 pagesignals.go:356: NewPageSignalsScanner: paths[0]="page_signals/aawikibooks-20240501-page_signals.zst", domains[0]="aa.wikibooks"
2024/05/17 15:02:33 pagesignals.go:356: NewPageSignalsScanner: paths[1]="page_signals/aawiki-20240501-page_signals.zst", domains[1]="aa.wikipedia"
2024/05/17 15:02:33 pagesignals.go:356: NewPageSignalsScanner: paths[2]="page_signals/aawiktionary-20240501-page_signals.zst", domains[2]="aa.wiktionary"
2024/05/17 15:02:33 pagesignals.go:356: NewPageSignalsScanner: paths[3]="page_signals/abwiki-20240501-page_signals.zst", domains[3]="ab.wikipedia"
2024/05/17 15:02:33 pagesignals.go:356: NewPageSignalsScanner: paths[975]="page_signals/zhwiki-20240501-page_signals.zst", domains[975]="zh.wikipedia"
2024/05/17 15:02:33 pagesignals.go:356: NewPageSignalsScanner: paths[976]="page_signals/zhwikiquote-20240501-page_signals.zst", domains[976]="zh.wikiquote"
2024/05/17 15:02:33 pagesignals.go:356: NewPageSignalsScanner: paths[977]="page_signals/zhwikisource-20240501-page_signals.zst", domains[977]="zh.wikisource"
2024/05/17 15:02:33 pagesignals.go:356: NewPageSignalsScanner: paths[978]="page_signals/zhwikiversity-20240501-page_signals.zst", domains[978]="zh.wikiversity"
2024/05/17 15:02:33 pagesignals.go:356: NewPageSignalsScanner: paths[979]="page_signals/zhwikivoyage-20240501-page_signals.zst", domains[979]="zh.wikivoyage"
2024/05/17 15:02:33 pagesignals.go:356: NewPageSignalsScanner: paths[980]="page_signals/zhwiktionary-20240501-page_signals.zst", domains[980]="zh.wiktionary"
2024/05/17 15:02:33 pagesignals.go:356: NewPageSignalsScanner: paths[981]="page_signals/zuwikibooks-20240501-page_signals.zst", domains[981]="zu.wikibooks"
2024/05/17 15:02:33 pagesignals.go:356: NewPageSignalsScanner: paths[982]="page_signals/zuwiki-20240501-page_signals.zst", domains[982]="zu.wikipedia"
2024/05/17 15:02:33 pagesignals.go:356: NewPageSignalsScanner: paths[983]="page_signals/zuwiktionary-20240501-page_signals.zst", domains[983]="zu.wiktionary"
2024/05/17 15:02:33 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/aawikibooks-20240501-page_signals.zst, curDomain=0, domain="aa.wikibooks"
2024/05/17 15:02:39 itemsignals.go:192: BuildItemSignals(): start testing LineMerger
2024/05/17 15:02:39 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/aawiki-20240501-page_signals.zst, curDomain=1, domain="aa.wikipedia"
2024/05/17 15:02:39 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/aawiktionary-20240501-page_signals.zst, curDomain=2, domain="aa.wiktionary"
2024/05/17 15:02:40 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/abwiki-20240501-page_signals.zst, curDomain=3, domain="ab.wikipedia"
2024/05/17 15:02:40 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/abwiktionary-20240501-page_signals.zst, curDomain=4, domain="ab.wiktionary"
2024/05/17 15:02:40 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/acewiki-20240501-page_signals.zst, curDomain=5, domain="ace.wikipedia"
[...]
2024/05/17 15:22:22 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/zh_yuewiki-20240501-page_signals.zst, curDomain=972, domain="zh-yue.wikipedia"
2024/05/17 15:22:24 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/zhwikibooks-20240501-page_signals.zst, curDomain=973, domain="zh.wikibooks"
2024/05/17 15:22:24 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/zhwikinews-20240501-page_signals.zst, curDomain=974, domain="zh.wikinews"
2024/05/17 15:22:25 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/zhwiki-20240501-page_signals.zst, curDomain=975, domain="zh.wikipedia"
2024/05/17 15:22:52 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/zhwikiquote-20240501-page_signals.zst, curDomain=976, domain="zh.wikiquote"
2024/05/17 15:22:52 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/zhwikisource-20240501-page_signals.zst, curDomain=977, domain="zh.wikisource"
2024/05/17 15:22:55 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/zhwikiversity-20240501-page_signals.zst, curDomain=978, domain="zh.wikiversity"
2024/05/17 15:22:55 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/zhwikivoyage-20240501-page_signals.zst, curDomain=979, domain="zh.wikivoyage"
2024/05/17 15:22:56 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/zhwiktionary-20240501-page_signals.zst, curDomain=980, domain="zh.wiktionary"
2024/05/17 15:23:01 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/zuwikibooks-20240501-page_signals.zst, curDomain=981, domain="zu.wikibooks"
2024/05/17 15:23:02 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/zuwiki-20240501-page_signals.zst, curDomain=982, domain="zu.wikipedia"
2024/05/17 15:23:02 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/zuwiktionary-20240501-page_signals.zst, curDomain=983, domain="zu.wiktionary"
2024/05/17 15:23:02 pagesignals.go:394: PageSignalsScanner.Scan(): finished last domain
2024/05/17 15:23:02 pagesignals.go:423: PageSignalsScanner.Scan(): cleaning up
2024/05/17 15:23:02 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/17 15:23:02 main.go:68: qrank-builder exiting
Currently, the pipeline (built at commit bd246b2fa8cec8a22a81a1db3bf56b80a91cf2b5) just counts the number of lines returned by LineMerger. Despite the trivial implementation, we still see failures. Interestingly, the failures are intermittent: we always process the same data with the same code, but some of the hourly runs succeed while others fail.
It’s tempting to blame Wikimedia’s datacenter for being flaky, but it seems very unlikely that intermittent network problems would always manifest while reading internal data files for zh.wikipedia.org
.
Looking for potential sources of non-determinism, one thing comes to mind: To merge 52 weekly pageview counts with other page signals, LineMerger operates on a min-heap of 53 input streams. In the current implementation, these are partially ordered. If a Wikipedia page happens to have the same pageview counts in multiple weeks, any one of those weekly files may become the next top of heap. While this does not affect the correctness of the computed output (the returned lines are in correct order, as verified) it makes the processing order non-determinisitic.
2024/05/17 17:12:46 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/17 18:20:51 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/17 19:11:57 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/17 20:33:09 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/17 21:36:52 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/17 22:38:35 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/17 23:29:53 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/18 00:11:36 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/18 04:03:54 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/18 05:11:37 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/18 06:14:58 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
Removed the source of non-determinism, deployed a fresh binary at commit a4e93bf98d21eb4dcf18c78b8cff3cf3cc26d0da, scheduled hourly runs. Let’s see what happens.
We still see intermittent failures: Sometimes it’s fine, sometimes we get an EOF error.
2024/05/18 10:32:36 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/18 11:22:55 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/18 12:25:36 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/18 13:21:17 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/18 14:51:05 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/18 15:37:17 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/18 16:22:00 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/18 17:47:09 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/18 18:21:15 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/18 19:52:18 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/18 20:29:56 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/18 21:41:01 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/18 22:58:14 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/19 00:40:28 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/19 01:21:03 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/19 03:29:30 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/19 04:43:15 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/19 05:22:19 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/19 06:37:20 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/19 07:30:05 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/19 09:29:48 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/19 10:58:33 itemsignals.go:212: LineMerger failed: unexpected EOF
LineMerger now logs its internal state upon error, due to commit ffd719145eb06a2824eaff1d31a12d50eefa1f6a. Last night, it ran seven times with the new code. Three of these runs (same code on same data) were successful, four have failed:
2024/05/19 17:19:59 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/19 19:04:39 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/19 21:12:06 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/20 00:59:11 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/20 02:54:32 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/20 03:25:12 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/20 05:18:52 itemsignals.go:212: LineMerger failed: unexpected EOF
As observed before, the unexpected EOF errors happened non-deterministically at different offsets in different files. However, if an error occurs, it’s always while processing the Chinese Wikipedia.
2024/05/20 00:59:11 linemerger.go:82: LineMerger: scanner "pageviews/pageviews-2024-W12.zst" failed, lastLine="zh.wikipedia,2002412,3", err=unexpected EOF
2024/05/20 02:54:32 linemerger.go:82: LineMerger: scanner "pageviews/pageviews-2023-W45.zst" failed, lastLine="zh.wikipedia,2375954,1", err=unexpected EOF
2024/05/20 03:25:12 linemerger.go:82: LineMerger: scanner "pageviews/pageviews-2023-W48.zst" failed, lastLine="zh.wikipedia,2790350,1", err=unexpected EOF
2024/05/20 05:18:52 linemerger.go:82: LineMerger: scanner "pageviews/pageviews-2024-W13.zst" failed, lastLine="zh.wikipedia,2901336,3", err=unexpected EOF
Here’s when these jobs have started up:
2024/05/19 16:57:26 main.go:47: qrank-builder starting up
2024/05/19 18:41:58 main.go:47: qrank-builder starting up
2024/05/19 20:49:31 main.go:47: qrank-builder starting up
2024/05/20 00:33:43 main.go:47: qrank-builder starting up
2024/05/20 02:28:38 main.go:47: qrank-builder starting up
2024/05/20 02:59:02 main.go:47: qrank-builder starting up
2024/05/20 04:53:31 main.go:47: qrank-builder starting up
Looking at how long it took from startup to either success or failure, there seems to be an interesting pattern here: the successful jobs are faster than the failures. This smells heavlily like some component is setting a timeout on how long the input stream may stay open. And instead of reporting the timeout as an error, it just closes the stream??? How nasty.
Duration | Status |
---|---|
00:22:33 | OK |
00:22:41 | OK |
00:22:35 | OK |
00:25:28 | Failure |
00:25:54 | Failure |
00:26:10 | Failure |
00:25:21 | Failure |
Next step: Analyze the runtimes of all previous runs.
Hm, it’s not really a consistent picture:
Termination | Runtime | Status |
---|---|---|
2024/05/14 08:52:43 | 00:30:09 | Failure |
2024/05/14 13:19:21 | 00:31:08 | Failure |
2024/05/14 17:04:17 | 00:29:29 | Failure |
2024/05/15 12:40:04 | 00:29:03 | Failure |
2024/05/15 14:52:23 | 00:30:01 | Failure |
2024/05/16 09:39:15 | 00:32:15 | Failure |
2024/05/16 11:52:17 | 00:33:19 | Failure |
2024/05/16 12:32:36 | 00:31:51 | Failure |
2024/05/16 13:23:44 | 00:32:34 | Failure |
2024/05/16 15:06:50 | 00:32:41 | Failure |
2024/05/17 14:41:38 | 00:22:22 | Failure |
2024/05/17 15:23:02 | 00:20:35 | OK |
2024/05/17 17:12:46 | 00:22:14 | OK |
2024/05/17 18:20:51 | 00:21:56 | Failure |
2024/05/17 19:11:57 | 00:22:08 | Failure |
2024/05/17 20:33:09 | 00:21:53 | OK |
2024/05/17 21:36:52 | 00:21:52 | Failure |
2024/05/17 22:38:35 | 00:21:56 | OK |
2024/05/17 23:29:53 | 00:22:11 | OK |
2024/05/18 00:11:36 | 00:22:14 | OK |
2024/05/18 04:03:54 | 00:22:21 | Failure |
2024/05/18 05:11:37 | 00:22:01 | OK |
2024/05/18 06:14:58 | 00:21:59 | OK |
2024/05/18 10:32:36 | 00:24:36 | Failure |
2024/05/18 11:22:55 | 00:24:28 | OK |
2024/05/18 12:25:36 | 00:24:17 | OK |
2024/05/18 13:21:17 | 00:24:03 | Failure |
2024/05/18 14:51:05 | 00:24:14 | Failure |
2024/05/18 15:37:17 | 00:24:18 | Failure |
2024/05/18 16:22:00 | 00:24:27 | Failure |
2024/05/18 17:47:09 | 00:24:10 | Failure |
2024/05/18 18:21:15 | 00:24:26 | Failure |
2024/05/18 19:52:18 | 00:24:31 | OK |
2024/05/18 20:29:56 | 00:24:26 | OK |
2024/05/18 21:41:01 | 00:24:26 | OK |
2024/05/18 22:58:14 | 00:24:20 | OK |
2024/05/19 00:40:28 | 00:23:58 | Failure |
2024/05/19 01:21:03 | 00:24:20 | OK |
2024/05/19 03:29:30 | 00:24:13 | Failure |
2024/05/19 04:43:15 | 00:24:48 | Failure |
2024/05/19 05:22:19 | 00:24:26 | OK |
2024/05/19 06:37:20 | 00:24:08 | Failure |
2024/05/19 07:30:05 | 00:24:26 | Failure |
2024/05/19 09:29:48 | 00:23:58 | Failure |
2024/05/19 17:19:59 | 00:22:33 | OK |
2024/05/19 19:04:39 | 00:22:41 | OK |
2024/05/19 21:12:06 | 00:22:35 | OK |
2024/05/20 00:59:11 | 00:25:28 | Failure |
2024/05/20 02:54:32 | 00:25:54 | Failure |
2024/05/20 03:25:12 | 00:26:10 | Failure |
2024/05/20 05:18:52 | 00:25:21 | Failure |
2024/05/20 07:10:57 | 00:43:09 | Failure |
Commit 3081f6ee1e1b44223f6ceab5f8d6cb2047256e9a downloads all 52 pageviews files to /tmp
before processing. A first run in Wikimedia’s datacenter was successful, but that doesn’t mean much given the non-determinism. Let’s wait for the outcome of a couple hourly runs to see if this fixes the problem.
The fix seems to have worked.
It generally seems to work now, but we still see occasional network failures. Not sure if we can do anything here. We could implement exponential back-off, but it would be complicated. In case of failure, we already re-try a day later, which may be good enough as a work-around for the flakiness of Wikimedia’s datacenter.
2024/05/23 12:48:23 main.go:47: qrank-builder starting up
2024/05/23 12:48:31 build.go:23: found wikimedia dumps for 991 sites
2024/05/23 12:48:31 itemsignals.go:156: building public/item_signals-20240520.csv.zst
2024/05/23 12:54:40 pagesignals.go:381: PageSignalsScanner.Scan(): failed, domain=commons.wikimedia, err=read tcp 192.168.13.145:42564->185.15.56.161:443: read: connection reset by peer
2024/05/23 12:54:40 pagesignals.go:413: PageSignalsScanner.Scan(): cleaning up
2024/05/23 12:54:40 itemsignals.go:231: LineMerger failed: read tcp 192.168.13.145:42564->185.15.56.161:443: read: connection reset by peer
2024/05/23 12:54:40 itemsignals.go:242: BuildItemSignals(): sorting canceled, groupCtx.Err()=context canceled
2024/05/23 12:54:40 itemsignals.go:262: BuildItemSignals(): group.Wait() failed, err==read tcp 192.168.13.145:42564->185.15.56.161:443: read: connection reset by peer
2024/05/23 12:54:40 main.go:63: ComputeQRank failed: read tcp 192.168.13.145:42564->185.15.56.161:443: read: connection reset by peer
Several runs have finished. Closing this bug as fixed.
The current pipeline fails to build
item_signals
in production. However, the end-to-end integration test is passing perfectly fine.