brawer / wikidata-qrank

Ranking signals for Wikidata
https://qrank.wmcloud.org
MIT License
61 stars 5 forks source link

Pipeline fails to compute item_signals #40

Closed brawer closed 4 months ago

brawer commented 4 months ago

The current pipeline fails to build item_signals in production. However, the end-to-end integration test is passing perfectly fine.

2024/05/14 12:48:13 main.go:47: qrank-builder starting up
2024/05/14 12:48:17 main.go:120: found wikimedia dumps for 984 sites
2024/05/14 12:48:18 itemsignals.go:155: building public/item_signals-20240512.csv.zst
2024/05/14 13:19:21 main.go:63: ComputeQRank failed: unexpected EOF
brawer commented 4 months ago

Current logs:

2024/05/15 14:22:22 main.go:47: qrank-builder starting up
2024/05/15 14:22:27 build.go:23: found wikimedia dumps for 984 sites
2024/05/15 14:22:27 itemsignals.go:155: building public/item_signals-20240512.csv.zst
2024/05/15 14:22:27 pagesignals.go:385: PageSignalsScanner.Scan(): opening page_signals/aawikibooks-20240501-page_signals.zst
2024/05/15 14:22:32 itemsignals.go:192: BuildItemSignals(): merging signals from 53 files; #0=PageSignalsScanner; rest=pageviews
2024/05/15 14:22:32 pagesignals.go:385: PageSignalsScanner.Scan(): opening page_signals/aawiki-20240501-page_signals.zst
2024/05/15 14:22:33 pagesignals.go:385: PageSignalsScanner.Scan(): opening page_signals/aawiktionary-20240501-page_signals.zst
2024/05/15 14:22:34 pagesignals.go:385: PageSignalsScanner.Scan(): opening page_signals/abwiki-20240501-page_signals.zst
[...]
2024/05/15 14:52:22 pagesignals.go:385: PageSignalsScanner.Scan(): opening page_signals/zh_min_nanwiktionary-20240501-page_signals.zst
2024/05/15 14:52:22 pagesignals.go:385: PageSignalsScanner.Scan(): opening page_signals/zh_yuewiki-20240501-page_signals.zst
2024/05/15 14:52:23 linemerger.go:66: LineMerger: scanner #38 failed, err=unexpected EOF
2024/05/15 14:52:23 itemsignals.go:206: ItemSignalsJoiner: read 3366859624 lines
2024/05/15 14:52:23 itemsignals.go:208: LineMerger failed: unexpected EOF
2024/05/15 14:52:23 itemsignals.go:215: BuildItemSignals(): start sorting
2024/05/15 14:52:23 main.go:63: ComputeQRank failed: unexpected EOF

File 38 is a pageviews file.

brawer commented 4 months ago

File 38 is a pageviews file.

Maybe, or maybe not. Actually the logged number is the current position in the min-heap that LineMerger uses for merging its input streams. It is not the index in the array of scanners passed to NewLineMerger(). To figure out what’s going on, I changed LineMerger to take names for its scanners; now it logs the name of the failing scanner in case of an error.

brawer commented 4 months ago

WIth verbose logging in place, the error gets reported for pageviews/pageviews-2024-W16.zst:

[...]
2024/05/16 09:39:11 pagesignals.go:385: PageSignalsScanner.Scan(): opening page_signals/zhwikinews-20240501-page_signals.zst
2024/05/16 09:39:12 pagesignals.go:385: PageSignalsScanner.Scan(): opening page_signals/zhwiki-20240501-page_signals.zst
2024/05/16 09:39:15 linemerger.go:73: LineMerger: scanner "pageviews/pageviews-2024-W16.zst" failed, err=unexpected EOF
2024/05/16 09:39:15 itemsignals.go:209: ItemSignalsJoiner: read 3378429445 lines
2024/05/16 09:39:15 itemsignals.go:211: LineMerger failed: unexpected EOF
2024/05/16 09:39:15 itemsignals.go:218: BuildItemSignals(): start sorting
2024/05/16 09:39:15 main.go:63: ComputeQRank failed: unexpected EOF

However, that file looks completely normal:

$ s3cmd get s3://qrank/pageviews/pageviews-2024-W16.zst
$ zstdcat pageviews-2024-W16.zst >pageviews-2024-W16.txt
$ tail -4 pageviews-2024-W16.txt 
zu.wiktionary,4708,1
zu.wiktionary,4710,6
zu.wiktionary,4711,3
zu.wiktionary,769,6

It also ends with a newline character, as expected:

$ hexdump -C pageviews-2024-W16.txt | tail -4
697423d0  34 37 31 30 2c 36 0a 7a  75 2e 77 69 6b 74 69 6f  |4710,6.zu.wiktio|
697423e0  6e 61 72 79 2c 34 37 31  31 2c 33 0a 7a 75 2e 77  |nary,4711,3.zu.w|
697423f0  69 6b 74 69 6f 6e 61 72  79 2c 37 36 39 2c 36 0a  |iktionary,769,6.|
69742400
brawer commented 4 months ago

The best bugs are non-deterministic.

First pipeline run (already reported earlier on this bug):

2024/05/16 09:39:12 pagesignals.go:385: PageSignalsScanner.Scan(): opening page_signals/zhwiki-20240501-page_signals.zst
2024/05/16 09:39:15 linemerger.go:73: LineMerger: scanner "pageviews/pageviews-2024-W16.zst" failed, err=unexpected EOF
2024/05/16 09:39:15 itemsignals.go:209: ItemSignalsJoiner: read 3378429445 lines
2024/05/16 09:39:15 itemsignals.go:211: LineMerger failed: unexpected EOF
2024/05/16 09:39:15 itemsignals.go:218: BuildItemSignals(): start sorting
2024/05/16 09:39:15 main.go:63: ComputeQRank failed: unexpected EOF

Second run, same code and same data:

2024/05/16 11:52:15 pagesignals.go:385: PageSignalsScanner.Scan(): opening page_signals/zhwiki-20240501-page_signals.zst
2024/05/16 11:52:17 linemerger.go:73: LineMerger: scanner "pageviews/pageviews-2023-W26.zst" failed, err=unexpected EOF
2024/05/16 11:52:17 itemsignals.go:209: ItemSignalsJoiner: read 3375560041 lines
2024/05/16 11:52:17 itemsignals.go:211: LineMerger failed: unexpected EOF
2024/05/16 11:52:17 itemsignals.go:218: BuildItemSignals(): start sorting
2024/05/16 11:52:17 main.go:63: ComputeQRank failed: unexpected EOF

Third run, again with same code and same data:

2024/05/16 12:32:34 pagesignals.go:385: PageSignalsScanner.Scan(): opening page_signals/zhwiki-20240501-page_signals.zst
2024/05/16 12:32:36 linemerger.go:73: LineMerger: scanner "pageviews/pageviews-2023-W44.zst" failed, err=unexpected EOF
2024/05/16 12:32:36 itemsignals.go:209: ItemSignalsJoiner: read 3373902078 lines
2024/05/16 12:32:36 itemsignals.go:211: LineMerger failed: unexpected EOF
2024/05/16 12:32:36 itemsignals.go:218: BuildItemSignals(): start sorting
2024/05/16 12:32:36 main.go:63: ComputeQRank failed: unexpected EOF
brawer commented 4 months ago

Logs from running the pipeline at commit b5537d75d6bffc95fe8ff5b6d387fc32249fcc70, which made logging even more verbose:

2024/05/16 13:23:39 pagesignals.go:391: PageSignalsScanner.Scan(): opening page_signals/zh_yuewiki-20240501-page_signals.zst
2024/05/16 13:23:42 pagesignals.go:391: PageSignalsScanner.Scan(): opening page_signals/zhwikibooks-20240501-page_signals.zst
2024/05/16 13:23:43 pagesignals.go:391: PageSignalsScanner.Scan(): opening page_signals/zhwikinews-20240501-page_signals.zst
2024/05/16 13:23:43 pagesignals.go:391: PageSignalsScanner.Scan(): opening page_signals/zhwiki-20240501-page_signals.zst
2024/05/16 13:23:44 linemerger.go:73: LineMerger: scanner "pageviews/pageviews-2024-W03.zst" failed, err=unexpected EOF
2024/05/16 13:23:44 itemsignals.go:210: ItemSignalsJoiner: read 3373055244 lines
2024/05/16 13:23:44 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/16 13:23:44 itemsignals.go:219: BuildItemSignals(): start sorting
2024/05/16 13:23:44 itemsignals.go:224: BuildItemSignals(): sorting canceled, groupCtx.Err()=context canceled
2024/05/16 13:23:44 itemsignals.go:243: BuildItemSignals(): group.Wait() failed, err==unexpected EOF
2024/05/16 13:23:44 main.go:63: ComputeQRank failed: unexpected EOF
brawer commented 4 months ago

Some observations from staring at the logs:

2024-05-11 12:58      1350153  s3://qrank/page_signals/zh_yuewiki-20240501-page_signals.zst
2024-05-11 13:19     16340759  s3://qrank/page_signals/zhwiki-20240501-page_signals.zst
2024-05-11 13:16        25961  s3://qrank/page_signals/zhwikibooks-20240501-page_signals.zst
2024-05-11 13:10       171748  s3://qrank/page_signals/zhwikinews-20240501-page_signals.zst
2024-05-11 13:18        63711  s3://qrank/page_signals/zhwikiquote-20240501-page_signals.zst
2024-05-11 13:11      2419930  s3://qrank/page_signals/zhwikisource-20240501-page_signals.zst
2024-05-11 13:21        27910  s3://qrank/page_signals/zhwikiversity-20240501-page_signals.zst
2024-05-11 13:03        59625  s3://qrank/page_signals/zhwikivoyage-20240501-page_signals.zst
2024-05-11 12:59       973478  s3://qrank/page_signals/zhwiktionary-20240501-page_signals.zst
2024-05-11 12:59        95906  s3://qrank/page_signals/zuwiki-20240501-page_signals.zst
2024-05-11 12:57          103  s3://qrank/page_signals/zuwikibooks-20240501-page_signals.zst
2024-05-11 13:18          523  s3://qrank/page_signals/zuwiktionary-20240501-page_signals.zst
brawer commented 4 months ago

In our processing order, zh.wikipedia is the 975th project whose page signals are being read from S3 storage. Afterwards, there would still be a few more: zh.wikiquote, zh.wikisource, zh.wikiversity, zh.wikivoyage, zh.wiktionary, zu.wikibooks, zu.wikipedia and zu.wiktionary.

However, there seems nothing special about s3://qrank/page_signals/zhwiki-20240501-page_signals.zst. The lines in that file have the same structure as in all other files, confirmed by egrep:

$ s3cmd get s3://qrank/page_signals/zhwiki-20240501-page_signals.zst
$ zstdcat zhwiki-20240501-page_signals.zst | egrep -v "\d+,Q\d+,\d*" | wc -l
  0

Some thoughts:

brawer commented 4 months ago

The pipeline also fails after a hack that restricts the input to just one single pageviews file (in addition to page signals):

2024/05/16 22:49:12 linemerger.go:73: LineMerger: scanner "pageviews/pageviews-2024-W19.zst" failed, err=unexpected EOF
2024/05/16 22:49:12 itemsignals.go:217: ItemSignalsJoiner: read 271843468 lines
2024/05/16 22:49:12 itemsignals.go:219: LineMerger failed: unexpected EOF
2024/05/16 22:49:12 itemsignals.go:226: BuildItemSignals(): start sorting
2024/05/16 22:49:12 itemsignals.go:231: BuildItemSignals(): sorting canceled, groupCtx.Err()=context canceled
2024/05/16 22:49:12 itemsignals.go:250: BuildItemSignals(): group.Wait() failed, err==unexpected EOF
2024/05/16 22:49:12 main.go:63: ComputeQRank failed: unexpected EOF
brawer commented 4 months ago

Interestingly, there’s no hiccups if the pipeline ignores everything but zh.wikipedia.org and zh.wikiquote.org as in commit 6abd16124aedfa50da4cefcbe858b894c52dd86b.

2024/05/17 12:10:03 main.go:47: qrank-builder starting up
2024/05/17 12:10:09 build.go:23: found wikimedia dumps for 984 sites
2024/05/17 12:10:09 itemsignals.go:155: building public/item_signals-20240512.csv.zst
2024/05/17 12:10:09 pagesignals.go:356: NewPageSignalsScanner: paths[0]="page_signals/zhwiki-20240501-page_signals.zst", domains[0]="zh.wikipedia"
2024/05/17 12:10:09 pagesignals.go:356: NewPageSignalsScanner: paths[1]="page_signals/zhwikiquote-20240501-page_signals.zst", domains[1]="zh.wikiquote"
2024/05/17 12:10:09 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/zhwiki-20240501-page_signals.zst, curDomain=0, domain="zh.wikipedia"
2024/05/17 12:10:10 itemsignals.go:212: BuildItemSignals(): merging signals from 2 files; #0=PageSignalsScanner; rest=pageviews
2024/05/17 12:10:34 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/zhwikiquote-20240501-page_signals.zst, curDomain=1, domain="zh.wikiquote"
2024/05/17 12:10:34 pagesignals.go:394: PageSignalsScanner.Scan(): finished last domain
2024/05/17 12:10:34 pagesignals.go:423: PageSignalsScanner.Scan(): cleaning up
2024/05/17 12:10:34 itemsignals.go:227: ItemSignalsJoiner: read 74194066 lines
2024/05/17 12:10:34 itemsignals.go:236: BuildItemSignals(): start sorting
2024/05/17 12:10:42 itemsignals.go:267: BuildItemSignals(): finished sorting
2024/05/17 12:10:42 main.go:68: qrank-builder exiting
brawer commented 4 months ago

The current pipeline at bd246b2fa8cec8a22a81a1db3bf56b80a91cf2b5 now stress-tests LineMerger without creating any output. It seems to run fine. But to be really sure, I’ve scheduled hourly pipeline runs, and will just watch them for a couple days. Once we have logs from 100 runs or so, and they’ve all passed, we can look further. Another option might be to skip zh.wikipedia.org in production builds, but it would be really nice to avoid such denylists and instead process all the data.

2024/05/17 15:02:27 main.go:47: qrank-builder starting up
2024/05/17 15:02:33 build.go:23: found wikimedia dumps for 984 sites
2024/05/17 15:02:33 itemsignals.go:155: building public/item_signals-20240512.csv.zst
2024/05/17 15:02:33 pagesignals.go:356: NewPageSignalsScanner: paths[0]="page_signals/aawikibooks-20240501-page_signals.zst", domains[0]="aa.wikibooks"
2024/05/17 15:02:33 pagesignals.go:356: NewPageSignalsScanner: paths[1]="page_signals/aawiki-20240501-page_signals.zst", domains[1]="aa.wikipedia"
2024/05/17 15:02:33 pagesignals.go:356: NewPageSignalsScanner: paths[2]="page_signals/aawiktionary-20240501-page_signals.zst", domains[2]="aa.wiktionary"
2024/05/17 15:02:33 pagesignals.go:356: NewPageSignalsScanner: paths[3]="page_signals/abwiki-20240501-page_signals.zst", domains[3]="ab.wikipedia"
2024/05/17 15:02:33 pagesignals.go:356: NewPageSignalsScanner: paths[975]="page_signals/zhwiki-20240501-page_signals.zst", domains[975]="zh.wikipedia"
2024/05/17 15:02:33 pagesignals.go:356: NewPageSignalsScanner: paths[976]="page_signals/zhwikiquote-20240501-page_signals.zst", domains[976]="zh.wikiquote"
2024/05/17 15:02:33 pagesignals.go:356: NewPageSignalsScanner: paths[977]="page_signals/zhwikisource-20240501-page_signals.zst", domains[977]="zh.wikisource"
2024/05/17 15:02:33 pagesignals.go:356: NewPageSignalsScanner: paths[978]="page_signals/zhwikiversity-20240501-page_signals.zst", domains[978]="zh.wikiversity"
2024/05/17 15:02:33 pagesignals.go:356: NewPageSignalsScanner: paths[979]="page_signals/zhwikivoyage-20240501-page_signals.zst", domains[979]="zh.wikivoyage"
2024/05/17 15:02:33 pagesignals.go:356: NewPageSignalsScanner: paths[980]="page_signals/zhwiktionary-20240501-page_signals.zst", domains[980]="zh.wiktionary"
2024/05/17 15:02:33 pagesignals.go:356: NewPageSignalsScanner: paths[981]="page_signals/zuwikibooks-20240501-page_signals.zst", domains[981]="zu.wikibooks"
2024/05/17 15:02:33 pagesignals.go:356: NewPageSignalsScanner: paths[982]="page_signals/zuwiki-20240501-page_signals.zst", domains[982]="zu.wikipedia"
2024/05/17 15:02:33 pagesignals.go:356: NewPageSignalsScanner: paths[983]="page_signals/zuwiktionary-20240501-page_signals.zst", domains[983]="zu.wiktionary"
2024/05/17 15:02:33 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/aawikibooks-20240501-page_signals.zst, curDomain=0, domain="aa.wikibooks"
2024/05/17 15:02:39 itemsignals.go:192: BuildItemSignals(): start testing LineMerger
2024/05/17 15:02:39 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/aawiki-20240501-page_signals.zst, curDomain=1, domain="aa.wikipedia"
2024/05/17 15:02:39 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/aawiktionary-20240501-page_signals.zst, curDomain=2, domain="aa.wiktionary"
2024/05/17 15:02:40 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/abwiki-20240501-page_signals.zst, curDomain=3, domain="ab.wikipedia"
2024/05/17 15:02:40 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/abwiktionary-20240501-page_signals.zst, curDomain=4, domain="ab.wiktionary"
2024/05/17 15:02:40 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/acewiki-20240501-page_signals.zst, curDomain=5, domain="ace.wikipedia"
[...]
2024/05/17 15:22:22 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/zh_yuewiki-20240501-page_signals.zst, curDomain=972, domain="zh-yue.wikipedia"
2024/05/17 15:22:24 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/zhwikibooks-20240501-page_signals.zst, curDomain=973, domain="zh.wikibooks"
2024/05/17 15:22:24 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/zhwikinews-20240501-page_signals.zst, curDomain=974, domain="zh.wikinews"
2024/05/17 15:22:25 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/zhwiki-20240501-page_signals.zst, curDomain=975, domain="zh.wikipedia"
2024/05/17 15:22:52 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/zhwikiquote-20240501-page_signals.zst, curDomain=976, domain="zh.wikiquote"
2024/05/17 15:22:52 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/zhwikisource-20240501-page_signals.zst, curDomain=977, domain="zh.wikisource"
2024/05/17 15:22:55 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/zhwikiversity-20240501-page_signals.zst, curDomain=978, domain="zh.wikiversity"
2024/05/17 15:22:55 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/zhwikivoyage-20240501-page_signals.zst, curDomain=979, domain="zh.wikivoyage"
2024/05/17 15:22:56 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/zhwiktionary-20240501-page_signals.zst, curDomain=980, domain="zh.wiktionary"
2024/05/17 15:23:01 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/zuwikibooks-20240501-page_signals.zst, curDomain=981, domain="zu.wikibooks"
2024/05/17 15:23:02 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/zuwiki-20240501-page_signals.zst, curDomain=982, domain="zu.wikipedia"
2024/05/17 15:23:02 pagesignals.go:400: PageSignalsScanner.Scan(): opening page_signals/zuwiktionary-20240501-page_signals.zst, curDomain=983, domain="zu.wiktionary"
2024/05/17 15:23:02 pagesignals.go:394: PageSignalsScanner.Scan(): finished last domain
2024/05/17 15:23:02 pagesignals.go:423: PageSignalsScanner.Scan(): cleaning up
2024/05/17 15:23:02 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/17 15:23:02 main.go:68: qrank-builder exiting
brawer commented 4 months ago

Currently, the pipeline (built at commit bd246b2fa8cec8a22a81a1db3bf56b80a91cf2b5) just counts the number of lines returned by LineMerger. Despite the trivial implementation, we still see failures. Interestingly, the failures are intermittent: we always process the same data with the same code, but some of the hourly runs succeed while others fail.

It’s tempting to blame Wikimedia’s datacenter for being flaky, but it seems very unlikely that intermittent network problems would always manifest while reading internal data files for zh.wikipedia.org.

Looking for potential sources of non-determinism, one thing comes to mind: To merge 52 weekly pageview counts with other page signals, LineMerger operates on a min-heap of 53 input streams. In the current implementation, these are partially ordered. If a Wikipedia page happens to have the same pageview counts in multiple weeks, any one of those weekly files may become the next top of heap. While this does not affect the correctness of the computed output (the returned lines are in correct order, as verified) it makes the processing order non-determinisitic.

2024/05/17 17:12:46 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/17 18:20:51 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/17 19:11:57 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/17 20:33:09 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/17 21:36:52 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/17 22:38:35 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/17 23:29:53 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/18 00:11:36 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/18 04:03:54 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/18 05:11:37 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/18 06:14:58 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
brawer commented 4 months ago

Removed the source of non-determinism, deployed a fresh binary at commit a4e93bf98d21eb4dcf18c78b8cff3cf3cc26d0da, scheduled hourly runs. Let’s see what happens.

brawer commented 4 months ago

We still see intermittent failures: Sometimes it’s fine, sometimes we get an EOF error.

2024/05/18 10:32:36 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/18 11:22:55 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/18 12:25:36 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/18 13:21:17 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/18 14:51:05 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/18 15:37:17 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/18 16:22:00 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/18 17:47:09 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/18 18:21:15 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/18 19:52:18 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/18 20:29:56 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/18 21:41:01 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/18 22:58:14 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/19 00:40:28 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/19 01:21:03 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/19 03:29:30 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/19 04:43:15 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/19 05:22:19 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/19 06:37:20 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/19 07:30:05 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/19 09:29:48 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/19 10:58:33 itemsignals.go:212: LineMerger failed: unexpected EOF
brawer commented 4 months ago

LineMerger now logs its internal state upon error, due to commit ffd719145eb06a2824eaff1d31a12d50eefa1f6a. Last night, it ran seven times with the new code. Three of these runs (same code on same data) were successful, four have failed:

2024/05/19 17:19:59 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/19 19:04:39 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/19 21:12:06 itemsignals.go:215: BuildItemSignals(): finished testing LineMerger, returned 3493549482 lines, 0 of which were mis-ordered
2024/05/20 00:59:11 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/20 02:54:32 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/20 03:25:12 itemsignals.go:212: LineMerger failed: unexpected EOF
2024/05/20 05:18:52 itemsignals.go:212: LineMerger failed: unexpected EOF

As observed before, the unexpected EOF errors happened non-deterministically at different offsets in different files. However, if an error occurs, it’s always while processing the Chinese Wikipedia.

2024/05/20 00:59:11 linemerger.go:82: LineMerger: scanner "pageviews/pageviews-2024-W12.zst" failed, lastLine="zh.wikipedia,2002412,3", err=unexpected EOF
2024/05/20 02:54:32 linemerger.go:82: LineMerger: scanner "pageviews/pageviews-2023-W45.zst" failed, lastLine="zh.wikipedia,2375954,1", err=unexpected EOF
2024/05/20 03:25:12 linemerger.go:82: LineMerger: scanner "pageviews/pageviews-2023-W48.zst" failed, lastLine="zh.wikipedia,2790350,1", err=unexpected EOF
2024/05/20 05:18:52 linemerger.go:82: LineMerger: scanner "pageviews/pageviews-2024-W13.zst" failed, lastLine="zh.wikipedia,2901336,3", err=unexpected EOF

Here’s when these jobs have started up:

2024/05/19 16:57:26 main.go:47: qrank-builder starting up                       
2024/05/19 18:41:58 main.go:47: qrank-builder starting up                       
2024/05/19 20:49:31 main.go:47: qrank-builder starting up                       
2024/05/20 00:33:43 main.go:47: qrank-builder starting up                       
2024/05/20 02:28:38 main.go:47: qrank-builder starting up                       
2024/05/20 02:59:02 main.go:47: qrank-builder starting up                       
2024/05/20 04:53:31 main.go:47: qrank-builder starting up                       

Looking at how long it took from startup to either success or failure, there seems to be an interesting pattern here: the successful jobs are faster than the failures. This smells heavlily like some component is setting a timeout on how long the input stream may stay open. And instead of reporting the timeout as an error, it just closes the stream??? How nasty.

Duration Status
00:22:33 OK
00:22:41 OK
00:22:35 OK
00:25:28 Failure
00:25:54 Failure
00:26:10 Failure
00:25:21 Failure

Next step: Analyze the runtimes of all previous runs.

brawer commented 4 months ago

Hm, it’s not really a consistent picture:

Termination Runtime Status
2024/05/14 08:52:43 00:30:09 Failure
2024/05/14 13:19:21 00:31:08 Failure
2024/05/14 17:04:17 00:29:29 Failure
2024/05/15 12:40:04 00:29:03 Failure
2024/05/15 14:52:23 00:30:01 Failure
2024/05/16 09:39:15 00:32:15 Failure
2024/05/16 11:52:17 00:33:19 Failure
2024/05/16 12:32:36 00:31:51 Failure
2024/05/16 13:23:44 00:32:34 Failure
2024/05/16 15:06:50 00:32:41 Failure
2024/05/17 14:41:38 00:22:22 Failure
2024/05/17 15:23:02 00:20:35 OK
2024/05/17 17:12:46 00:22:14 OK
2024/05/17 18:20:51 00:21:56 Failure
2024/05/17 19:11:57 00:22:08 Failure
2024/05/17 20:33:09 00:21:53 OK
2024/05/17 21:36:52 00:21:52 Failure
2024/05/17 22:38:35 00:21:56 OK
2024/05/17 23:29:53 00:22:11 OK
2024/05/18 00:11:36 00:22:14 OK
2024/05/18 04:03:54 00:22:21 Failure
2024/05/18 05:11:37 00:22:01 OK
2024/05/18 06:14:58 00:21:59 OK
2024/05/18 10:32:36 00:24:36 Failure
2024/05/18 11:22:55 00:24:28 OK
2024/05/18 12:25:36 00:24:17 OK
2024/05/18 13:21:17 00:24:03 Failure
2024/05/18 14:51:05 00:24:14 Failure
2024/05/18 15:37:17 00:24:18 Failure
2024/05/18 16:22:00 00:24:27 Failure
2024/05/18 17:47:09 00:24:10 Failure
2024/05/18 18:21:15 00:24:26 Failure
2024/05/18 19:52:18 00:24:31 OK
2024/05/18 20:29:56 00:24:26 OK
2024/05/18 21:41:01 00:24:26 OK
2024/05/18 22:58:14 00:24:20 OK
2024/05/19 00:40:28 00:23:58 Failure
2024/05/19 01:21:03 00:24:20 OK
2024/05/19 03:29:30 00:24:13 Failure
2024/05/19 04:43:15 00:24:48 Failure
2024/05/19 05:22:19 00:24:26 OK
2024/05/19 06:37:20 00:24:08 Failure
2024/05/19 07:30:05 00:24:26 Failure
2024/05/19 09:29:48 00:23:58 Failure
2024/05/19 17:19:59 00:22:33 OK
2024/05/19 19:04:39 00:22:41 OK
2024/05/19 21:12:06 00:22:35 OK
2024/05/20 00:59:11 00:25:28 Failure
2024/05/20 02:54:32 00:25:54 Failure
2024/05/20 03:25:12 00:26:10 Failure
2024/05/20 05:18:52 00:25:21 Failure
2024/05/20 07:10:57 00:43:09 Failure
brawer commented 4 months ago

Commit 3081f6ee1e1b44223f6ceab5f8d6cb2047256e9a downloads all 52 pageviews files to /tmp before processing. A first run in Wikimedia’s datacenter was successful, but that doesn’t mean much given the non-determinism. Let’s wait for the outcome of a couple hourly runs to see if this fixes the problem.

brawer commented 4 months ago

The fix seems to have worked.

brawer commented 4 months ago

It generally seems to work now, but we still see occasional network failures. Not sure if we can do anything here. We could implement exponential back-off, but it would be complicated. In case of failure, we already re-try a day later, which may be good enough as a work-around for the flakiness of Wikimedia’s datacenter.

2024/05/23 12:48:23 main.go:47: qrank-builder starting up
2024/05/23 12:48:31 build.go:23: found wikimedia dumps for 991 sites
2024/05/23 12:48:31 itemsignals.go:156: building public/item_signals-20240520.csv.zst
2024/05/23 12:54:40 pagesignals.go:381: PageSignalsScanner.Scan(): failed, domain=commons.wikimedia, err=read tcp 192.168.13.145:42564->185.15.56.161:443: read: connection reset by peer
2024/05/23 12:54:40 pagesignals.go:413: PageSignalsScanner.Scan(): cleaning up
2024/05/23 12:54:40 itemsignals.go:231: LineMerger failed: read tcp 192.168.13.145:42564->185.15.56.161:443: read: connection reset by peer
2024/05/23 12:54:40 itemsignals.go:242: BuildItemSignals(): sorting canceled, groupCtx.Err()=context canceled
2024/05/23 12:54:40 itemsignals.go:262: BuildItemSignals(): group.Wait() failed, err==read tcp 192.168.13.145:42564->185.15.56.161:443: read: connection reset by peer
2024/05/23 12:54:40 main.go:63: ComputeQRank failed: read tcp 192.168.13.145:42564->185.15.56.161:443: read: connection reset by peer
brawer commented 4 months ago

Several runs have finished. Closing this bug as fixed.