derekbruening opened 5 years ago
Happened again: https://ci.appveyor.com/project/DynamoRIO/dynamorio/builds/22313260
[00:06:34] test 213
[00:06:34] Start 213: code_api|tool.reuse_distance
[00:06:34]
[00:06:34] 213: Test command: C:\projects\dynamorio\build\build_debug-internal-32\bin32\drrun.exe "-s" "90" "-quiet" "-debug" "-use_dll" "C:/projects/dynamorio/build/build_debug-internal-32/lib32/debug/dynamorio.dll" "-exit0" "-stderr_mask" "0xC" "-msgbox_mask" "0" "-dumpcore_mask" "0x7d" "-staged" "-code_api" "-t" "drcachesim" "-ipc_name" "drtestpipe_reuse_distance" "-simulator_type" "reuse_distance" "-reuse_distance_threshold" "256" "--" "C:/projects/dynamorio/build/build_debug-internal-32/suite/tests/bin/simple_app.exe"
[00:06:34] 213: Test timeout computed to be: 600
[00:06:34] 213: Hello, world!
[00:06:34] 213: Assertion failed: type_is_instr(cur_ref.instr.type), file ..\..\clients\drcachesim\reader\reader.cpp, line 132
This same assert just happened in code_api|tool.drcachesim.warmup-zeros
https://ci.appveyor.com/project/DynamoRIO/dynamorio/builds/23205784
[00:20:18] Hello, world!
[00:20:18] Assertion failed: type_is_instr(cur_ref.instr.type), file ..\..\clients\drcachesim\reader\reader.cpp, line 132
warmup-zeros failed again with that assert: https://ci.appveyor.com/project/DynamoRIO/dynamorio/builds/23571201
Happened on the missfile test: https://ci.appveyor.com/project/DynamoRIO/dynamorio/builds/23634836
Happened on reuse_distance: https://ci.appveyor.com/project/DynamoRIO/dynamorio/builds/25011472
On missfile again: https://ci.appveyor.com/project/DynamoRIO/dynamorio/builds/25102162
On TLB-threads again: https://ci.appveyor.com/project/DynamoRIO/dynamorio/builds/25727905
Happened in tool.drcachesim.coherence with a timeout being reported: #3803.
On tool.drcachesim.coherence: https://ci.appveyor.com/project/DynamoRIO/dynamorio/builds/27243222
Keeps happening, this time on tool.drcachesim.TLB-simple: https://ci.appveyor.com/project/DynamoRIO/dynamorio/builds/27705908
From code inspection I think the problem is that type_is_instr deliberately excludes TRACE_TYPE_INSTR_NO_FETCH.
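If that's right, here is a minimal sketch of the relaxed condition, matching the assert text that shows up in the later failures quoted below (trace_type_t, type_is_instr(), and TRACE_TYPE_INSTR_NO_FETCH are the existing drmemtrace names; the helper function itself is hypothetical):

```cpp
// Sketch only: treat no-fetch instruction records, which type_is_instr()
// deliberately excludes, as acceptable predecessors of a bundle record.
static bool
ok_before_bundle(trace_type_t type)
{
    return type_is_instr(type) || type == TRACE_TYPE_INSTR_NO_FETCH;
}
```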
Well, unfortunately this is still happening, so I only fixed part of the bug or something:
https://ci.appveyor.com/project/DynamoRIO/dynamorio/builds/27867452
Happened in both of these tests:
[00:24:15] Assertion failed: type_is_instr(cur_ref.instr.type) || cur_ref.instr.type == TRACE_TYPE_INSTR_NO_FETCH, file ..\..\clients\drcachesim\reader\reader.cpp, line 133
[00:24:52] 188: Assertion failed: type_is_instr(cur_ref.instr.type) || cur_ref.instr.type == TRACE_TYPE_INSTR_NO_FETCH, file ..\..\clients\drcachesim\reader\reader.cpp, line 133
Happened again on the coherence test: https://ci.appveyor.com/project/DynamoRIO/dynamorio/builds/35100734
Happened again in 32-bit simple-config-file test: https://github.com/DynamoRIO/dynamorio/pull/4757/checks?check_run_id=1990950960
Quoting from #5039 duplicate: code_api|tool.drcachesim.coherence failed in vs2017-32: https://github.com/DynamoRIO/dynamorio/runs/3252621878
2021-08-05T13:30:48.1272424Z Invalid trace entry type 23 before a bundle
2021-08-05T13:30:48.1273979Z Assertion failed: type_is_instr(cur_ref_.instr.type) || cur_ref_.instr.type == TRACE_TYPE_INSTR_NO_FETCH, file D:\a\dynamorio\dynamorio\clients\drcachesim\reader\reader.cpp, line 155
2021-08-05T13:30:48.1276519Z abort() has been called
I've been looking into test failures on the SVE machine we want to add to the CI, and one of the remaining failures is scatter-gather, which hits this issue.
My observations are:
I'm rather suspicious that this is a case of another test interfering with it if they both run at once; that would explain why it is more intermittent elsewhere but more frequent on the longer test.
I don't think anyone has been able to reproduce this, so if you have a local reproduction, could you get the record type that is triggering the assert?
I can give it a shot, but it only fails when run as part of the CI suite. It may not reproduce if I modify the run parameters.
Here's the full assert, which looks like it includes the record type:
Invalid trace entry type thread_exit (23) before a bundle
236: drcachesim: /home/ubuntu/dynamorio/clients/drcachesim/reader/reader.cpp:238: virtual bool dynamorio::drmemtrace::reader_t::process_input_entry(): Assertion `type_is_instr(cur_ref_.instr.type) || cur_ref_.instr.type == TRACE_TYPE_INSTR_NO_FETCH' failed.
Hey, do you need any more information on this?
Oh, I think I added that printout to help diagnose this on GA CI. Not sure I understand the thread exit followed by a bundle though: are they different threads?
There aren't any threads involved in this particular application, it's a single threaded test that just runs through all the scatter/gather instructions.
This may be the same as #4167. Closing that in favor of this.
For online we have multiple threads interleaved on the same pipe, with thread headers identifying whose data the subsequent records belong to. We split large buffers up to be below the max atomic pipe size and insert headers for each piece. One theory for the multi-threaded cases here is that the atomic split has some problem and the reader sees mixed data for two different threads. (An example is tool.drcachesim.multiproc, which hits this in post-merge runs: the assert doesn't fire but the diagnostic message prints, failing the output regex, roughly 2x in the last 20 runs or so. Although I should check: maybe that test has a single thread in each of two processes?) But the single-threaded cases hitting this seem to break that theory.
Another theory is that the scheduler is inserting a synthesized thread exit record: but that only happens for regions-of-interest.
As for other theories involving out-of-order pipe buffers: the buffer split only splits right before a TRACE_TYPE_INSTR entry, so it shouldn't split between that instruction and an instr bundle.
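To make the splitting scheme concrete, here is a hypothetical sketch of what I mean (every identifier below is made up for illustration and is not the real drmemtrace tracer code): a large per-thread buffer is cut into pieces that each fit within the pipe's atomic write size, each piece is preceded by a thread-id header so the reader can demultiplex interleaved threads, and split points are only placed right before an instruction entry so a bundle is never separated from the instruction it follows.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record layout and types for the sketch only.
enum entry_type_t : uint16_t { kInstr, kBundle, kMemref, kThreadHeader };

struct entry_t {
    uint16_t type;
    uint16_t size;
    uint64_t addr;
};

constexpr size_t kAtomicPipeBytes = 4096; // assumed atomic pipe write size
constexpr size_t kMaxEntriesPerPiece = kAtomicPipeBytes / sizeof(entry_t);

void
write_split(const std::vector<entry_t> &buf, uint64_t tid,
            void (*atomic_pipe_write)(const std::vector<entry_t> &piece))
{
    size_t start = 0;
    while (start < buf.size()) {
        // Reserve one slot for the per-piece thread header.
        size_t end = std::min(buf.size(), start + kMaxEntriesPerPiece - 1);
        // Move the cut back so it lands right before an instruction entry,
        // keeping any bundle attached to the instruction it belongs to.
        while (end < buf.size() && end > start + 1 && buf[end].type != kInstr)
            --end;
        std::vector<entry_t> piece;
        entry_t header;
        header.type = kThreadHeader;
        header.size = sizeof(uint64_t);
        header.addr = tid;
        piece.push_back(header);
        piece.insert(piece.end(), buf.begin() + start, buf.begin() + end);
        atomic_pipe_write(piece); // one atomic write per piece
        start = end;
    }
}
```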
Re: https://github.com/DynamoRIO/dynamorio/issues/3320#issuecomment-1904156338 theorizing that this involves interference between tests: two tests with the same named pipe? That is promising. Online tests are supposed to pass -ipc_name with a unique name. I think I do see some tests missing that.
Actually there aren't any missing:
$ grep '[ /]drcachesim' suite/tests/CTestTestfile.cmake | egrep -v 'offline|ipc_name|indir|infile' | cut -c 1-40
add_test(drmemtrace_proj "/usr/bin/ctest
add_test(code_api|tool.drpt2trace.sideba
add_test(code_api|tool.drpt2trace.elf "/
That shows only drpt2trace and the drmemtrace_proj build-and-test target. Besides, any tests missing -ipc_name would only collide with each other, not with the ones we're seeing fail.
The pipe names specified do seem to all be unique, except there are 2 instances of drtestpipe_simple:
$ grep -o 'ipc_name" "[^ ]*"' suite/tests/CTestTestfile.cmake | sort | uniq -c
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_allasm-repstr-basic-counts"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_allasm-scattergather-basic-counts"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_basic_counts"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_builtin-prefetch-basic-counts"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_coherence"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_core_on_disk"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_core_on_disk_schedule"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_core_serial"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_core_sharded"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_counts_only_thread"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_delay-global"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_delay-simple"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_drstatecmp-delay-simple"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_filter-asm"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_filter-d"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_filter-i"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_filter-no-d"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_filter-no-i"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_filter-simple"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_histogram"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_instr-only-trace"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_invariants"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_miss_analyzer"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_missfile"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_missfile-config-file"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_multiproc"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_opcode_mix"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_reuse_distance"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_reuse_offline"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_reuse_offline_threads"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_reuse_time_offline"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_scattergather-x86"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_schedule_stats_nopreempt"
2 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_simple"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_simple-config-file"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_simulate_as_traced"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_switch_insertion"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_syscall-mix"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_threads"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_threads-with-config-file"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_TLB-simple"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_TLB-threads"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_warmup-valid"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_warmup-zeros"
1 ipc_name" "...build_x64_dbg_tests/suite/tests/drtestpipe_windows-simple"
Aha, it's two different-option instances of the same test:
code_api|tool.drcachesim.simple
code_api,satisfy_w_xor_x|tool.drcachesim.simple
Hmm, we do see more failures in the post-merge long suite, which might have more parallel-option tests. Have all failures been in the long suite? I don't think so: I assume the SVE ones above are from a pre-merge suite?
Plus, if I set TEST_LONG=ON I still only see the drcachesim.simple collision. The suite/tests/drtestpipe_scattergather-aarch64 pipe seems to be unique on aarch64 too.
Re: the two drcachesim.simple tests having the same -ipc_name: we already set _self_serial for all drcachesim tests, so they will not run in parallel.
I've done quite a bit of (very frustrating) investigation on this issue. It seems to only occur when run as part of a suite of tests. I've run these tests together 100s of times and never had an issue when they're run via ctest on their own or only with other drcachesim tests.
It seems not to be a failure in any specific drcachesim test, but some seem more prone to it than others.
Here are the rough statistics I gathered on the failure rates (in percent):
{
"debug": {
"linux.fib-conflict": 100,
"linux.fib-conflict-early": 100,
"api.static_sideline_FLAKY": 100,
"api.static_signal": 3,
"tool.drcacheoff.burst_threads_counts": 3,
"client.detach_test": 2,
"5:tool.scheduler.unit_tests": 1,
"linux.fork": 1,
"tool.drcachesim.scattergather-aarch64": 1,
"drcachesim_general": 3,
"tool.drcachesim.threads-with-config-file": 1,
"tool.drcachesim.threads": 1
},
"release": {
"tool.drcachesim.scattergather-aarch64": 2,
"tool.drcacheoff.burst_threads_counts": 4,
"drcachesim_general": 21,
"tool.drcachesim.threads-with-config-file": 7,
"tool.drcachesim.threads": 6,
"tool.drcachesim.coherence": 4,
"tool.drcachesim.miss_analyzer": 2,
"tool.drcachesim.drstatecmp-fuzz": 1,
"tool.drcachesim.TLB-threads": 1
}
}
The particularly interesting thing about the above results is that they were gathered on a non-SVE machine, so the scattergather test is basically a no-op: it sets up the app and then exits without doing any work.
On the a64 Jenkins machine in a release build I may be able to reproduce it by running with other tests. The final run didn't fail, so I don't have the logs to confirm it is this failure mode, but that's almost certainly what it is, as we have no other open bugs on these tests. Here it failed in about one out of 11 runs:
derek@dynamorio:~/dr/build_rel$ for ((i=0; i<500; i++)); do echo $i; ctest -j 13 -R drcachesim -E statecmp; done > OUT 2>&1
derek@dynamorio:~/dr/build_rel$ grep -c '(Failed' OUT
40
derek@dynamorio:~/dr/build_rel$ grep '(Failed' OUT | sort | uniq -c
8 220 - code_api|tool.drcachesim.threads (Failed)
6 221 - code_api|tool.drcachesim.threads-with-config-file (Failed)
10 222 - code_api|tool.drcachesim.coherence (Failed)
13 223 - code_api|tool.drcachesim.TLB-threads (Failed)
1 227 - code_api|tool.drcachesim.multiproc (Failed)
2 231 - code_api|tool.drcachesim.scattergather-aarch64 (Failed)
I put in a circular buffer to see the history of records at the error point:
ZZ Invalid trace entry type thread_exit (23 tid=709330 vs cur tid=709330) before a bundle
ZZ #0: 10 4 281473369425176 instr
ZZ #1: 13 4 281473369425180 cond jmp
ZZ #2: 10 4 281473369425184 instr
ZZ #3: 14 4 281473369425188 call
ZZ #4: 10 4 281473369874768 instr
ZZ #5: 17 1 6678055395111534340 bundle
ZZ #6: 10 4 281473369874776 instr
ZZ #7: 0 8 281473370701328 read
ZZ #8: 10 4 281473369874780 instr
ZZ #9: 17 2 6677609268268499972 bundle
ZZ #10: 17 2 6677609268268499972 <bundle 2nd iter>
ZZ #11: 10 4 281473369874792 instr
ZZ #12: 28 2 13356146762348358 timestamp
ZZ #13: 28 3 57 cpuid
ZZ #14: 28 25 94 syscall 94 == SYS_exit_group
ZZ #15: 23 4 709330 thread exit
ZZ next: 17 1 6677650603033820932 == 0x5cabc7d80000ff04
So we have a normal exit with a bundle record suddenly appearing afterward.
Further experiments show the bundle belongs to this same exiting thread. It belongs to a basic block that is not directly adjacent to the exiting block.
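For reference, the record-history instrumentation is roughly the following (a sketch with made-up names, not the actual patch): keep the last 16 records seen on the reader side and dump them, oldest to newest, when the bad record shows up before a bundle.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

struct history_entry_t { // stand-in for the real trace_entry_t fields printed above
    uint16_t type;
    uint16_t size;
    uint64_t addr;
};

class record_history_t {
public:
    void
    record(const history_entry_t &entry)
    {
        buf_[next_ % kSize] = entry;
        ++next_;
        if (count_ < kSize)
            ++count_;
    }
    void
    dump() const // produces the "ZZ #N: type size addr" lines shown above
    {
        size_t start = (count_ < kSize) ? 0 : next_ % kSize;
        for (size_t i = 0; i < count_; ++i) {
            const history_entry_t &e = buf_[(start + i) % kSize];
            fprintf(stderr, "ZZ #%zu: %u %u %llu\n", i, (unsigned)e.type,
                    (unsigned)e.size, (unsigned long long)e.addr);
        }
    }

private:
    static constexpr size_t kSize = 16; // matches the 16 entries in the dumps
    history_entry_t buf_[kSize] = {};
    size_t next_ = 0;
    size_t count_ = 0;
};
```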
Making the pipe buffer size considered atomic 512 instead of 4K: the error still happens.
Making the pipe buffer size considered atomic 80K, so there is no manual splitting: the error shows up earlier, with a cpuid marker before a bundle.
Still not clear what is going on. Data arriving out of order over the pipe only when the kernel is under stress?!?
Could it be a problem with the queue on the reader side? I put a printout there and no data is ever read from the queue here. Maybe that is not surprising: it's only used with skipping.
To see if the buffers are messed up at writing time, I added a check in atomic_pipe_write() for a bundle type appearing after a non-instr, non-bundle record: it loops over the buffer and checks each entry. It never fires when the problem is reproduced, so the buffer manipulation seems to have no error; whatever goes wrong happens somewhere between the write to the named pipe and the read. The ipc_reader_t + reader_t layers are pretty thin: I already ruled out the queue, which doesn't leave much besides the kernel as the culprit.
Yet, would the original Windows instances be the same underlying culprit if this were some kernel issue? That seems unlikely.
I tried to repro on an x86 machine with a different kernel: it reproduces there too. Yet we've never seen it in the x86 GA CI VMs (have we?).
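Going back to the atomic_pipe_write() check mentioned above, it is roughly this (a sketch: the helper name is made up, while trace_entry_t, type_is_instr(), trace_type_t, TRACE_TYPE_INSTR_BUNDLE, and TRACE_TYPE_INSTR_NO_FETCH are the existing drmemtrace names, assumed to come from trace_entry.h):

```cpp
#include <cstddef>
#include <cstdio>

#include "trace_entry.h" // drmemtrace record definitions (assumed include)

// Scan a buffer about to be written to the pipe and flag any bundle whose
// predecessor is not an instruction, a no-fetch instruction, or another
// bundle piece (roughly the inverse of the reader-side assert condition).
static bool
buffer_has_misplaced_bundle(const trace_entry_t *buf, size_t count)
{
    for (size_t i = 1; i < count; ++i) {
        if (buf[i].type != TRACE_TYPE_INSTR_BUNDLE)
            continue;
        unsigned short prev = buf[i - 1].type;
        bool prev_ok = type_is_instr(static_cast<trace_type_t>(prev)) ||
            prev == TRACE_TYPE_INSTR_NO_FETCH || // excluded by type_is_instr()
            prev == TRACE_TYPE_INSTR_BUNDLE;     // multi-piece bundles
        if (!prev_ok) {
            fprintf(stderr, "ZZ misplaced bundle at index %zu after type %d\n", i,
                    (int)prev);
            return true;
        }
    }
    return false;
}
```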
The error on a marker with the 80K-atomic setting where we don't split the buffers looks like this:
ZZ Invalid trace entry type marker (28 tid=2654310 vs cur tid=2654310) before a bundle
ZZ #0: 17 2 0x91734aa800000404
ZZ #1: 10 4 0xffff91734b1c
ZZ #2: 1 8 0xffff8de18640
ZZ #3: 10 4 0xffff91734b20
ZZ #4: 0 4 0xffff90f23998
ZZ #5: 10 4 0xffff91734b24
ZZ #6: 10 4 0xffff91734b28
ZZ #7: 1 4 0xffff90f23998
ZZ #8: 10 4 0xffff91734b2c
ZZ #9: 0 4 0xffff90f23998
ZZ #10: 10 4 0xffff91734b30
ZZ #11: 0 4 0xffff90f2399c
ZZ #12: 10 4 0xffff91734b34
ZZ #13: 13 4 0xffff91734b38
ZZ #14: 28 2 0x2f73a4aeda72b6
ZZ #15: 28 3 0x9
ZZ next: 17 1 0x91734b4c00000004 91734b4c is the tag for this bundle
ZZ next: 10 4 0xffff91734b9c
ZZ next: 0 8 0xffff8de10d98
ZZ next: 10 4 0xffff91734ba0
ZZ next: 17 2 0x91734b4c00000404 91734b4c again so further in same block
ZZ next: 17 2 0x91734b4c00000404 2nd part of bundle
ZZ next: 10 4 0xffff91734bac
ZZ next: 0 8 0xffff91746038
ZZ next: 10 4 0xffff91734bb0
ZZ next: 0 4 0xffff9072299c
ZZ next: 10 4 0xffff91734bb4
ZZ next: 17 2 0x91734b4c00000404 91734b4c again so yet further in same block
ZZ next: 17 2 0x91734b4c00000404 2nd part of bundle
ZZ next: 10 4 0xffff91734bc0
ZZ next: 1 8 0xffff8de18638
ZZ next: 10 4 0xffff91734bc4
ZZ next: 0 4 0xffff90722998
ZZ next: 10 4 0xffff91734bc8
ZZ next: 10 4 0xffff91734bcc
ZZ next: 1 4 0xffff90722998
ZZ next: 10 4 0xffff91734bd0
ZZ next: 0 4 0xffff90722998
ZZ next: 10 4 0xffff91734bd4
ZZ next: 0 4 0xffff9072296c
ZZ next: 10 4 0xffff91734bd8
ZZ next: 13 4 0xffff91734bdc cond branch, presumably backward
ZZ next: 10 4 0xffff91734b4c the real 91734b4c block start
ZZ next: 17 1 0x91734b4c00000004 the real 91734b4c bundle
ZZ next: 10 4 0xffff91734b54
ZZ next: 0 8 0xffff91746038
ZZ next: 10 4 0xffff91734b58 presumably straight-line to the 4b9c up above
ZZ next: 0 4 0xffff9072299c
ZZ pipe_.read() => 196608 hit end of per-record print; this is later; so
ZZ pipe_.read() => 196608 records above were already in the buffer
This run had the atomic_pipe_write() check in place, and it didn't print anything complaining about a misplaced bundle when writing to the pipe: so this was not the actual record sequence that was written.
So the "bad" bundle seems to be part of a sequence that is somehow misplaced: either the data is in the wrong place (which matches the thread exit case), or data in between is missing. The large no-split size might introduce races, but the tests that fail are often single-threaded.
Happened on riscv reuse-distance: Invalid trace entry type thread_exit (23) before a bundle
https://github.com/DynamoRIO/dynamorio/actions/runs/11171065544/job/31055108423?pr=7018
Data point: Another failure, in master:
====> FAILURE in release-external-64 <====
release-external-64: 116 tests passed, **** 1 tests failed, but ignoring 0 for i#5365: ****
code_api|tool.drcachesim.TLB-threads
https://github.com/DynamoRIO/dynamorio/actions/runs/11185116691/job/31097308404
The 32-bit version failed once: https://ci.appveyor.com/project/DynamoRIO/dynamorio/builds/21162519