Open kuhanov opened 2 years ago
Xref #4279
@AssadHashmi is there any ongoing work looking at AArch64 overhead? There seem to be several cases of significant overhead for DR on AArch64 compared to DR on x86: this issue, #4279.
Please run with -prof_pcs and get the category breakdown to see where time is spent. This is a critical step to understand the behavior.
The top functions in the profile image above:
clear_icache: This is used on every piece of new code added to the code cache (not used when flushing the cache AFAIK: so this being slow would be a problem for any large app, not just Java?). Looking at the code, it looks pretty heavyweight. @AssadHashmi -- could this be skipped for brand-new code in a fresh mmap? Xref #1911.
fragment_remove_from_ibt_tables: This would only be slow due to the lock it grabs? -shared_bb_ibt_tables is on by default: try -no_shared_bb_ibt_tables to get rid of the global lock? But I would want to see the whereami breakdown from -prof_pcs first.
- fragment_remove_from_ibt_tables: This would only be slow due to the lock it grabs? -shared_bb_ibt_tables is on by default: try -no_shared_bb_ibt_tables to get rid of the global lock?
A test with -no_shared_bb_ibt_tables took 10m34.431s vs 14m29.876s.
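Worked out explicitly, removing the shared table lock gives roughly a 1.37x speedup on this run; a quick arithmetic check on the two timings reported above (class name here is just illustrative):

```java
// Quick arithmetic on the timings reported above:
// 14m29.876s with the default -shared_bb_ibt_tables,
// 10m34.431s with -no_shared_bb_ibt_tables.
public class IbtSpeedup {
    // Convert a "XmY.YYYs" timing into seconds.
    static double seconds(int minutes, double secs) {
        return minutes * 60 + secs;
    }

    public static void main(String[] args) {
        double withLock = seconds(14, 29.876); // default run
        double noLock = seconds(10, 34.431);   // -no_shared_bb_ibt_tables run
        // ~1.37x speedup, ~27% of the run time saved
        System.out.printf("speedup: %.2fx, time saved: %.1f%%%n",
                withLock / noLock, 100.0 * (withLock - noLock) / withLock);
    }
}
```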
But I would want to see the whereami breakdown from -prof_pcs first.
Decreased the size of MyFile.txt to 100MB. Command:
~/dynamorio_9.0.19012/.build//bin64/drrun -prof_pcs -prof_pcs_heap_size 4G -vm_size 2G -vmheap_size 24G -- java -XX:-UseCompressedOops -XX:-UseCompressedClassPointers -XX:-TieredCompilation ReadWrite
result1:
DynamoRIO library: 0x0000000071000000-0x0000000071224000
ITIMER distribution (47883):
17.3% of time in INTERPRETER (8281)
80.7% of time in DISPATCH (38643)
0.0% of time in MONITOR (4)
0.0% of time in SYSCALL HANDLER (2)
0.6% of time in SIGNAL HANDLER (290)
0.3% of time in CONTEXT SWITCH (165)
0.2% of time in INDIRECT BRANCH LOOKUP (111)
0.8% of time in FRAGMENT CACHE (379)
0.0% of time in UNKNOWN (8)
Command:
~/dynamorio_9.0.19012/.build//bin64/drrun -no_shared_bb_ibt_tables -prof_pcs -prof_pcs_heap_size 4G -vm_size 2G -vmheap_size 24G -- java -XX:-UseCompressedOops -XX:-UseCompressedClassPointers -XX:-TieredCompilation ReadWrite
result2:
DynamoRIO library: 0x0000000071000000-0x0000000071224000
ITIMER distribution (47715):
16.0% of time in INTERPRETER (7643)
82.6% of time in DISPATCH (39421)
0.0% of time in MONITOR (3)
0.0% of time in SYSCALL HANDLER (1)
0.1% of time in SIGNAL HANDLER (56)
0.3% of time in CONTEXT SWITCH (143)
0.2% of time in INDIRECT BRANCH LOOKUP (107)
0.7% of time in FRAGMENT CACHE (337)
0.0% of time in UNKNOWN (4)
result1-java.or.68074.00000000-100mb-180secs-pcsamples.0.68074.log.zip
result2-java.or.29080.00000000-100mb-180secs-no-shared-pcsamples.0.29080.log.zip
Another interesting thing: I played with the size of the input file. For example, a 20MB file took 30 secs, but 40MB took just 6 secs.
Attached the prof_pcs logs for those cases too: java.or.17127.00000000-40mb-6sec.zip java.or.91437.00000000-20mb-30sec.zip
Kirill
16.0% of time in INTERPRETER (7643)
82.6% of time in DISPATCH (39421)
Looking like #4279: tons of time in dispatch; almost nothing in the cache. I would say, like there, to try -thread_private to see the impact of removing the locks, but unfortunately nobody ever added that support to AArch64: #1884.
Would it be possible to get the pcsamples for x86? Does it build the same number of blocks, and have the same number of threads and the same thread-sharing behavior? How many blocks are built (-rstats_to_stderr)?
How many blocks are built (-rstats_to_stderr)?
x86
DynamoRIO statistics:
Peak threads under DynamoRIO control : 20
Threads ever created : 20
Total signals delivered : 3
System calls, pre : 752
System calls, post : 642
Application mmaps : 191
Application munmaps : 27
Synch attempt failure b/c not at safe spot : 64
Basic block fragments generated : 188184
Peak fcache combined capacity (bytes) : 6344704
Peak fcache units on live list : 130
Peak fcache units on free list : 5
Peak special heap capacity (bytes) : 1994752
Peak heap units on live list : 161
Peak heap units on free list : 5
Peak stack capacity (bytes) : 1859584
Peak heap capacity (bytes) : 20021248
Peak total memory from OS (bytes) : 35102720
Peak vmm blocks for unreachable heap : 5171
Peak vmm blocks for stack : 536
Peak vmm blocks for unreachable special heap : 120
Peak vmm blocks for unreachable special mmap : 140
Peak vmm blocks for reachable heap : 184
Peak vmm blocks for cache : 2080
Peak vmm blocks for reachable special heap : 540
Peak vmm blocks for reachable special mmap : 7
Peak vmm virtual memory in use (bytes) : 35954688
AArch64
DynamoRIO statistics:
Peak threads under DynamoRIO control : 73
Threads ever created : 73
Total signals delivered : 32
System calls, pre : 2064
System calls, post : 1741
Application mmaps : 482
Application munmaps : 56
Synch attempt failure b/c not at safe spot : 161
Basic block fragments generated : 5466007
Peak fcache combined capacity (bytes) : 8933376
Peak fcache units on live list : 224
Peak fcache units on free list : 15
Peak special heap capacity (bytes) : 299008
Peak heap units on live list : 363
Peak heap units on free list : 18
Peak stack capacity (bytes) : 6635520
Peak heap capacity (bytes) : 18182144
Peak total memory from OS (bytes) : 53592064
Peak vmm blocks for unreachable heap : 5792
Peak vmm blocks for stack : 1914
Peak vmm blocks for unreachable special heap : 876
Peak vmm blocks for unreachable special mmap : 511
Peak vmm blocks for reachable heap : 608
Peak vmm blocks for cache : 3584
Peak vmm blocks for reachable special mmap : 7
Peak vmm virtual memory in use (bytes) : 54444032
Load-exclusive instrs converted to CAS : 56721
Store-exclusive instrs converted to CAS : 56686
Would it be possible to get the pcsamples for x86?
DynamoRIO library: 0x00007fd9c3a0d000-0x00007fd9c3c29000
ITIMER distribution (350):
25.1% of time in INTERPRETER (88)
34.9% of time in DISPATCH (122)
0.9% of time in SIGNAL HANDLER (3)
12.3% of time in CONTEXT SWITCH (43)
7.4% of time in INDIRECT BRANCH LOOKUP (26)
19.4% of time in FRAGMENT CACHE (68)
Wow:
Basic block fragments generated : 188184 <------- x86
Basic block fragments generated : 5466007 <------- aarch64
That's the same workload? a64 built 5 1/2 million blocks while x86 built <200K? Unfortunately these release-build stats don't have # of fragments deleted. But a64 should have more precise JIT handling, seeing cache flush requests as opposed to x86 where DR has to guess and act on page granularity and can't distinguish appending new JIT code to the end of a region from changing existing JIT code on that page.
Plus, x86 has traces, which should add slightly to the count of blocks created, while traces are not implemented on a64.
Maybe the thing to do is try to get debug build stats for these 2 workloads? Is DR on a64 really doing 30x as many cache flushes as on x86, maybe due to tons of hardware cache flush requests from the JIT or something??
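The "30x" guess is easy to sanity-check against the rstats quoted above; the basic-block fragment counts alone give about a 29x ratio (class name here is just illustrative):

```java
// Ratio of basic-block fragments generated, from the rstats quoted above:
// x86: 188184, AArch64: 5466007.
public class FragmentRatio {
    static double ratio(long aarch64Count, long x86Count) {
        return (double) aarch64Count / x86Count;
    }

    public static void main(String[] args) {
        // ~29x more fragments built on AArch64 for the same workload
        System.out.printf("AArch64 built %.1fx as many bb fragments as x86%n",
                ratio(5_466_007L, 188_184L));
    }
}
```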
@AssadHashmi wondering if you have any insights here.
That's the same workload?
The same workload with the same MyFile.txt (~100MB).
Plus, x86 has traces, which should add slightly to the count of blocks created, while traces are not implemented on a64.
We don't use traces on x86 either (the -disable_traces option).
Maybe the thing to do is try to get debug build stats for these 2 workloads?
This one: -debug -rstats_to_stderr? There is no difference in the output:
DynamoRIO statistics:
Peak threads under DynamoRIO control : 20
Threads ever created : 20
Total signals delivered : 3
System calls, pre : 769
System calls, post : 661
Application mmaps : 193
Application munmaps : 29
Synch attempt failure b/c not at safe spot : 46
Basic block fragments generated : 189119
Peak fcache combined capacity (bytes) : 6369280
Peak fcache units on live list : 130
Peak fcache units on free list : 5
Peak special heap capacity (bytes) : 1990656
Peak heap units on live list : 170
Peak heap units on free list : 5
Peak stack capacity (bytes) : 1859584
Peak heap capacity (bytes) : 20914176
Peak total memory from OS (bytes) : 36143104
Peak vmm blocks for unreachable heap : 5425
Peak vmm blocks for stack : 536
Peak vmm blocks for unreachable special heap : 120
Peak vmm blocks for unreachable special mmap : 140
Peak vmm blocks for reachable heap : 184
Peak vmm blocks for cache : 2080
Peak vmm blocks for reachable special heap : 540
Peak vmm blocks for reachable special mmap : 7
Peak vmm virtual memory in use (bytes) : 36995072
The debug ones go to the logfile. To get them to the console this would work: -debug -loglevel 1 -log_to_stderr -logmask 0x1. But there are probably several sets of them; we'd want just the last set.
x86
(Begin) All statistics @167699 (0:42.127):
Peak threads under DynamoRIO control : 20
Threads ever created : 20
Total signals delivered : 5
System calls, pre : 790
System calls, post : 692
System calls, pre, ignorable : 67
System calls, post, ignorable : 66
Ignorable system calls : 481
Non-ignorable system calls : 32
Application mmaps : 193
Application munmaps : 27
Application modules with code : 20
Application code seen (bytes) : 2293886
Interpreted calls, direct and indirect : 32684
Interpreted indirect calls : 4320
Interpreted indirect jmps : 14435
Interpreted rets : 14433
Dynamic option synchronizations : 1
Dynamic option synchronizations, no change : 1
Code origin addresses checked : 171485
Code origin addresses in last area : 155421
Writable code regions : 3296
Writable code regions we made read-only : 3296
Write faults on read-only code regions : 3300
Write fault races : 1
Cache consistency flushes : 3299
Cache consistency flushes that flushed nothing : 4
Synch attempt failure b/c not at safe spot : 41
Cache consistency non-code nop flushes : 8
Flushes that flushed >=1 shared fragment : 3295
Shared deletion regions unlinked : 6590
Shared deletion region walks : 49725
Shared deletion region at-syscall walks : 44022
Shared deletion ref count decrements : 52149
Shared deletion flushtime diff > #threads : 1027
Shared deletion max flushtime diff : 36
Shared deletion max pending : 35
Shared deletion region removals: ref 0 : 1265
Shared deletion region removals: at reset : 2030
Resets of entire fcache, proactively : 58
Resets due to too many pending deletions : 58
Generated code protection changes : 13199
Protection change calls : 19798
Protection change pages : 4139743
Fragments generated, bb and trace : 167699
Basic block fragments generated : 167699
Future fragments generated : 159585
Shared fragments generated : 167527
Shared bbs generated : 167527
Private fragments generated : 172
Private bbs generated : 172
Shared future fragments generated : 159241
Unique fragments generated : 96139
Maximum fragment requested size in bytes : 10051
Maximum fragment size in bytes : 10048
Maximum instrs in a bb : 257
BBs truncated due to instr limits : 24
Direct exit stubs created : 226129
Indirect exit stubs created : 32028
Separate stubs created : 135499
Rip-relative instrs seen : 30767
Rip-relative unreachable leas : 4642
Rip-relative unreachable leas in app : 4
Rip-relative unreachable non-leas : 26125
Rip-relative unreachable non-leas in app : 12
Rip-relative unreachable spills avoided : 12522
BBs with one indirect exit : 32028
BBs with one direct exit : 45213
BBs with two direct exits : 90458
BB direct exits >SHRT_MAX from fragment tag : 22907
BB direct exits <=SHRT_MAX from fragment tag : 203222
BB cbr fall-through <=SHRT_MAX from fragment tag : 90458
BBs using post-linkstub fragment offset : 45385
BBs that write OF but no other arithmetic flags : 122849
BBs that read a flag before writing any : 301
BBs that write no arithmetic flags : 48163
BBs that write no arithmetic flags, end in ib : 16044
Cbrs sharing a single exit stub : 90286
Fragments requiring post_linkstub offs : 45385
Fragments unlinked for flushing : 73113
Fragments deleted for any reason : 167699
Fragments unlinked for deletion : 73113
Fragments deleted on thread/process death : 94586
Fragments deleted for munmap or RO consistency : 73113
IBTs replaced unlinked fragments : 11977
IBT resizes : 133
Same-size IBT table resizes : 16
BB fragments targeted by IBL : 51165
Exits due to IBL cold misses : 54212
Fragments regenerated or duplicated : 71560
Recreated fragments, total : 3614
Recreations via app re-decode : 3614
Recreation spill value restores : 2
Branches linked, direct : 167312
Branches linked, indirect : 32028
Waits due to flushing : 1811
Fcache exits, total : 298850
Fcache exits, system call executions : 687
Fcache exits, flushed due to code mod : 3300
Fcache exits, asynch : 10
Fcache exits, proactive reset : 188
Fcache entrances aborted b/c target missing : 1467
Fcache exits, from BBs : 294665
Fcache exits, total indirect branches : 105377
Fcache exits, non-trace indirect branches : 105377
Fcache exits, ind target not in cache : 54211
Fcache exits, ind target in cache but not table : 51166
Fcache exits, from BB, ind target ... : 51166
Fcache exits, BB->BB, ind target ... : 51166
Fcache exits, dir target not in cache : 112805
Fcache exits, link not allowed : 75693
Fcache exits, non-ignorable system call : 790
Fcache exits, no link shared <-> private : 75693
Fcache exits needing cbr disambiguation : 96439
Fragments with OF restore prefix : 47571
Fcache bb peak capacity (bytes) : 81920
Fcache bb peak used (bytes) : 18188
Fcache bb align space (bytes) : -25
Fcache shared bb peak capacity (bytes) : 6361088
Fcache shared bb space used (bytes) : -4266776
Fcache shared bb peak used (bytes) : 3193060
Fcache shared bb align space (bytes) : 8901
Fcache shared bb empty space (bytes) : 174756
Fcache shared bb free coalesce prev : 49899
Fcache shared bb free coalesce next : 149325
Fcache shared bb return last : 2013
Fcache shared bb free use larger bucket : 59926
Fcache shared bb free split : 56916
Peak fcache combined capacity (bytes) : 6443008
Peak fcache units on live list : 131
Peak fcache units on free list : 5
Fcache unit lookups : 1846284
Peak special heap units : 34
Special heap align space (bytes) : 160
Peak special heap capacity (bytes) : 2019328
Peak heap units on live list : 171
Peak heap units on free list : 5
Peak heap align space (bytes) : 193331
Peak heap bucket pad space (bytes) : 34824
Heap allocs in buckets : 2379469
Heap allocs variable-sized : 1474
Total reserved memory : 131072
Peak total reserved memory : 36642816
Guard pages, reserved virtual pages : 4
Peak guard pages, reserved virtual pages : 836
Current stack capacity (bytes) : 114688
Peak stack capacity (bytes) : 1859584
Mmap capacity (bytes) : -1077248
Peak mmap capacity (bytes) : 6672384
Mmap reserved but not committed (bytes) : -6701056
Peak mmap reserved but not committed (bytes) : 1126400
Peak heap claimed (bytes) : 18971153
Current heap capacity (bytes) : 2031616
Peak heap capacity (bytes) : 21381120
Heap reserved but not committed (bytes) : 393216
Peak heap reserved but not committed (bytes) : 4190208
Current total memory from OS (bytes) : 131072
Peak total memory from OS (bytes) : 36642816
Current vmm blocks for unreachable heap : 192
Peak vmm blocks for unreachable heap : 5531
Current vmm blocks for stack : 32
Peak vmm blocks for stack : 536
Peak vmm blocks for unreachable special heap : 120
Peak vmm blocks for unreachable special mmap : 140
Current vmm blocks for reachable heap : 16
Peak vmm blocks for reachable heap : 184
Peak vmm blocks for cache : 2096
Peak vmm blocks for reachable special heap : 540
Peak vmm blocks for reachable special mmap : 7
Our virtual memory blocks in use : 240
Peak our virtual memory blocks in use : 9154
Allocations using multiple vmm blocks : 426
Blocks used for multi-block allocs : 9812
Current vmm virtual memory in use (bytes) : 983040
Peak vmm virtual memory in use (bytes) : 37494784
Total times mutexes acquired : 9404971
Total times mutexes contended : 922572
Number of safe reads : 208839
Number of vmarea vector resize reallocations : 2
Peak vmarea vector length : 195
Peak dynamo areas vector length : 1
Peak executable areas vector length : 21
-pad_jmps fragments size overestimated : 126276
-pad_jmps excess instances coalesced w/ nxt free : 43779
-pad_jmps excess instances failed to be returned : 9179
-pad_jmps excess bytes failed to be returned : 104176
-pad_jmps body bytes shared bb : 502581
-pad_jmps excess bytes shared bb : 477335
Bytes shared frags ever : 8812542
-pad_jmps start_pcs shifted shared bb : 12841
-pad_jmps start_pcs shifted bytes shared bb : 25246
-pad_jmps excess bytes released shared bb : 486236
-pad_jmps no pad exits shared bb : 244972
-pad_jmps body bytes bb : 516
-pad_jmps excess bytes bb : 505
Bytes bb frags ever : 16852
-pad_jmps start_pcs shifted bb : 5
-pad_jmps start_pcs shifted bytes bb : 11
-pad_jmps excess bytes released bb : 480
-pad_jmps no shift stubs bb : 344
-pad_jmps no pad exits bb : 339
Num synch yields : 39
Num synch loops in wait_at_safe_spot : 81
App reference with FS/GS seg being mangled : 497
(End) All statistics
AArch64
(Begin) All statistics @8787489 (6:40.937):
Peak threads under DynamoRIO control : 73
Threads ever created : 73
Total signals delivered : 28
System calls, pre : 2864
System calls, post : 2541
System calls, pre, ignorable : 551
System calls, post, ignorable : 484
Ignorable system calls : 321320
Non-ignorable system calls : 31
Application mmaps : 482
Application munmaps : 66
Application modules with code : 17
Application code seen (bytes) : 1841476
Interpreted calls, direct and indirect : 323937
Interpreted indirect calls : 85636
Interpreted indirect jmps : 1562310
Interpreted rets : 426312
Dynamic option synchronizations : 1
Dynamic option synchronizations, no change : 1
Code origin addresses checked : 9108923
Code origin addresses in last area : 9044136
Writable code regions : 1
Cache consistency flushes : 41799
Cache consistency flushes via synchall : 41799
Thread not translated in synchall flush (race) : 153
Synch attempt failure b/c not at safe spot : 386
Cache consistency non-code nop flushes : 2
Shared deletion region walks : 1
Generated code protection changes : 3
Protection change calls : 9
Protection change pages : 87
Fragments generated, bb and trace : 8787489
Basic block fragments generated : 8787489
Future fragments generated : 8783677
Shared fragments generated : 8786947
Shared bbs generated : 8786947
Private fragments generated : 542
Private bbs generated : 542
Shared future fragments generated : 8782593
Unique fragments generated : 84209
Maximum fragment requested size in bytes : 1608
Maximum fragment size in bytes : 1608
Maximum instrs in a bb : 257
BBs truncated due to instr limits : 33
Direct exit stubs created : 12467301
Indirect exit stubs created : 2074225
BBs with one indirect exit : 2074225
BBs with one direct exit : 959227
BBs with two direct exits : 5754037
BB direct exits >SHRT_MAX from fragment tag : 179784
BB direct exits <=SHRT_MAX from fragment tag : 12287517
BB cbr fall-through <=SHRT_MAX from fragment tag : 5754037
BBs using post-linkstub fragment offset : 959769
BBs that read a flag before writing any : 25000
BBs that write no arithmetic flags : 9083381
BBs that write no arithmetic flags, end in ib : 2074250
Far direct links : 262679
Fragments requiring post_linkstub offs : 959769
Fragments deleted for any reason : 8787489
Fragments deleted on thread/process death : 82416
Fragments deleted for munmap or RO consistency : 8705073
IBT resizes : 204
BB fragments targeted by IBL : 1663190
Exits due to IBL cold misses : 2181615
Fragments regenerated or duplicated : 8703280
Recreated fragments, total : 320892
Recreations via app re-decode : 320890
Recreation spill value restores : 2
Branches linked, direct : 8446906
Branches linked, indirect : 2074225
Fcache exits, total : 10728303
Fcache exits, system call executions : 2513
Fcache exits, flushed due to code mod : 41801
Fcache exits, asynch : 56
Fcache exits, proactive reset : 160800
Fcache exits, from BBs : 10523133
Fcache exits, total indirect branches : 3844805
Fcache exits, non-trace indirect branches : 3844805
Fcache exits, ind target not in cache : 2181615
Fcache exits, ind target in cache but not table : 1663190
Fcache exits, from BB, ind target ... : 1663190
Fcache exits, BB->BB, ind target ... : 1663190
Fcache exits, dir target not in cache : 6605893
Fcache exits, link not allowed : 69570
Fcache exits, non-ignorable system call : 2864
Fcache exits, no link shared <-> private : 69570
Fcache exits, presumed link race condition : 1
Fcache bb peak capacity (bytes) : 303104
Fcache bb peak used (bytes) : 68644
Fcache shared bb peak capacity (bytes) : 8601600
Fcache shared bb space used (bytes) : -909305952
Fcache shared bb peak used (bytes) : 5003156
Fcache shared bb empty space (bytes) : 40787788
Fcache shared bb free coalesce prev : 46742
Fcache shared bb free coalesce next : 7596301
Fcache shared bb return last : 389245
Fcache shared bb free use larger bucket : 7672803
Fcache shared bb free split : 7561820
Peak fcache combined capacity (bytes) : 8904704
Peak fcache units on live list : 223
Peak fcache units on free list : 15
Fcache unit lookups : 26444866
Peak special heap units : 73
Peak special heap capacity (bytes) : 299008
Peak heap units on live list : 367
Peak heap units on free list : 18
Peak heap align space (bytes) : 165188
Peak heap bucket pad space (bytes) : 54984
Heap allocs in buckets : 203061461
Heap allocs variable-sized : 3043932
Total reserved memory : 131072
Peak total reserved memory : 54665216
Guard pages, reserved virtual pages : 4
Peak guard pages, reserved virtual pages : 1914
Current stack capacity (bytes) : 114688
Peak stack capacity (bytes) : 6635520
Mmap capacity (bytes) : -3895296
Peak mmap capacity (bytes) : 9801728
Mmap reserved but not committed (bytes) : -9809920
Peak mmap reserved but not committed (bytes) : 3948544
Current heap capacity (bytes) : 299008
Peak heap capacity (bytes) : 19300352
Heap reserved but not committed (bytes) : 2691072
Peak heap reserved but not committed (bytes) : 9412608
Current total memory from OS (bytes) : 131072
Peak total memory from OS (bytes) : 54665216
Current vmm blocks for unreachable heap : 192
Peak vmm blocks for unreachable heap : 6070
Current vmm blocks for stack : 32
Peak vmm blocks for stack : 1914
Peak vmm blocks for unreachable special heap : 876
Peak vmm blocks for unreachable special mmap : 511
Current vmm blocks for reachable heap : 16
Peak vmm blocks for reachable heap : 608
Peak vmm blocks for cache : 3568
Peak vmm blocks for reachable special mmap : 7
Our virtual memory blocks in use : 240
Peak our virtual memory blocks in use : 13554
Allocations using multiple vmm blocks : 963
Blocks used for multi-block allocs : 14078
Current vmm virtual memory in use (bytes) : 983040
Peak vmm virtual memory in use (bytes) : 55517184
Total times mutexes acquired : 387772089
Total times mutexes contended : 149674
Number of safe reads : 361704
Number of vmarea vector resize reallocations : 5
Peak vmarea vector length : 387
Peak dynamo areas vector length : 1
Peak executable areas vector length : 18
Bytes shared frags ever : 869128496
Bytes bb frags ever : 64716
Num synch yields : 103654
Num synch loops in wait_at_safe_spot : 850
Multiple setcontexts while in wait_at_safe_spot : 537
Load-exclusive instrs converted to CAS : 87216
Store-exclusive instrs converted to CAS : 87198
(End) All statistics
@AssadHashmi is there any ongoing work looking at AArch64 overhead? There seem to be several cases of significant overhead for DR on AArch64 compared to DR on x86: this issue, #4279.
There isn't any ongoing work on AArch64 overhead ATM.
heavyweight. @AssadHashmi -- could this be skipped for brand-new code in a fresh mmap? Xref #1911 (ARM machine_cache_sync is not thread-safe).
Yes AFAICT but I don't have much experience in that area. I'll set aside time to have a proper look at it. This issue was raised a while ago and is still open: https://github.com/DynamoRIO/dynamorio/issues/2065
Definitely there are many more fragments deleted and re-created on AArch64. It seems to be deleting and re-generating over and over. x86:
Writable code regions we made read-only : 3296
Write faults on read-only code regions : 3300
Cache consistency flushes : 3299
Cache consistency flushes that flushed nothing : 4
Cache consistency non-code nop flushes : 8
Flushes that flushed >=1 shared fragment : 3295 <-- unlink flush only
Fragments generated, bb and trace : 167699
Fragments deleted for any reason : 167699
Fragments deleted for munmap or RO consistency : 73113
Fragments regenerated or duplicated : 71560
Fcache exits, proactive reset : 188
aarch64:
Cache consistency flushes : 41799
Cache consistency flushes via synchall : 41799
Cache consistency non-code nop flushes : 2
Fragments generated, bb and trace : 8787489
Fragments deleted for any reason : 8787489
Fragments deleted for munmap or RO consistency : 8705073
Fragments regenerated or duplicated : 8703280
Fcache exits, proactive reset : 160800 <--- includes flush synchall
Suggestions: expose -hw_cache_consistency as an option so we can turn it off and see the performance effect (even if that messes up app correctness).
I compiled the ReadWrite.java from above on an a64 machine but did not have enough time to fully analyze it. Up front I hit:
<Undefined HINT instruction found: encoding 0xd503245f (CRm:op2 0x22)
filed as #5331. I killed it at -loglevel 3 after maybe a minute and at first glance saw a whole bunch of empty synchall flushes: so not yet seeing the deletions in the stats above; maybe that comes later in a long run.
Not sure when I will have time to look at this again so if someone else wants to continue analyzing please do so.
Just an interesting thing: if I set -private_bb_ibl_targets_init to 16 (the BB IBL tables' initial size), the fast 6-sec 40MB sample takes more than 1 minute to finish.
Tried a 7MB file for debug and release:
release took 22 secs and Basic block fragments generated : 317478
debug took 16 secs and Basic block fragments generated : 120560
Continuing our experiments on AArch64, we saw a strange thing on IO workloads: they take an awful amount of time. I compared the same run on x86 and it is fine.
ReadWrite.java is the simplest one-thread Java application (MyFile.txt is a 350MB text file).
The original run without DynamoRIO took ~2 secs. x86 with DynamoRIO took about 3 secs, BUT AArch64 with DynamoRIO took 14 minutes (420x overhead) :((
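The ReadWrite.java source itself isn't posted in this thread; a minimal single-threaded sketch consistent with the description (one thread streaming MyFile.txt through a copy) might look like the following (the class name matches the command lines above, but the body is a guess):

```java
import java.io.*;

// Hypothetical reconstruction of the ReadWrite benchmark described above:
// a single thread copies MyFile.txt line by line. The real source was not
// posted in this thread, so the details here are assumptions.
public class ReadWrite {
    // Copy `in` to `out` line by line; returns the number of lines copied.
    static long copy(String in, String out) throws IOException {
        long lines = 0;
        try (BufferedReader r = new BufferedReader(new FileReader(in));
             BufferedWriter w = new BufferedWriter(new FileWriter(out))) {
            String line;
            while ((line = r.readLine()) != null) {
                w.write(line);
                w.newLine();
                lines++;
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        long t0 = System.nanoTime();
        long lines = copy("MyFile.txt", "MyFileCopy.txt");
        System.out.printf("copied %d lines in %.2fs%n",
                lines, (System.nanoTime() - t0) / 1e9);
    }
}
```

Even a trivial loop like this exercises the JIT heavily at startup, which is consistent with the flush-dominated profiles discussed above.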
perf hotspots (image attached)
opcodes statistics (image attached)
Also I've tried to check the native C++ version on AArch64 and it looks fine:
opcodes statistics (image attached)
Maybe you have ideas about what could be wrong here? Why do we have such a difference between x86 and AArch64 on such Java workloads? Thanks, Kirill