DynamoRIO / dynamorio

Dynamic Instrumentation Tool Platform
Other
2.67k stars 562 forks source link

[AArch64][jdk8] IO java workload has significant overhead without any clients. #5309

Open kuhanov opened 2 years ago

kuhanov commented 2 years ago

Continue our experiments on AArch64. We saw strange thing on io workloads. They take awful count of time. I compare the same run on x86 and it is fine.

ReadWrite.java the simplest one thread java application (MyFile.txt is text 350 MB file)

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.FileReader;
import java.io.IOException;

public class ReadWrite {
.
    public static void main(String[] args) {
        try {
            FileReader reader = new FileReader("MyFile.txt");
            FileWriter writer = new FileWriter("MyFile2.txt", true);

            BufferedReader bufferedReader = new BufferedReader(reader);
            BufferedWriter bufferedWriter = new BufferedWriter(writer);

            String line;
            while ((line = bufferedReader.readLine()) != null) {
                bufferedWriter.write(line);
                bufferedWriter.newLine();
            }
            reader.close();
            writer.close();

        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}

Original run without DynamoRIO took ~2 secs. x86 with DynamoRIO took about 3 secs BUT AArch64 with DRIO took 14 minutes (420x overhead) :((

perf hotspots looks image

opcodes statistics

Top 15 opcode execution counts in 64-bit AArch64 mode:
   90371625 : movz
  109692827 : cbz
  119677065 : ldrb
  141852747 : ldp
  149865875 : stp
  150764846 : sub
  186207835 : ubfm
  257289620 : str
  311057147 : orr
  372462682 : ldrsb
  373357880 : strh
  379415317 : strb
  736110803 : ldrh
  760824735 : ldr
  2145521407 : subs
  2182566199 : bcond
  2473296771 : add

Also I've tried to check c++ native version on AArch64 and it looks fine

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main () {
  string line;
  ifstream myfile ("MyFile.txt");
  ofstream myfile2 ("MyFile2.txt");
  if (myfile.is_open() && myfile2.is_open())
  {
    while ( getline (myfile,line) )
    {
      myfile2 <<  line << '\n';
    }
    myfile.close();
    myfile2.close();
  }

  else cout << "Unable to open file";

  return 0;
}

opcodes statistics

Top 15 opcode execution counts in 64-bit AArch64 mode:
   38881409 : sub
   41657277 : bl
   43253273 : cbz
   44942292 : cbnz
   48018122 : br
   54446025 : ret
   63293933 : ldur
   64014752 : movz
   64051241 : adrp
   99217398 : stp
  124038866 : bcond
  127033323 : subs
  127577743 : ldp
  148219600 : str
  211563739 : add
  222393285 : orr
  353538698 : ldr

Maybe you have ideas what could be wrong here? Why do we have such difference between x86 and AArch64 on such java workloads? Thanks, Kirill

derekbruening commented 2 years ago

Xref #4279

derekbruening commented 2 years ago

@AssadHashmi is there any ongoing work looking at AArch64 overhead? There seem to be several cases of significant overhead for DR on AArch64 compared to DR on x86: this issue, #4279.

derekbruening commented 2 years ago

Please run with -prof_pcs and get the category breakdown to see where time is spent. This is a critical step to understand the behavior.

The top functions in the profile image above:

But I would want to see the whereami breakdown from -prof_pcs first.

kuhanov commented 2 years ago
  • fragment_remove_from_ibt_tables: This would only be slow due to the lock it grabs? -shared_bb_ibt_tables is on by default: try -no_shared_bb_ibt_tables to get rid of the global lock?

test with -no_shared_bb_ibt_tables took 10m34.431s vs 14m29.876s.

kuhanov commented 2 years ago

But I would want to see the whereami breakdown from -prof_pcs first.

Decrease the size of MyFile.txt to 100mb. cmd

~/dynamorio_9.0.19012/.build//bin64/drrun -prof_pcs  -prof_pcs_heap_size 4G -vm_size 2G -vmheap_size 24G -- java -XX:-UseCompressedOops -XX:-UseCompressedClassPointers -XX:-TieredCompilation ReadWrite

result1:

DynamoRIO library: 0x0000000071000000-0x0000000071224000
ITIMER distribution (47883):
   17.3% of time in INTERPRETER (8281)
   80.7% of time in DISPATCH (38643)
    0.0% of time in MONITOR (4)
    0.0% of time in SYSCALL HANDLER (2)
    0.6% of time in SIGNAL HANDLER (290)
    0.3% of time in CONTEXT SWITCH (165)
    0.2% of time in INDIRECT BRANCH LOOKUP (111)
    0.8% of time in FRAGMENT CACHE (379)
    0.0% of time in UNKNOWN (8)

cmd

~/dynamorio_9.0.19012/.build//bin64/drrun -no_shared_bb_ibt_tables -prof_pcs   -prof_pcs_heap_size 4G -vm_size 2G -vmheap_size 24G -- java -XX:-UseCompressedOops -XX:-UseCompressedClassPointers -XX:-TieredCompilation ReadWrite

result2:

DynamoRIO library: 0x0000000071000000-0x0000000071224000
ITIMER distribution (47715):
   16.0% of time in INTERPRETER (7643)
   82.6% of time in DISPATCH (39421)
    0.0% of time in MONITOR (3)
    0.0% of time in SYSCALL HANDLER (1)
    0.1% of time in SIGNAL HANDLER (56)
    0.3% of time in CONTEXT SWITCH (143)
    0.2% of time in INDIRECT BRANCH LOOKUP (107)
    0.7% of time in FRAGMENT CACHE (337)
    0.0% of time in UNKNOWN (4)

result1-java.or.68074.00000000-100mb-180secs-pcsamples.0.68074.log.zip

result2-java.or.29080.00000000-100mb-180secs-no-shared-pcsamples.0.29080.log.zip

kuhanov commented 2 years ago

Another interesting this is I played with the size of input file for example, 20 mb file took 30secs, but 40mb took just 6 sec

attached prof_cps for such cases too java.or.17127.00000000-40mb-6sec.zip java.or.91437.00000000-20mb-30sec.zip

Kirill

derekbruening commented 2 years ago
   16.0% of time in INTERPRETER (7643)
   82.6% of time in DISPATCH (39421)

Looking like #4279: tons of time in dispatch; almost nothing in cache. I would say, like there, to try -thread_private to see the impact on removing the locks, but unfortunately nobody ever added that support to AArch64: #1884.

Would it be possible to get the pcsamples for x86? Does it build the same number of blocks, and has the same number of threads and same thread sharing behavior? How many blocks are built (-rstats_to_stderr)?

kuhanov commented 2 years ago

How many blocks are built (-rstats_to_stderr)?

x86

DynamoRIO statistics:
              Peak threads under DynamoRIO control :                20
                              Threads ever created :                20
                           Total signals delivered :                 3
                                 System calls, pre :               752
                                System calls, post :               642
                                 Application mmaps :               191
                               Application munmaps :                27
        Synch attempt failure b/c not at safe spot :                64
                   Basic block fragments generated :            188184
             Peak fcache combined capacity (bytes) :           6344704
                    Peak fcache units on live list :               130
                    Peak fcache units on free list :                 5
                Peak special heap capacity (bytes) :           1994752
                      Peak heap units on live list :               161
                      Peak heap units on free list :                 5
                       Peak stack capacity (bytes) :           1859584
                        Peak heap capacity (bytes) :          20021248
                 Peak total memory from OS (bytes) :          35102720
              Peak vmm blocks for unreachable heap :              5171
                         Peak vmm blocks for stack :               536
      Peak vmm blocks for unreachable special heap :               120
      Peak vmm blocks for unreachable special mmap :               140
                Peak vmm blocks for reachable heap :               184
                         Peak vmm blocks for cache :              2080
        Peak vmm blocks for reachable special heap :               540
        Peak vmm blocks for reachable special mmap :                 7
            Peak vmm virtual memory in use (bytes) :          35954688

AArch64

DynamoRIO statistics:
              Peak threads under DynamoRIO control :                73
                              Threads ever created :                73
                           Total signals delivered :                32
                                 System calls, pre :              2064
                                System calls, post :              1741
                                 Application mmaps :               482
                               Application munmaps :                56
        Synch attempt failure b/c not at safe spot :               161
                   Basic block fragments generated :           5466007
             Peak fcache combined capacity (bytes) :           8933376
                    Peak fcache units on live list :               224
                    Peak fcache units on free list :                15
                Peak special heap capacity (bytes) :            299008
                      Peak heap units on live list :               363
                      Peak heap units on free list :                18
                       Peak stack capacity (bytes) :           6635520
                        Peak heap capacity (bytes) :          18182144
                 Peak total memory from OS (bytes) :          53592064
              Peak vmm blocks for unreachable heap :              5792
                         Peak vmm blocks for stack :              1914
      Peak vmm blocks for unreachable special heap :               876
      Peak vmm blocks for unreachable special mmap :               511
                Peak vmm blocks for reachable heap :               608
                         Peak vmm blocks for cache :              3584
        Peak vmm blocks for reachable special mmap :                 7
            Peak vmm virtual memory in use (bytes) :          54444032
            Load-exclusive instrs converted to CAS :             56721
           Store-exclusive instrs converted to CAS :             56686
kuhanov commented 2 years ago

Would it be possible to get the pcsamples for x86?

DynamoRIO library: 0x00007fd9c3a0d000-0x00007fd9c3c29000
ITIMER distribution (350):
25.1% of time in INTERPRETER (88)
34.9% of time in DISPATCH (122)
0.9% of time in SIGNAL HANDLER (3)
12.3% of time in CONTEXT SWITCH (43)
7.4% of time in INDIRECT BRANCH LOOKUP (26)
19.4% of time in FRAGMENT CACHE (68)

full log java.or.2017190.00000000-x86-100mb.zip

derekbruening commented 2 years ago

Wow:

                   Basic block fragments generated :            188184   <------- x86
                   Basic block fragments generated :           5466007   <------- aarch64

That's the same workload? a64 built 5 1/2 million blocks while x86 built <200K? Unfortunately these release-build stats don't have # of fragments deleted. But a64 should have more precise JIT handling, seeing cache flush requests as opposed to x86 where DR has to guess and act on page granularity and can't distinguish appending new JIT code to the end of a region from changing existing JIT code on that page.

Plus, x86 has traces, which should result in a little extra count of blocks created, while traces are not implement on a64.

Maybe the thing to do is try to get debug build stats for these 2 workloads? Is DR on a64 really doing 30x as many cache flushes as on x86, maybe due to tons of hardware cache flush requests from the JIT or something??

@AssadHashmi wondering if you have any insights here.

kuhanov commented 2 years ago

That's the same workload?

the same workload with the same MyFile.txt (~100mb)

kuhanov commented 2 years ago

Plus, x86 has traces, which should result in a little extra count of blocks created, while traces are not implement on a64.

we don't use traces for x86 too (-disable_traces option)

kuhanov commented 2 years ago

Maybe the thing to do is try to get debug build stats for these 2 workloads?

this one? -debug -rstats_to_stderr thre are no difference in output

DynamoRIO statistics:
              Peak threads under DynamoRIO control :                20
                              Threads ever created :                20
                           Total signals delivered :                 3
                                 System calls, pre :               769
                                System calls, post :               661
                                 Application mmaps :               193
                               Application munmaps :                29
        Synch attempt failure b/c not at safe spot :                46
                   Basic block fragments generated :            189119
             Peak fcache combined capacity (bytes) :           6369280
                    Peak fcache units on live list :               130
                    Peak fcache units on free list :                 5
                Peak special heap capacity (bytes) :           1990656
                      Peak heap units on live list :               170
                      Peak heap units on free list :                 5
                       Peak stack capacity (bytes) :           1859584
                        Peak heap capacity (bytes) :          20914176
                 Peak total memory from OS (bytes) :          36143104
              Peak vmm blocks for unreachable heap :              5425
                         Peak vmm blocks for stack :               536
      Peak vmm blocks for unreachable special heap :               120
      Peak vmm blocks for unreachable special mmap :               140
                Peak vmm blocks for reachable heap :               184
                         Peak vmm blocks for cache :              2080
        Peak vmm blocks for reachable special heap :               540
        Peak vmm blocks for reachable special mmap :                 7
            Peak vmm virtual memory in use (bytes) :          36995072
derekbruening commented 2 years ago

The debug ones go to the logfile. To get them to the console this would work: -debug -loglevel 1 -log_to_stderr -logmask 0x1. But there are probably several sets of them; we'd want just the last set.

kuhanov commented 2 years ago

x86

(Begin) All statistics @167699 (0:42.127):
              Peak threads under DynamoRIO control :                20
                              Threads ever created :                20
                           Total signals delivered :                 5
                                 System calls, pre :               790
                                System calls, post :               692
                      System calls, pre, ignorable :                67
                     System calls, post, ignorable :                66
                            Ignorable system calls :               481
                        Non-ignorable system calls :                32
                                 Application mmaps :               193
                               Application munmaps :                27
                     Application modules with code :                20
                     Application code seen (bytes) :           2293886
            Interpreted calls, direct and indirect :             32684
                        Interpreted indirect calls :              4320
                         Interpreted indirect jmps :             14435
                                  Interpreted rets :             14433
                   Dynamic option synchronizations :                 1
        Dynamic option synchronizations, no change :                 1
                     Code origin addresses checked :            171485
                Code origin addresses in last area :            155421
                             Writable code regions :              3296
           Writable code regions we made read-only :              3296
            Write faults on read-only code regions :              3300
                                 Write fault races :                 1
                         Cache consistency flushes :              3299
    Cache consistency flushes that flushed nothing :                 4
        Synch attempt failure b/c not at safe spot :                41
            Cache consistency non-code nop flushes :                 8
          Flushes that flushed >=1 shared fragment :              3295
                  Shared deletion regions unlinked :              6590
                      Shared deletion region walks :             49725
           Shared deletion region at-syscall walks :             44022
              Shared deletion ref count decrements :             52149
         Shared deletion flushtime diff > #threads :              1027
                Shared deletion max flushtime diff :                36
                       Shared deletion max pending :                35
            Shared deletion region removals: ref 0 :              1265
         Shared deletion region removals: at reset :              2030
              Resets of entire fcache, proactively :                58
          Resets due to too many pending deletions :                58
                 Generated code protection changes :             13199
                           Protection change calls :             19798
                           Protection change pages :           4139743
                 Fragments generated, bb and trace :            167699
                   Basic block fragments generated :            167699
                        Future fragments generated :            159585
                        Shared fragments generated :            167527
                              Shared bbs generated :            167527
                       Private fragments generated :               172
                             Private bbs generated :               172
                 Shared future fragments generated :            159241
                        Unique fragments generated :             96139
          Maximum fragment requested size in bytes :             10051
                    Maximum fragment size in bytes :             10048
                            Maximum instrs in a bb :               257
                 BBs truncated due to instr limits :                24
                         Direct exit stubs created :            226129
                       Indirect exit stubs created :             32028
                            Separate stubs created :            135499
                          Rip-relative instrs seen :             30767
                     Rip-relative unreachable leas :              4642
              Rip-relative unreachable leas in app :                 4
                 Rip-relative unreachable non-leas :             26125
          Rip-relative unreachable non-leas in app :                12
           Rip-relative unreachable spills avoided :             12522
                        BBs with one indirect exit :             32028
                          BBs with one direct exit :             45213
                         BBs with two direct exits :             90458
       BB direct exits >SHRT_MAX from fragment tag :             22907
      BB direct exits <=SHRT_MAX from fragment tag :            203222
  BB cbr fall-through <=SHRT_MAX from fragment tag :             90458
           BBs using post-linkstub fragment offset :             45385
   BBs that write OF but no other arithmetic flags :            122849
           BBs that read a flag before writing any :               301
                BBs that write no arithmetic flags :             48163
     BBs that write no arithmetic flags, end in ib :             16044
                   Cbrs sharing a single exit stub :             90286
            Fragments requiring post_linkstub offs :             45385
                   Fragments unlinked for flushing :             73113
                  Fragments deleted for any reason :            167699
                   Fragments unlinked for deletion :             73113
         Fragments deleted on thread/process death :             94586
    Fragments deleted for munmap or RO consistency :             73113
                  IBTs replaced unlinked fragments :             11977
                                       IBT resizes :               133
                       Same-size IBT table resizes :                16
                      BB fragments targeted by IBL :             51165
                      Exits due to IBL cold misses :             54212
               Fragments regenerated or duplicated :             71560
                        Recreated fragments, total :              3614
                     Recreations via app re-decode :              3614
                   Recreation spill value restores :                 2
                           Branches linked, direct :            167312
                         Branches linked, indirect :             32028
                             Waits due to flushing :              1811
                               Fcache exits, total :            298850
              Fcache exits, system call executions :               687
             Fcache exits, flushed due to code mod :              3300
                              Fcache exits, asynch :                10
                     Fcache exits, proactive reset :               188
       Fcache entrances aborted b/c target missing :              1467
                            Fcache exits, from BBs :            294665
             Fcache exits, total indirect branches :            105377
         Fcache exits, non-trace indirect branches :            105377
             Fcache exits, ind target not in cache :             54211
   Fcache exits, ind target in cache but not table :             51166
             Fcache exits, from BB, ind target ... :             51166
              Fcache exits, BB->BB, ind target ... :             51166
             Fcache exits, dir target not in cache :            112805
                    Fcache exits, link not allowed :             75693
           Fcache exits, non-ignorable system call :               790
          Fcache exits, no link shared <-> private :             75693
           Fcache exits needing cbr disambiguation :             96439
                  Fragments with OF restore prefix :             47571
                   Fcache bb peak capacity (bytes) :             81920
                       Fcache bb peak used (bytes) :             18188
                     Fcache bb align space (bytes) :               -25
            Fcache shared bb peak capacity (bytes) :           6361088
               Fcache shared bb space used (bytes) :          -4266776
                Fcache shared bb peak used (bytes) :           3193060
              Fcache shared bb align space (bytes) :              8901
              Fcache shared bb empty space (bytes) :            174756
               Fcache shared bb free coalesce prev :             49899
               Fcache shared bb free coalesce next :            149325
                      Fcache shared bb return last :              2013
           Fcache shared bb free use larger bucket :             59926
                       Fcache shared bb free split :             56916
             Peak fcache combined capacity (bytes) :           6443008
                    Peak fcache units on live list :               131
                    Peak fcache units on free list :                 5
                               Fcache unit lookups :           1846284
                           Peak special heap units :                34
                  Special heap align space (bytes) :               160
                Peak special heap capacity (bytes) :           2019328
                      Peak heap units on live list :               171
                      Peak heap units on free list :                 5
                     Peak heap align space (bytes) :            193331
                Peak heap bucket pad space (bytes) :             34824
                            Heap allocs in buckets :           2379469
                        Heap allocs variable-sized :              1474
                             Total reserved memory :            131072
                        Peak total reserved memory :          36642816
               Guard pages, reserved virtual pages :                 4
          Peak guard pages, reserved virtual pages :               836
                    Current stack capacity (bytes) :            114688
                       Peak stack capacity (bytes) :           1859584
                             Mmap capacity (bytes) :          -1077248
                        Peak mmap capacity (bytes) :           6672384
           Mmap reserved but not committed (bytes) :          -6701056
      Peak mmap reserved but not committed (bytes) :           1126400
                         Peak heap claimed (bytes) :          18971153
                     Current heap capacity (bytes) :           2031616
                        Peak heap capacity (bytes) :          21381120
           Heap reserved but not committed (bytes) :            393216
      Peak heap reserved but not committed (bytes) :           4190208
              Current total memory from OS (bytes) :            131072
                 Peak total memory from OS (bytes) :          36642816
           Current vmm blocks for unreachable heap :               192
              Peak vmm blocks for unreachable heap :              5531
                      Current vmm blocks for stack :                32
                         Peak vmm blocks for stack :               536
      Peak vmm blocks for unreachable special heap :               120
      Peak vmm blocks for unreachable special mmap :               140
             Current vmm blocks for reachable heap :                16
                Peak vmm blocks for reachable heap :               184
                         Peak vmm blocks for cache :              2096
        Peak vmm blocks for reachable special heap :               540
        Peak vmm blocks for reachable special mmap :                 7
                  Our virtual memory blocks in use :               240
             Peak our virtual memory blocks in use :              9154
             Allocations using multiple vmm blocks :               426
                Blocks used for multi-block allocs :              9812
         Current vmm virtual memory in use (bytes) :            983040
            Peak vmm virtual memory in use (bytes) :          37494784
                      Total times mutexes acquired :           9404971
                     Total times mutexes contended :            922572
                              Number of safe reads :            208839
      Number of vmarea vector resize reallocations :                 2
                         Peak vmarea vector length :               195
                   Peak dynamo areas vector length :                 1
               Peak executable areas vector length :                21
            -pad_jmps fragments size overestimated :            126276
  -pad_jmps excess instances coalesced w/ nxt free :             43779
  -pad_jmps excess instances failed to be returned :              9179
      -pad_jmps excess bytes failed to be returned :            104176
                    -pad_jmps body bytes shared bb :            502581
                  -pad_jmps excess bytes shared bb :            477335
                           Bytes shared frags ever :           8812542
             -pad_jmps start_pcs shifted shared bb :             12841
       -pad_jmps start_pcs shifted bytes shared bb :             25246
         -pad_jmps excess bytes released shared bb :            486236
                  -pad_jmps no pad exits shared bb :            244972
                           -pad_jmps body bytes bb :               516
                         -pad_jmps excess bytes bb :               505
                               Bytes bb frags ever :             16852
                    -pad_jmps start_pcs shifted bb :                 5
              -pad_jmps start_pcs shifted bytes bb :                11
                -pad_jmps excess bytes released bb :               480
                       -pad_jmps no shift stubs bb :               344
                         -pad_jmps no pad exits bb :               339
                                  Num synch yields :                39
              Num synch loops in wait_at_safe_spot :                81
        App reference with FS/GS seg being mangled :               497
(End) All statistics

AArch64

(Begin) All statistics @8787489 (6:40.937):
              Peak threads under DynamoRIO control :                73
                              Threads ever created :                73
                           Total signals delivered :                28
                                 System calls, pre :              2864
                                System calls, post :              2541
                      System calls, pre, ignorable :               551
                     System calls, post, ignorable :               484
                            Ignorable system calls :            321320
                        Non-ignorable system calls :                31
                                 Application mmaps :               482
                               Application munmaps :                66
                     Application modules with code :                17
                     Application code seen (bytes) :           1841476
            Interpreted calls, direct and indirect :            323937
                        Interpreted indirect calls :             85636
                         Interpreted indirect jmps :           1562310
                                  Interpreted rets :            426312
                   Dynamic option synchronizations :                 1
        Dynamic option synchronizations, no change :                 1
                     Code origin addresses checked :           9108923
                Code origin addresses in last area :           9044136
                             Writable code regions :                 1
                         Cache consistency flushes :             41799
            Cache consistency flushes via synchall :             41799
    Thread not translated in synchall flush (race) :               153
        Synch attempt failure b/c not at safe spot :               386
            Cache consistency non-code nop flushes :                 2
                      Shared deletion region walks :                 1
                 Generated code protection changes :                 3
                           Protection change calls :                 9
                           Protection change pages :                87
                 Fragments generated, bb and trace :           8787489
                   Basic block fragments generated :           8787489
                        Future fragments generated :           8783677
                        Shared fragments generated :           8786947
                              Shared bbs generated :           8786947
                       Private fragments generated :               542
                             Private bbs generated :               542
                 Shared future fragments generated :           8782593
                        Unique fragments generated :             84209
          Maximum fragment requested size in bytes :              1608
                    Maximum fragment size in bytes :              1608
                            Maximum instrs in a bb :               257
                 BBs truncated due to instr limits :                33
                         Direct exit stubs created :          12467301
                       Indirect exit stubs created :           2074225
                        BBs with one indirect exit :           2074225
                          BBs with one direct exit :            959227
                         BBs with two direct exits :           5754037
       BB direct exits >SHRT_MAX from fragment tag :            179784
      BB direct exits <=SHRT_MAX from fragment tag :          12287517
  BB cbr fall-through <=SHRT_MAX from fragment tag :           5754037
           BBs using post-linkstub fragment offset :            959769
           BBs that read a flag before writing any :             25000
                BBs that write no arithmetic flags :           9083381
     BBs that write no arithmetic flags, end in ib :           2074250
                                  Far direct links :            262679
            Fragments requiring post_linkstub offs :            959769
                  Fragments deleted for any reason :           8787489
         Fragments deleted on thread/process death :             82416
    Fragments deleted for munmap or RO consistency :           8705073
                                       IBT resizes :               204
                      BB fragments targeted by IBL :           1663190
                      Exits due to IBL cold misses :           2181615
               Fragments regenerated or duplicated :           8703280
                        Recreated fragments, total :            320892
                     Recreations via app re-decode :            320890
                   Recreation spill value restores :                 2
                           Branches linked, direct :           8446906
                         Branches linked, indirect :           2074225
                               Fcache exits, total :          10728303
              Fcache exits, system call executions :              2513
             Fcache exits, flushed due to code mod :             41801
                              Fcache exits, asynch :                56
                     Fcache exits, proactive reset :            160800
                            Fcache exits, from BBs :          10523133
             Fcache exits, total indirect branches :           3844805
         Fcache exits, non-trace indirect branches :           3844805
             Fcache exits, ind target not in cache :           2181615
   Fcache exits, ind target in cache but not table :           1663190
             Fcache exits, from BB, ind target ... :           1663190
              Fcache exits, BB->BB, ind target ... :           1663190
             Fcache exits, dir target not in cache :           6605893
                    Fcache exits, link not allowed :             69570
           Fcache exits, non-ignorable system call :              2864
          Fcache exits, no link shared <-> private :             69570
        Fcache exits, presumed link race condition :                 1
                   Fcache bb peak capacity (bytes) :            303104
                       Fcache bb peak used (bytes) :             68644
            Fcache shared bb peak capacity (bytes) :           8601600
               Fcache shared bb space used (bytes) :        -909305952
                Fcache shared bb peak used (bytes) :           5003156
              Fcache shared bb empty space (bytes) :          40787788
               Fcache shared bb free coalesce prev :             46742
               Fcache shared bb free coalesce next :           7596301
                      Fcache shared bb return last :            389245
           Fcache shared bb free use larger bucket :           7672803
                       Fcache shared bb free split :           7561820
             Peak fcache combined capacity (bytes) :           8904704
                    Peak fcache units on live list :               223
                    Peak fcache units on free list :                15
                               Fcache unit lookups :          26444866
                           Peak special heap units :                73
                Peak special heap capacity (bytes) :            299008
                      Peak heap units on live list :               367
                      Peak heap units on free list :                18
                     Peak heap align space (bytes) :            165188
                Peak heap bucket pad space (bytes) :             54984
                            Heap allocs in buckets :         203061461
                        Heap allocs variable-sized :           3043932
                             Total reserved memory :            131072
                        Peak total reserved memory :          54665216
               Guard pages, reserved virtual pages :                 4
          Peak guard pages, reserved virtual pages :              1914
                    Current stack capacity (bytes) :            114688
                       Peak stack capacity (bytes) :           6635520
                             Mmap capacity (bytes) :          -3895296
                        Peak mmap capacity (bytes) :           9801728
           Mmap reserved but not committed (bytes) :          -9809920
      Peak mmap reserved but not committed (bytes) :           3948544
                     Current heap capacity (bytes) :            299008
                        Peak heap capacity (bytes) :          19300352
           Heap reserved but not committed (bytes) :           2691072
      Peak heap reserved but not committed (bytes) :           9412608
              Current total memory from OS (bytes) :            131072
                 Peak total memory from OS (bytes) :          54665216
           Current vmm blocks for unreachable heap :               192
              Peak vmm blocks for unreachable heap :              6070
                      Current vmm blocks for stack :                32
                         Peak vmm blocks for stack :              1914
      Peak vmm blocks for unreachable special heap :               876
      Peak vmm blocks for unreachable special mmap :               511
             Current vmm blocks for reachable heap :                16
                Peak vmm blocks for reachable heap :               608
                         Peak vmm blocks for cache :              3568
        Peak vmm blocks for reachable special mmap :                 7
                  Our virtual memory blocks in use :               240
             Peak our virtual memory blocks in use :             13554
             Allocations using multiple vmm blocks :               963
                Blocks used for multi-block allocs :             14078
         Current vmm virtual memory in use (bytes) :            983040
            Peak vmm virtual memory in use (bytes) :          55517184
                      Total times mutexes acquired :         387772089
                     Total times mutexes contended :            149674
                              Number of safe reads :            361704
      Number of vmarea vector resize reallocations :                 5
                         Peak vmarea vector length :               387
                   Peak dynamo areas vector length :                 1
               Peak executable areas vector length :                18
                           Bytes shared frags ever :         869128496
                               Bytes bb frags ever :             64716
                                  Num synch yields :            103654
              Num synch loops in wait_at_safe_spot :               850
   Multiple setcontexts while in wait_at_safe_spot :               537
            Load-exclusive instrs converted to CAS :             87216
           Store-exclusive instrs converted to CAS :             87198
(End) All statistics
AssadHashmi commented 2 years ago

@AssadHashmi is there any ongoing work looking at AArch64 overhead? There seem to be several cases of significant overhead for DR on AArch64 compared to DR on x86: this issue, #4279.

There isn't any ongoing work on AArch64 overhead ATM.

AssadHashmi commented 2 years ago

heavyweight. @AssadHashmi -- could this be skipped for brand new code in a fresh mmap? Xref ARM machine_cache_sync is not thread-safe #1911.

Yes AFAICT but I don't have much experience in that area. I'll set aside time to have a proper look at it. This issue was raised a while ago and is still open: https://github.com/DynamoRIO/dynamorio/issues/2065

derekbruening commented 2 years ago

Definitely there are many more fragments deleted and re-recreated on AArch64. It seems to be deleting and re-generating over and over and over. x86:

           Writable code regions we made read-only :              3296
            Write faults on read-only code regions :              3300
                         Cache consistency flushes :              3299
    Cache consistency flushes that flushed nothing :                 4
            Cache consistency non-code nop flushes :                 8
          Flushes that flushed >=1 shared fragment :              3295  <-- unlink flush only
                 Fragments generated, bb and trace :            167699   
                  Fragments deleted for any reason :            167699
    Fragments deleted for munmap or RO consistency :             73113
               Fragments regenerated or duplicated :             71560
                     Fcache exits, proactive reset :               188

aarch64:

                         Cache consistency flushes :             41799
            Cache consistency flushes via synchall :             41799
            Cache consistency non-code nop flushes :                 2
                 Fragments generated, bb and trace :           8787489
                  Fragments deleted for any reason :           8787489
    Fragments deleted for munmap or RO consistency :           8705073
               Fragments regenerated or duplicated :           8703280
                     Fcache exits, proactive reset :            160800   <--- includes flush synchall

Suggestions:

I compiled the ReadWrite.java from above on an a64 machine but did not have enough time to fully analyze it. Up front I hit:

I killed it at -loglevel 3 after maybe a minute and at fist glance saw a whole bunch of empty synchall flushes: so not yet seeing the deletions in the stats above; maybe that comes later in a long run.

Not sure when I will have time to look at this again so if someone else wants to continue analyzing please do so.

kuhanov commented 2 years ago

just interesing thing. If I set -private_bb_ibl_targets_init to 16 (BB IBL tables initial size) when the fast 6 sec 40 mb sample takes more than 1 minute to finish.

kuhanov commented 2 years ago

try 7mb file for debug and release release took 22 secs and Basic block fragments generated : 317478 debug took 16 secs and Basic block fragments generated : 120560