eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.

Investigation into why footprint increases with number of compilation threads #4299

Open mpirvu opened 5 years ago

mpirvu commented 5 years ago

I observed that footprint during the steady state of AcmeAir increases with the number of compilation threads. This is unexpected because, once a compilation is over, we free all of its scratch memory. Moreover, persistent memory should not depend on the number of compilation threads. If we understand this issue, we might be able to reduce footprint somewhat. The first step is to determine what kind of memory increases when more compilation threads are used.

DanHeidinga commented 5 years ago

As you know, there's a certain fixed overhead for any thread (native stack, java stack, J9VMThread structures, etc). Is the increase above and beyond that fixed overhead?

mpirvu commented 5 years ago

Some stats:

1 comp thread  ==> 221080 KB
2 comp threads ==> 223417 KB
4 comp threads ==> 228234 KB
7 comp threads ==> 234520 KB

That is roughly 2 MB for each compilation thread. From my previous experience, the native stack does not contribute that much to RSS, and the Java stack for a compilation thread should be negligible.
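
(For reference: (234520 − 221080) KB / 6 extra threads = 2240 KB per additional compilation thread.)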

mpirvu commented 5 years ago

I collected smaps, javacores and coredumps for two configurations: (1) one compilation thread, and (2) seven compilation threads.

The javacore data shows 6 MB more virtual memory being used by the 7-compThread config, and that difference comes from the native stack. This is expected because each compilation thread has a 1 MB stack and this config has 6 extra compilation threads.

1 compilation thread:

1MEMUSER       JRE: 849,435,232 bytes / 12629 allocations
2MEMUSER       +--VM: 546,750,696 bytes / 6890 allocations
3MEMUSER       |  +--Threads: 25,424,808 bytes / 485 allocations
4MEMUSER       |  |  +--Java Stack: 1,979,352 bytes / 80 allocations
4MEMUSER       |  |  +--Native Stack: 22,675,456 bytes / 81 allocations
4MEMUSER       |  |  +--Other: 770,000 bytes / 324 allocations

7 compilation threads:

1MEMUSER       JRE: 854,759,824 bytes / 12876 allocations
2MEMUSER       +--VM: 552,988,592 bytes / 7044 allocations
3MEMUSER       |  +--Threads: 31,825,104 bytes / 503 allocations
4MEMUSER       |  |  +--Java Stack: 2,102,096 bytes / 86 allocations
4MEMUSER       |  |  +--Native Stack: 28,966,912 bytes / 87 allocations
4MEMUSER       |  |  +--Other: 756,096 bytes / 330 allocations
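
The Native Stack delta matches exactly: 28,966,912 − 22,675,456 = 6,291,456 bytes = 6 × 1 MB, i.e. one 1 MB stack per extra compilation thread.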

However, what we care about is the resident set size (RSS). I wrote a program that matches the information from smaps against the information from javacores and coredumps (a sketch of the smaps-parsing side appears after the examples further below). The results are:

1-compThread

Totals:       Virtual=  5024184 KB; RSS=   226000 KB
    GC heap:  Virtual=    99072 KB; RSS=    98660 KB
  CodeCache:  Virtual=   262144 KB; RSS=    10240 KB
  DataCache:  Virtual=     6144 KB; RSS=     5934 KB
        DLL:  Virtual=   119536 KB; RSS=    19616 KB
      Stack:  Virtual=    30280 KB; RSS=     3328 KB
        SCC:  Virtual=        0 KB; RSS=        0 KB
 JITScratch:  Virtual=        0 KB; RSS=        0 KB
 JITPersist:  Virtual=    12288 KB; RSS=    11738 KB
   Internal:  Virtual=        0 KB; RSS=        0 KB
    Classes:  Virtual=    46636 KB; RSS=    32573 KB
  CallSites:  Virtual=    26917 KB; RSS=    23764 KB
    Unknown:  Virtual=    39328 KB; RSS=    17803 KB
Not covered:  Virtual=  4378052 KB; RSS=     2340 KB

7 compThreads

Totals:       Virtual=  5032404 KB; RSS=   239892 KB
    GC heap:  Virtual=    99008 KB; RSS=    98652 KB
  CodeCache:  Virtual=   262144 KB; RSS=    10372 KB
  DataCache:  Virtual=     6144 KB; RSS=     5938 KB
        DLL:  Virtual=   119536 KB; RSS=    19996 KB
      Stack:  Virtual=    36448 KB; RSS=     2887 KB
        SCC:  Virtual=        0 KB; RSS=        0 KB
 JITScratch:  Virtual=        0 KB; RSS=        0 KB
 JITPersist:  Virtual=    11264 KB; RSS=    11112 KB
   Internal:  Virtual=        0 KB; RSS=        0 KB
    Classes:  Virtual=    46508 KB; RSS=    32426 KB
  CallSites:  Virtual=    27275 KB; RSS=    34936 KB
    Unknown:  Virtual=    42447 KB; RSS=    21233 KB
Not covered:  Virtual=  4367428 KB; RSS=     2336 KB

As one can see, the 7-compThreads configuration uses about 13 MB more RSS (239892 − 226000 = 13892 KB), and the difference does not come from the stacks but rather from the CallSites category (34936 − 23764 = 11172 KB of that delta). Indeed, if we look at how the smaps regions are covered by stack segments, we see that the stacks contribute little to RSS. Some examples from the 7-compThread config:

MemEntry: Start=00007f99d85ee000 End=00007f99d86ee000 Size=  1024 rss=    32 Prot=rw-p
    Covering segments/call-sites:
         ThreadName="JIT Compilation Thread-006 Suspended" Start=00007f99d85ee000 End=00007f99d86ee000 size= 1024 KB

MemEntry: Start=00007f99d86ef000 End=00007f99d87ef000 Size=  1024 rss=    32 Prot=rw-p
    Covering segments/call-sites:
         ThreadName="JIT Compilation Thread-005" Start=00007f99d86ef000 End=00007f99d87ef000 size= 1024 KB

MemEntry: Start=00007f99d87f0000 End=00007f99d88f0000 Size=  1024 rss=    32 Prot=rw-p
    Covering segments/call-sites:
         ThreadName="JIT Compilation Thread-004 Suspended" Start=00007f99d87f0000 End=00007f99d88f0000 size= 1024 KB

MemEntry: Start=00007f99d88f1000 End=00007f99d89f1000 Size=  1024 rss=    32 Prot=rw-p
    Covering segments/call-sites:
         ThreadName="JIT Compilation Thread-003 Suspended" Start=00007f99d88f1000 End=00007f99d89f1000 size= 1024 KB

MemEntry: Start=00007f99d89f2000 End=00007f99d8af2000 Size=  1024 rss=    32 Prot=rw-p
    Covering segments/call-sites:
         ThreadName="JIT Compilation Thread-002 Suspended" Start=00007f99d89f2000 End=00007f99d8af2000 size= 1024 KB
MemEntry: Start=00007f99d8af2000 End=00007f99d8af3000 Size=     4 rss=     0 Prot=---p

MemEntry: Start=00007f99d8af3000 End=00007f99d8bf3000 Size=  1024 rss=    32 Prot=rw-p
    Covering segments/call-sites:
         ThreadName="JIT Compilation Thread-001 Suspended" Start=00007f99d8af3000 End=00007f99d8bf3000 size= 1024 KB

MemEntry: Start=00007f99d9bf4000 End=00007f99d9cf4000 Size=  1024 rss=    32 Prot=rw-p
    Covering segments/call-sites:
         ThreadName="JIT Compilation Thread-000 Suspended" Start=00007f99d9bf4000 End=00007f99d9cf4000 size= 1024 KB
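
As promised above, here is a minimal C++ sketch of the smaps-parsing side of such a matching tool. It is not the author's actual program: it assumes the Linux /proc/<pid>/smaps layout and leaves the actual classification against javacore/coredump segment ranges as a comment.

#include <cctype>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main(int argc, char **argv) {
    // Which process to inspect; defaults to the current one.
    std::string pid = (argc > 1) ? argv[1] : "self";
    std::ifstream smaps("/proc/" + pid + "/smaps");

    std::string line, range, perms;
    while (std::getline(smaps, line)) {
        if (!line.empty() && std::isxdigit(static_cast<unsigned char>(line[0]))
            && line.find('-') < line.find(' ')) {
            // Mapping header: "<start>-<end> <perms> <offset> <dev> <inode> [path]"
            std::istringstream hdr(line);
            hdr >> range >> perms;
        } else if (line.compare(0, 4, "Rss:") == 0) {
            // "Rss: <n> kB" field belongs to the last mapping header seen.
            long rssKb = 0;
            std::istringstream(line.substr(4)) >> rssKb;
            std::cout << "MemEntry: " << range << " Prot=" << perms
                      << " rss=" << rssKb << " KB\n";
            // A real tool would now match this range against the segment and
            // thread-stack address ranges extracted from the javacore
            // (GC heap, code cache, JIT scratch/persistent, stacks, ...).
        }
    }
    return 0;
}
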
mpirvu commented 5 years ago

With DDR (OpenJ9's Direct Dump Reader tooling) we can print the allocations from the call sites, sorted by total allocation size:

1-compThread

 blocks | total bytes | largest bytes | callsite
--------+-------------+---------------+----------
    758 48741720 2097152 segment.c:238
   1109 17885368   32768 segment.c:233
    278 5548880   32768 CL:326
      1 3145728 3145728 CL:186
      3 2478048  826016 CopyScanCacheChunk.cpp:38
     80 1974232   45248 vmthread.c:1378
   1018 1693504   38632 /home/mpirvu/JITaaS/openj9/runtime/compiler/../compiler/runtime/MethodMetaData.c:149
    290 1178560    4064 zipcache.c:879
    157  873216    8208 ../common/unsafe_mem.c:241
    154  794352  276336 CL:671
    190  720704   75128 StringTable.cpp:88
     82  680272    8296 trclog.c:1022
     82  655456    8192 ConfigurationStandard.cpp:273
      1  655360  655360 BufferManager.cpp:41
      1  651864  651864 jvminit.c:6432
      5  541920  108384 WorkPackets.cpp:179
      1  524288  524288 ClassFileParser.cpp:78
.............

7-compThreads

 blocks | total bytes | largest bytes | callsite
--------+-------------+---------------+----------
    756 47564416 2097152 segment.c:238
   1111 17882856   32768 segment.c:233
    278 5548880   32768 CL:326
      1 3145728 3145728 CL:186
      3 2478048  826016 CopyScanCacheChunk.cpp:38
     86 2096592   45280 vmthread.c:1378
   1066 1763560   27136 /home/mpirvu/JITaaS/openj9/runtime/compiler/../compiler/runtime/MethodMetaData.c:149
    290 1178560    4064 zipcache.c:879
    157  873216    8208 ../common/unsafe_mem.c:241
    154  789136  276336 CL:671
     88  730048    8296 trclog.c:1022
    190  720704   75128 StringTable.cpp:88
     88  704608    8192 ConfigurationStandard.cpp:273
      1  655360  655360 BufferManager.cpp:41
      1  651864  651864 jvminit.c:6432
      5  541920  108384 WorkPackets.cpp:179
      1  524288  524288 ClassFileParser.cpp:78
.............

I wrote a script to compare the allocations between the two configurations, but I am way short of the 13 MB I am looking for. As seen below, the biggest difference comes from vmthread.c:1378, but that accounts for only 122360 bytes (see the note after the list).

('vmthread.c:121', -17280)
('jswalk.c:1611', -20480)
('vmthread.c:228', -25152)
('jithash.cpp:273', -32776)
('ConfigurationStandard.cpp:273', -49152)
('trclog.c:1022', -49776)
('/home/mpirvu/JITaaS/openj9/runtime/compiler/../compiler/runtime/MethodMetaData.c:149', -70056)
('vmthread.c:1378', -122360)
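
Note that the largest entry lines up with the thread count: at vmthread.c:1378 the block count goes from 80 to 86, i.e. one extra allocation per extra compilation thread, totalling 2096592 − 1974232 = 122360 bytes.
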
AlexeyKhrabrov commented 3 years ago

As discussed in #12659, most of this per-thread memory footprint comes from the malloc heap. Data reported by malloc_stats in my experiments is consistent with what is reported here: 1-2 MB per server thread. This is the memory that malloc "holds on to" (does not release to the OS) in per-thread arenas. In order to avoid this additional memory footprint, we need to stop using std::string and std::vector with default allocators in JITServer message serialization/deserialization, client session data, etc., and use scratch or persistent allocations instead.
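
To make the per-thread arena behavior concrete, here is a small glibc-specific sketch (mallopt, malloc_stats and M_ARENA_MAX are glibc APIs). The 1-arena cap shown in the comment is only illustrative of the trade-off; the fix proposed above is to route these containers to scratch/persistent allocations instead.

#include <malloc.h>   // mallopt, malloc_stats, M_ARENA_MAX (glibc-specific)
#include <pthread.h>
#include <cstdlib>

static void *churn(void *) {
    // Simulate a worker thread doing many transient allocations, e.g.
    // std::string/std::vector traffic during message (de)serialization.
    for (int i = 0; i < 100000; ++i) {
        void *p = std::malloc(1024);
        std::free(p);
    }
    return nullptr;
}

int main() {
    // Uncomment to force all threads onto a single arena; this trades the
    // per-thread arena footprint for extra lock contention inside malloc.
    // mallopt(M_ARENA_MAX, 1);

    pthread_t threads[7];
    for (auto &t : threads) pthread_create(&t, nullptr, churn, nullptr);
    for (auto &t : threads) pthread_join(t, nullptr);

    // Prints per-arena statistics to stderr; with the default arena policy
    // the arenas retain memory even though everything above was freed.
    malloc_stats();
    return 0;
}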

dsouzai commented 3 years ago

> we need to stop using std::string

There was a bug in the STL (I don't know exactly which version, probably the one that corresponds to gcc 4.8) where you couldn't use std::string with a custom allocator: somewhere in its internals, it would try to instantiate the custom allocator via the allocator's default constructor. However, TR::typed_allocator uses explicit for its constructors, so you'd end up with a build error.

Given that we're on gcc 7+ now, it might not be a problem anymore.
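
For context, here is a minimal sketch of the pattern in question, using a hypothetical region-based allocator loosely modeled on the TR::typed_allocator shape described above (explicit constructors, no default constructor); it is not the real class.

#include <cstddef>
#include <string>

template <typename T>
struct RegionAllocator {
    using value_type = T;

    // No default constructor, and the constructors are explicit,
    // mirroring the situation described in the comment above.
    explicit RegionAllocator(void *region) : region_(region) {}
    template <typename U>
    explicit RegionAllocator(const RegionAllocator<U> &other) : region_(other.region_) {}

    T *allocate(std::size_t n) {
        // A real scratch allocator would carve bytes out of region_;
        // plain operator new keeps the sketch self-contained.
        return static_cast<T *>(::operator new(n * sizeof(T)));
    }
    void deallocate(T *p, std::size_t) { ::operator delete(p); }

    void *region_;
};

template <typename T, typename U>
bool operator==(const RegionAllocator<T> &a, const RegionAllocator<U> &b) {
    return a.region_ == b.region_;
}
template <typename T, typename U>
bool operator!=(const RegionAllocator<T> &a, const RegionAllocator<U> &b) {
    return !(a == b);
}

using ScratchString =
    std::basic_string<char, std::char_traits<char>, RegionAllocator<char>>;

int main() {
    int region = 0;  // stand-in for a scratch memory region
    RegionAllocator<char> alloc(&region);
    // On the gcc 4.8-era COW basic_string, internals that tried to
    // default-construct or implicitly convert the allocator failed to
    // compile; on gcc 7+ this is expected to build and run.
    ScratchString s("hello", alloc);
    return s.size() == 5 ? 0 : 1;
}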