mpirvu opened 5 years ago
As you know, there's a certain fixed overhead for any thread (native stack, Java stack, J9VMThread structures, etc.). Is the increase above and beyond that fixed overhead?
Some stats:
1 comp thread ==> 221080 KB
2 comp threads ==> 223417 KB
4 comp threads ==> 228234 KB
7 comp threads ==> 234520 KB
That is roughly 2 MB for each additional compilation thread. From my previous experience, the native stack does not contribute that much to RSS, and the Java stack for a compilation thread should be negligible.
I collected smaps, javacores and coredumps for two configurations: (1) one compilation thread and (2) seven compilation threads.
The javacore data shows 6 MB more virtual memory being used by the 7-compThread config, and that difference comes from the native stack. This is expected because compilation threads have a 1 MB stack each and the 7-compThread config has 6 extra compilation threads (6 × 1 MB = 6,291,456 bytes, which matches the Native Stack difference below: 28,966,912 − 22,675,456 bytes).
1 compilation thread:
1MEMUSER JRE: 849,435,232 bytes / 12629 allocations
2MEMUSER +--VM: 546,750,696 bytes / 6890 allocations
3MEMUSER | +--Threads: 25,424,808 bytes / 485 allocations
4MEMUSER | | +--Java Stack: 1,979,352 bytes / 80 allocations
4MEMUSER | | +--Native Stack: 22,675,456 bytes / 81 allocations
4MEMUSER | | +--Other: 770,000 bytes / 324 allocations
7 compilation threads:
1MEMUSER JRE: 854,759,824 bytes / 12876 allocations
2MEMUSER +--VM: 552,988,592 bytes / 7044 allocations
3MEMUSER | +--Threads: 31,825,104 bytes / 503 allocations
4MEMUSER | | +--Java Stack: 2,102,096 bytes / 86 allocations
4MEMUSER | | +--Native Stack: 28,966,912 bytes / 87 allocations
4MEMUSER | | +--Other: 756,096 bytes / 330 allocations
However, what we care about is the resident set size (RSS). I wrote a program that matches the information from smaps with the information from javacores and the coredump; the results are below.
1-compThread
Totals: Virtual= 5024184 KB; RSS= 226000 KB
GC heap: Virtual= 99072 KB; RSS= 98660 KB
CodeCache: Virtual= 262144 KB; RSS= 10240 KB
DataCache: Virtual= 6144 KB; RSS= 5934 KB
DLL: Virtual= 119536 KB; RSS= 19616 KB
Stack: Virtual= 30280 KB; RSS= 3328 KB
SCC: Virtual= 0 KB; RSS= 0 KB
JITScratch: Virtual= 0 KB; RSS= 0 KB
JITPersist: Virtual= 12288 KB; RSS= 11738 KB
Internal: Virtual= 0 KB; RSS= 0 KB
Classes: Virtual= 46636 KB; RSS= 32573 KB
CallSites: Virtual= 26917 KB; RSS= 23764 KB
Unknown: Virtual= 39328 KB; RSS= 17803 KB
Not covered: Virtual= 4378052 KB; RSS= 2340 KB
7 compThreads
Totals: Virtual= 5032404 KB; RSS= 239892 KB
GC heap: Virtual= 99008 KB; RSS= 98652 KB
CodeCache: Virtual= 262144 KB; RSS= 10372 KB
DataCache: Virtual= 6144 KB; RSS= 5938 KB
DLL: Virtual= 119536 KB; RSS= 19996 KB
Stack: Virtual= 36448 KB; RSS= 2887 KB
SCC: Virtual= 0 KB; RSS= 0 KB
JITScratch: Virtual= 0 KB; RSS= 0 KB
JITPersist: Virtual= 11264 KB; RSS= 11112 KB
Internal: Virtual= 0 KB; RSS= 0 KB
Classes: Virtual= 46508 KB; RSS= 32426 KB
CallSites: Virtual= 27275 KB; RSS= 34936 KB
Unknown: Virtual= 42447 KB; RSS= 21233 KB
Not covered: Virtual= 4367428 KB; RSS= 2336 KB
As one can see, the 7-compThreads configuration uses 13 MB more RSS, and the difference does not come from the stacks but rather from the CallSites category. Indeed, if we look at how the smaps entries are covered by stack regions, we see that the stacks contribute little to RSS. Some examples from the 7-compThread config:
MemEntry: Start=00007f99d85ee000 End=00007f99d86ee000 Size= 1024 rss= 32 Prot=rw-p
Covering segments/call-sites:
ThreadName="JIT Compilation Thread-006 Suspended" Start=00007f99d85ee000 End=00007f99d86ee000 size= 1024 KB
MemEntry: Start=00007f99d86ef000 End=00007f99d87ef000 Size= 1024 rss= 32 Prot=rw-p
Covering segments/call-sites:
ThreadName="JIT Compilation Thread-005" Start=00007f99d86ef000 End=00007f99d87ef000 size= 1024 KB
MemEntry: Start=00007f99d87f0000 End=00007f99d88f0000 Size= 1024 rss= 32 Prot=rw-p
Covering segments/call-sites:
ThreadName="JIT Compilation Thread-004 Suspended" Start=00007f99d87f0000 End=00007f99d88f0000 size= 1024 KB
MemEntry: Start=00007f99d88f1000 End=00007f99d89f1000 Size= 1024 rss= 32 Prot=rw-p
Covering segments/call-sites:
ThreadName="JIT Compilation Thread-003 Suspended" Start=00007f99d88f1000 End=00007f99d89f1000 size= 1024 KB
MemEntry: Start=00007f99d89f2000 End=00007f99d8af2000 Size= 1024 rss= 32 Prot=rw-p
Covering segments/call-sites:
ThreadName="JIT Compilation Thread-002 Suspended" Start=00007f99d89f2000 End=00007f99d8af2000 size= 1024 KB
MemEntry: Start=00007f99d8af2000 End=00007f99d8af3000 Size= 4 rss= 0 Prot=---p
MemEntry: Start=00007f99d8af3000 End=00007f99d8bf3000 Size= 1024 rss= 32 Prot=rw-p
Covering segments/call-sites:
ThreadName="JIT Compilation Thread-001 Suspended" Start=00007f99d8af3000 End=00007f99d8bf3000 size= 1024 KB
MemEntry: Start=00007f99d9bf4000 End=00007f99d9cf4000 Size= 1024 rss= 32 Prot=rw-p
Covering segments/call-sites:
ThreadName="JIT Compilation Thread-000 Suspended" Start=00007f99d9bf4000 End=00007f99d9cf4000 size= 1024 KB
With DDR we can print the allocations from the callsites sorted by total allocation size:
1-compThread
total alloc | largest
blocks| bytes | bytes | callsite
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------
758 48741720 2097152 segment.c:238
1109 17885368 32768 segment.c:233
278 5548880 32768 CL:326
1 3145728 3145728 CL:186
3 2478048 826016 CopyScanCacheChunk.cpp:38
80 1974232 45248 vmthread.c:1378
1018 1693504 38632 /home/mpirvu/JITaaS/openj9/runtime/compiler/../compiler/runtime/MethodMetaData.c:149
290 1178560 4064 zipcache.c:879
157 873216 8208 ../common/unsafe_mem.c:241
154 794352 276336 CL:671
190 720704 75128 StringTable.cpp:88
82 680272 8296 trclog.c:1022
82 655456 8192 ConfigurationStandard.cpp:273
1 655360 655360 BufferManager.cpp:41
1 651864 651864 jvminit.c:6432
5 541920 108384 WorkPackets.cpp:179
1 524288 524288 ClassFileParser.cpp:78
.............
7-compThreads
total alloc | largest
blocks| bytes | bytes | callsite
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------
756 47564416 2097152 segment.c:238
1111 17882856 32768 segment.c:233
278 5548880 32768 CL:326
1 3145728 3145728 CL:186
3 2478048 826016 CopyScanCacheChunk.cpp:38
86 2096592 45280 vmthread.c:1378
1066 1763560 27136 /home/mpirvu/JITaaS/openj9/runtime/compiler/../compiler/runtime/MethodMetaData.c:149
290 1178560 4064 zipcache.c:879
157 873216 8208 ../common/unsafe_mem.c:241
154 789136 276336 CL:671
88 730048 8296 trclog.c:1022
190 720704 75128 StringTable.cpp:88
88 704608 8192 ConfigurationStandard.cpp:273
1 655360 655360 BufferManager.cpp:41
1 651864 651864 jvminit.c:6432
5 541920 108384 WorkPackets.cpp:179
1 524288 524288 ClassFileParser.cpp:78
.............
I wrote a script to compare the allocations, but I am way short of the 13 MB I am looking for. As seen below, the biggest difference comes from vmthread.c:1378, but that accounts for only 122360 bytes.
('vmthread.c:121', -17280)
('jswalk.c:1611', -20480)
('vmthread.c:228', -25152)
('jithash.cpp:273', -32776)
('ConfigurationStandard.cpp:273', -49152)
('trclog.c:1022', -49776)
('/home/mpirvu/JITaaS/openj9/runtime/compiler/../compiler/runtime/MethodMetaData.c:149', -70056)
('vmthread.c:1378', -122360)
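For completeness, here is a minimal sketch of the kind of comparison the script above performs. The assumptions are mine: the two DDR tables were saved as plain text and every data row has the form "blocks bytes largest callsite"; the sketch prints per-callsite byte deltas, largest increase first.

```cpp
#include <algorithm>
#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Read a saved DDR callsite table; keep only rows that parse as "blocks bytes largest callsite".
static std::map<std::string, long long> readCallsites(const std::string &path) {
    std::map<std::string, long long> bytesPerCallsite;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream row(line);
        long long blocks, bytes, largest;
        std::string callsite;
        if (row >> blocks >> bytes >> largest >> callsite)   // header/separator rows fail to parse
            bytesPerCallsite[callsite] += bytes;
    }
    return bytesPerCallsite;
}

int main(int argc, char **argv) {
    if (argc < 3) {
        std::cerr << "usage: diffcallsites <baseline.txt> <candidate.txt>\n";
        return 1;
    }
    std::map<std::string, long long> base = readCallsites(argv[1]);   // e.g. the 1-compThread dump
    std::map<std::string, long long> cand = readCallsites(argv[2]);   // e.g. the 7-compThread dump

    std::vector<std::pair<std::string, long long> > diffs;
    for (const auto &kv : cand)
        diffs.push_back(std::make_pair(kv.first, kv.second - base[kv.first])); // absent => 0
    for (const auto &kv : base)
        if (cand.count(kv.first) == 0)
            diffs.push_back(std::make_pair(kv.first, -kv.second));             // only in baseline

    std::sort(diffs.begin(), diffs.end(),
              [](const std::pair<std::string, long long> &a,
                 const std::pair<std::string, long long> &b) { return a.second > b.second; });
    for (const auto &d : diffs)
        if (d.second != 0)
            std::cout << d.first << " " << d.second << "\n";
    return 0;
}
```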
As discussed in #12659, most of this per-thread memory footprint comes from the malloc heap. Data reported by malloc_stats in my experiments is consistent with what is reported here: 1-2 MB per server thread. This is the memory that malloc "holds on to" (does not release to the OS) in per-thread arenas. In order to avoid this additional memory footprint, we need to stop using std::string and std::vector with default allocators in JITServer message serialization/deserialization, client session data, etc., and use scratch or persistent allocations instead.
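To illustrate the arena behavior described here, a small glibc-specific sketch (the thread count and allocation sizes are arbitrary) that churns allocations on several worker threads and prints malloc_stats() before and after; the second report typically shows extra arenas that keep the freed memory instead of returning it to the OS.

```cpp
#include <malloc.h>   // malloc_stats() is glibc-specific
#include <cstdlib>
#include <thread>
#include <vector>

// Allocate and free a few MB from a worker thread; under contention glibc tends to
// give such threads their own malloc arena.
static void churn() {
    std::vector<void *> blocks;
    for (int i = 0; i < 2048; ++i)
        blocks.push_back(std::malloc(4096));
    for (void *p : blocks)
        std::free(p);
}

int main() {
    malloc_stats();                      // arena statistics (printed to stderr) before the workers run
    std::vector<std::thread> workers;
    for (int i = 0; i < 7; ++i)          // 7, to mirror the 7-compThread configuration
        workers.emplace_back(churn);
    for (std::thread &t : workers)
        t.join();
    malloc_stats();                      // extra arenas typically still hold the freed memory here
    return 0;
}
```

Tuning the glibc MALLOC_ARENA_MAX environment variable or calling malloc_trim() can sometimes reduce that retention, but the longer-term fix suggested above is to avoid default-allocator STL containers on these paths.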
> we need to stop using std::string
There was a bug in the STL (I don't know exactly which version of the STL, probably the one that corresponds to gcc 4.8) where you couldn't use std::string with a custom allocator, because somewhere in its internals it would try to instantiate the custom allocator via the allocator's default constructor; however, TR::typed_allocator uses explicit for its constructors, so you'd end up with a build error. Given that we're on gcc 7+ now, it might not be a problem anymore.
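A minimal sketch of the pattern in question, assuming nothing about the real TR::typed_allocator: RawAllocator below is a made-up stateful allocator whose only user-visible constructor is explicit, used with std::basic_string and std::vector. The older libstdc++ described above rejected this because it tried to default-construct the allocator internally; with gcc 7+ it compiles.

```cpp
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

// Illustrative stand-in for a stateful allocator; not TR::typed_allocator.
template <typename T>
class RawAllocator {
public:
    typedef T value_type;

    explicit RawAllocator(int region) : _region(region) {}   // explicit, and no default constructor
    template <typename U>
    RawAllocator(const RawAllocator<U> &other) : _region(other.region()) {}

    T *allocate(std::size_t n) { return static_cast<T *>(std::malloc(n * sizeof(T))); }
    void deallocate(T *p, std::size_t) { std::free(p); }
    int region() const { return _region; }

private:
    int _region;
};

template <typename T, typename U>
bool operator==(const RawAllocator<T> &a, const RawAllocator<U> &b) { return a.region() == b.region(); }
template <typename T, typename U>
bool operator!=(const RawAllocator<T> &a, const RawAllocator<U> &b) { return !(a == b); }

typedef std::basic_string<char, std::char_traits<char>, RawAllocator<char> > CustomString;

int main() {
    RawAllocator<char> charAlloc(1);
    // Older libstdc++ tried to default-construct the allocator inside basic_string,
    // which cannot compile when the only constructor is explicit and takes an argument.
    CustomString s("compilation thread", charAlloc);

    RawAllocator<int> intAlloc(1);
    std::vector<int, RawAllocator<int> > v(intAlloc);
    v.push_back(42);

    std::cout << s << " " << v.back() << "\n";
    return 0;
}
```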
I observed that footprint during the steady state of AcmeAir increases with the number of compilation threads. This is unexpected because, once a compilation is over, we free all scratch memory. Moreover, persistent memory should not depend on the number of compilation threads. If we understand this issue, we might be able to reduce the footprint somewhat. The first step would be to determine what kind of memory increases when more compilation threads are used.
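As one possible starting point for that first step, here is a minimal sketch (names and the one-second interval are mine) that samples VmRSS for a given pid from /proc, so steady-state footprint can be compared between runs started with different -XcompilationThreads values.

```cpp
#include <chrono>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <thread>

// Read the "VmRSS:" line from /proc/<pid>/status; the value is reported in kB.
static long readVmRSSKB(const std::string &pid) {
    std::ifstream status("/proc/" + pid + "/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.rfind("VmRSS:", 0) == 0) {
            std::istringstream fields(line.substr(6));
            long kb = 0;
            fields >> kb;
            return kb;
        }
    }
    return -1;   // process gone or not readable
}

int main(int argc, char **argv) {
    if (argc < 2) {
        std::cerr << "usage: rsswatch <pid>\n";
        return 1;
    }
    for (;;) {
        long kb = readVmRSSKB(argv[1]);
        if (kb < 0)
            break;
        std::cout << kb << " KB\n";
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    return 0;
}
```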