eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.28k stars 722 forks source link

Exception from MemoryUsage.<init> when running with 1G hugepages #18274

Closed JamesKingdon closed 1 year ago

JamesKingdon commented 1 year ago

Problem was reported on Java 11.0.21

The exception text is

Exception in thread "MemoryMXBean notification dispatcher" java.lang.IllegalArgumentException: committed value cannot be larger than the max value

Investigation indicates that the JIT allocated a 1G segment for the code cache which is reported for the 'committed' memory value, while the max value is 256MB.

mpirvu commented 1 year ago

@SajinaKandy Could you please look into this problem?

More info for reproducing: the system had 1GB large pages enabled. This page shows how to enable large pages: https://www.intel.com/content/www/us/en/docs/programmable/683840/1-2-1/enabling-hugepages.html

SajinaKandy commented 1 year ago

Relevant comments from the case: The latest dump narrows down the problem to the JIT component. One of the new tracepoints added by the JVM team only fires when there's a problem, and there are three instances of this tracepoint in the dump's trace buffers. All of them are very similar:

14:29:04.286204531 0x291000 j9jcl.533 Exception * JCL: processNotificationLoop memoryUsageError: GcName=global, MemoryPool=JIT code cache, initialSize=1073741824, preUsedSize=1312896, preCommittedSize==1073741824, preMaxSize=268435456, postUsedSize=1312896, postCommittedSize==1073741824, postMaxSize=268435456

Note that the pre/post GC committed sizes for the JIT code cache are larger than the maximum sizes, which doesn't make sense:

preCommittedSize = 1073741824
      preMaxSize = 268435456

postCommittedSize = 1073741824
      postMaxSize = 268435456

This is the cause of the IllegalArgumentException later that triggered the dump: java.lang.IllegalArgumentException: committed value cannot be larger than the max value

SajinaKandy commented 1 year ago

Tracepoints capturing the exception:

14:29:04.286175346 *0x0 j9mm.475             Event       GlobalGC end: workstackoverflow=0 overflowcount=0 memory=13903136/37945344
14:29:04.286202934 *0x291000 j9jcl.534            Event       JCL: processNotificationLoop memoryUsage: GcName=global, MemoryPool=tenured-SOA, initialSize=5976064, preUsedSize=21538192, preCommittedSize==22793216, preMaxSize=12109836288, postUsedSize=21104312, postCommittedSize==28398592, postMaxSize=12109836288
14:29:04.286203371  0x291000 j9jcl.534            Event       JCL: processNotificationLoop memoryUsage: GcName=global, MemoryPool=tenured-LOA, initialSize=315392, preUsedSize=0, preCommittedSize==1455104, preMaxSize=772968448, postUsedSize=0, postCommittedSize==1813504, postMaxSize=772968448
14:29:04.286203608  0x291000 j9jcl.534            Event       JCL: processNotificationLoop memoryUsage: GcName=global, MemoryPool=nursery-allocate, initialSize=1048576, preUsedSize=434096, preCommittedSize==5111808, preMaxSize=2129285120, postUsedSize=316456, postCommittedSize==5111808, postMaxSize=2129285120
14:29:04.286203815  0x291000 j9jcl.534            Event       JCL: processNotificationLoop memoryUsage: GcName=global, MemoryPool=nursery-survivor, initialSize=1048576, preUsedSize=2621440, preCommittedSize==2621440, preMaxSize=1091940352, postUsedSize=2621440, postCommittedSize==2621440, postMaxSize=1091940352
14:29:04.286204095  0x291000 j9jcl.534            Event       JCL: processNotificationLoop memoryUsage: GcName=global, MemoryPool=class storage, initialSize=0, preUsedSize=28502208, preCommittedSize==28997792, preMaxSize=-1, postUsedSize=28502208, postCommittedSize==28997792, postMaxSize=-1
14:29:04.286204360  0x291000 j9jcl.534            Event       JCL: processNotificationLoop memoryUsage: GcName=global, MemoryPool=miscellaneous non-heap storage, initialSize=17825792, preUsedSize=7360992, preCommittedSize==22020096, preMaxSize=-1, postUsedSize=7360992, postCommittedSize==22020096, postMaxSize=-1
14:29:04.286204531  0x291000 j9jcl.533            Exception  * JCL: processNotificationLoop memoryUsageError: GcName=global, MemoryPool=JIT code cache, initialSize=1073741824, preUsedSize=1312896, preCommittedSize==1073741824, preMaxSize=268435456, postUsedSize=1312896, postCommittedSize==1073741824, postMaxSize=268435456
14:29:04.286204705  0x291000 j9jcl.534            Event       JCL: processNotificationLoop memoryUsage: GcName=global, MemoryPool=JIT code cache, initialSize=1073741824, preUsedSize=1312896, preCommittedSize==1073741824, preMaxSize=268435456, postUsedSize=1312896, postCommittedSize==1073741824, postMaxSize=268435456
14:29:04.286204889  0x291000 j9jcl.534            Event       JCL: processNotificationLoop memoryUsage: GcName=global, MemoryPool=JIT data cache, initialSize=0, preUsedSize=2097152, preCommittedSize==2097152, preMaxSize=402653184, postUsedSize=2097152, postCommittedSize==2097152, postMaxSize=402653184
14:29:04.286205881  0x291000 j9mm.258             Exit       <MM_ConcurrentGC_internalPostCollect subspace=0x7f0d7c1e8060
14:29:04.286207918  0x291000 j9mm.75              Event       Concurrent collection end: newspace=4795352/7733248 oldspace=9107784/30212096 loa=1813504/1813504
14:29:04.286208200 *0x0 j9mm.75              Event       Concurrent collection end: newspace=4795352/7733248 oldspace=9107784/30212096 loa=1813504/1813504
SajinaKandy commented 1 year ago

When large page is enabled the code cache allocation happens as a multiple of the available page size. codeCacheSizeToAllocate = (codeCacheSizeToAllocate + (vmemParams.pageSize-1)) & (~(vmemParams.pageSize-1)); https://github.com/eclipse-openj9/openj9/blob/3060a8d8c95d628e78b1a77ac477c59e49db09f6/runtime/compiler/runtime/J9CodeCacheManager.cpp#L329

 if (largeCodePageSize > 0)
#endif
      {
      vmemParams.pageSize = largeCodePageSize;

https://github.com/eclipse-openj9/openj9/blob/3060a8d8c95d628e78b1a77ac477c59e49db09f6/runtime/compiler/runtime/J9CodeCacheManager.cpp#L296C[…]99C47

With large page size of 2M default allocation:

#CODECACHE:  TR::CodeCache::allocate : requesting 268435456 bytes
#CODECACHE:  TR::CodeCache::allocate : largeCodePageSize = 2097152

With large page size of 2M and -Xcodecachetotal1M -Xcodecache1M

#CODECACHE:  TR::CodeCache::allocate : requesting 2097152 bytes
#CODECACHE:  TR::CodeCache::allocate : largeCodePageSize = 2097152

In case large page is enabled we should modify the codeCacheTotalKB accordingly to avoid the exception.

https://github.com/eclipse-openj9/openj9/blob/3e6498069abf66c8dd67aea5afe39f9534bee14e/runtime/jcl/common/mgmtinit.c#L997

mukul4u2005 commented 1 year ago

Hi , May in know when this fix will be available?

Thanks

SajinaKandy commented 1 year ago

@mukul4u2005 Currently its targeted for the Jan refresh as detailed here.

mpirvu commented 1 year ago

I could enable 1G pages on my home system and I can test this:

  -Xlp:codecache:pagesize=1G     large page size for JIT code cache
                  available large page sizes for JIT code cache:
                  4K
                  1G
mpirvu commented 1 year ago

The issue was fixed by https://github.com/eclipse-openj9/openj9/pull/18342 FYI: @mukul4u2005, since you were interested in this fix. The fix is available for testing purposes in the nightly builds (e.g. https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Build_JDK17_x86-64_linux_Nightly/) and will be officially released and documented with OpenJ9 release 0.43 which will appear sometime in January 2024. In the absence of a fix, any of the following workarounds can be used:

  1. Disable 1 GB large pages on the system. Use 2 MB large pages instead
  2. Disable the usage of large pages for the code cache: -Xlp:codecache:pagesize=4k
  3. Continue to use 1 GB large pages, but allow the total amount of code cache to be 1 GB: -XX:codecachetotal=1G

Workaround (2) may reduce throughput somewhat. Workaround (3) may increase footprint (up to 1 GB)

JamesKingdon commented 1 year ago

Thank you Sajina and Marius, much appreciated.

mukul4u2005 commented 1 year ago

@mpirvu Thanks for update!

Will it be part of openj9 Jre 11 release also?

mpirvu commented 1 year ago

Will it be part of openj9 Jre 11 release also?

Yes, all supported versions of java (including Java11) will have the fix.