Mesbah-Alam opened 5 years ago
5x Grinder passed 100%: https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/1660/console
Summary of what the failing test is doing:
1) Given a list of instantiated MemoryPoolMXBean objects, the test invokes the java.lang.management.MemoryPoolMXBean.getUsage() API to get the "current usage" value, and then invokes java.lang.management.MemoryPoolMXBean.getPeakUsage() to get the "peak usage" value.
2) It then performs the following validation:
long peakUsageUsed = peak_usage.getUsed();
long currentUsageUsed = current_usage.getUsed();
if (peakUsageUsed < currentUsageUsed) {
    Message.logErr(" Peak usage smaller than current usage here:");
    Message.logErr(" Memory Pool: " + poolName);
    Message.logErr(" Memory Type: " + poolType);
    Message.logErr(" Peak Usage: " + peak_usage.toString());
    Message.logErr(" Current Usage: " + current_usage.toString());
    Assert.fail("Peak Usage used memory smaller than Current Usage used memory");
}
So, intermittently, on various platforms, this validation is failing.
The question is whether this failure indicates something wrong with what we are testing, i.e. the JLM API of the SDK under test.
Javadoc for java.lang.management.MemoryPoolMXBean.getUsage()
Returns an estimate of the memory usage of this memory pool. This method returns null if this memory pool is not valid (i.e. no longer exists).
This method requests the Java virtual machine to make a best-effort estimate of the current memory usage of this memory pool. For some memory pools, this method may be an expensive operation that requires some computation to determine the estimate. An implementation should document when this is the case.
This method is designed for use in monitoring system memory usage and detecting low memory condition.
Javadoc for java.lang.management.MemoryPoolMXBean.getPeakUsage()
Returns the peak memory usage of this memory pool since the Java virtual machine was started or since the peak was reset. This method returns null if this memory pool is not valid (i.e. no longer exists).
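Taken together, a minimal probe of these two APIs looks like the following (a standalone sketch, not the failing test itself; pool names and counts vary by JVM and GC configuration):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class DumpPools {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // Either call may return null if the pool is no longer valid.
            MemoryUsage usage = pool.getUsage();
            MemoryUsage peak = pool.getPeakUsage();
            if (usage == null || peak == null) {
                continue;
            }
            System.out.println(pool.getName() + " (" + pool.getType() + ")"
                    + ": used=" + usage.getUsed()
                    + ", peak used=" + peak.getUsed());
        }
    }
}
```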
FYI.. @lumpfish @JasonFengJ9
@LinHu2016 can you please take a look. Any relation to https://github.com/eclipse/openj9/pull/5324?
@pshipton I will take a look at the issue, thanks.
Based on the example (https://ci.adoptopenjdk.net/view/Test_system/job/openjdk12_hs_systemtest_x86-64_windows/71/tapResults/), the error is for OpenJDK HotSpot G1. Please provide an error case on OpenJ9, thanks.
CL1 stderr j> 2019/06/10 11:07:07.499 Peak usage smaller than current usage here:
CL1 stderr j> 2019/06/10 11:07:07.500 Memory Pool: G1 Old Gen
CL1 stderr j> 2019/06/10 11:07:07.501 Memory Type: Heap memory
CL1 stderr j> 2019/06/10 11:07:07.501 Peak Usage: init = 118489088(115712K) used = 12219608(11933K) committed = 118489088(115712K) max = 268435456(262144K)
CL1 stderr j> 2019/06/10 11:07:07.502 Current Usage: init = 118489088(115712K) used = 12241080(11954K) committed = 50331648(49152K) max = 268435456(262144K)
CL1 stderr Exception in thread "main" java.lang.AssertionError: Peak Usage used memory smaller than Current Usage used memory
CL1 stderr     at org.junit.Assert.fail(Assert.java:88)
CL1 stderr     at net.adoptopenjdk.test.jlm.resources.MemoryData.checkPeakAndCurrentMemoryUsage(MemoryData.java:521)
CL1 stderr     at net.adoptopenjdk.test.jlm.resources.MemoryData.writeData(MemoryData.java:182)
CL1 stderr     at net.adoptopenjdk.test.jlm.remote.MemoryProfiler.getStatsViaProxy(MemoryProfiler.java:183)
CL1 stderr     at net.adoptopenjdk.test.jlm.remote.MemoryProfiler.main(MemoryProfiler.java:110)
@LinHu2016 : The error is intermittent. Started a Grinder for OpenJ9 here: https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/1664/console
Hi @Mesbah-Alam, could you please share the test code or point to the test source code location? If it has the issue on both HotSpot and OpenJ9, it sounds like the issue is in the JCL or the testing code.
@LinHu2016 ,
The problem arises here: https://github.com/AdoptOpenJDK/openjdk-systemtest/blob/5d351a3f33c6e792c646b7a08539778c8a56caac/openjdk.test.jlm/src/test.jlm/net/adoptopenjdk/test/jlm/resources/MemoryData.java#L521
The actual test class is https://github.com/AdoptOpenJDK/openjdk-systemtest/blob/master/openjdk.test.jlm/src/test.jlm/net/adoptopenjdk/test/jlm/TestJlmRemoteMemoryNoAuth.java
Another thought: it might relate to a non-heap memory pool (such as the code cache?). I will wait for the Grinder to catch the issue on OpenJ9 so I can investigate the details.
@Mesbah-Alam the Grinder failed to run the tests.
Likely something is wrong with the Windows build at Adopt. The failure above happened due to AUTO_DETECT mechanism being turned on but the java -version output not containing "Compressed references" string. I launched another Grinder with AUTO_DETECT off, using the published build at Adopt, that failed too: https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/1675/tapResults/
Still occurring from time to time: https://ci.adoptopenjdk.net/job/Test_openjdk8_hs_sanity.system_aarch64_linux/400
This is still occurring in the test builds. @LinHu2016 - are you able to provide any insight into when the values returned by the API calls
java.lang.management.MemoryPoolMXBean.getUsage()
java.lang.management.MemoryPoolMXBean.getPeakUsage()
are updated by the JVM?
I'm looking for a possible explanation for the intermittent test case behavior - for instance, is the data returned by getUsage() always 'live' data but the getPeakUsage() data only updated periodically?
Hi @lumpfish, could you please provide a failure case? I would like to investigate it. In general, for heap memory pools the peak usage is updated at the beginning and end of GC, and also when getUsage(), getPeakUsage() or resetPeakUsage() are called by the user, so usage should never be larger than peak usage. Previously there was a bug in retrieving usage for non-heap memory pools; I would like to confirm whether the failure is related to that fix. Thanks.
Also, getUsage() and getPeakUsage() are two independent APIs, and heap usage can change between the calls. So for a real-time comparison you should call getUsage() before calling getPeakUsage() if you want to compare the two sizes.
So the validation code above should be like this:
long currentUsageUsed = current_usage.getUsed();
long peakUsageUsed = peak_usage.getUsed();
if (peakUsageUsed < currentUsageUsed) {
    Message.logErr(" Peak usage smaller than current usage here:");
    Message.logErr(" Memory Pool: " + poolName);
    Message.logErr(" Memory Type: " + poolType);
    Message.logErr(" Peak Usage: " + peak_usage.toString());
    Message.logErr(" Current Usage: " + current_usage.toString());
    Assert.fail("Peak Usage used memory smaller than Current Usage used memory");
}
@LinHu2016 - thank you for the insight about why current usage might be reported as higher than peak usage. I'll try reversing the API calls as you suggest and see if it makes a difference.
I haven't managed to recreate this failure in a smaller, simpler test case, nor reproduce it locally - I have only seen the failure on Jenkins slave machines.
If you want to run the test locally:
Build the openjdk-systemtest and stf repositories. Either build via testKitGen (https://github.com/AdoptOpenJDK/openjdk-tests/blob/master/doc/userGuide.md) or follow https://github.com/adoptopenjdk/openjdk-systemtest
Run the test.
For Windows:
set JAVA_HOME=java_to_test
set PATH=java_to_test\bin;%PATH%
set SYSTEMTEST_HOME=git_checkout_location - e.g. c:\Users\user\git
set RESULTS_HOME=somewhere_to_put_results - e.g. c:\Users\user\stf_results
perl %SYSTEMTEST_HOME%/stf/stf.core/scripts/stf.pl -test-root="%SYSTEMTEST_HOME%/stf;%SYSTEMTEST_HOME%/openjdk-systemtest" -systemtest-prereqs="%SYSTEMTEST_HOME%/../systemtest_prereqs" -results-root="%RESULTS_HOME%/TestJlmRemoteMemoryNoAuth" -test=TestJlmRemoteMemoryNoAuth
For Unix:
export JAVA_HOME=java_to_test
export PATH=java_to_test/bin:$PATH
export SYSTEMTEST_HOME=git_checkout_location - e.g. /home/user/git
export RESULTS_HOME=somewhere_to_put_results - e.g. /home/user/stf_results
perl $SYSTEMTEST_HOME/stf/stf.core/scripts/stf.pl -test-root="$SYSTEMTEST_HOME/stf;$SYSTEMTEST_HOME/openjdk-systemtest" -systemtest-prereqs="$SYSTEMTEST_HOME/../systemtest_prereqs" -results-root="$RESULTS_HOME/TestJlmRemoteMemoryNoAuth" -test=TestJlmRemoteMemoryNoAuth
The test creates a new directory under RESULTS_HOME. In there you will see an execute.pl file which is run during the test but can be rerun manually. So the commands issued in the test can be tweaked (e.g. to add extra arguments to java command lines) and the test rerun.
Each time the test is run via stf.pl a new directory is created with newly generated execute.pl etc. files.
The java test case code is in the openjdk.test.jlm directory in the openjdk-systemtest repository (under SYSTEMTEST_HOME). If you want to try changes to those files, edit, rebuild and rerun.
So the validation code above should be like this:
long currentUsageUsed = current_usage.getUsed();
long peakUsageUsed = peak_usage.getUsed();
if (peakUsageUsed < currentUsageUsed) {
    Message.logErr(" Peak usage smaller than current usage here:");
    Message.logErr(" Memory Pool: " + poolName);
    Message.logErr(" Memory Type: " + poolType);
    Message.logErr(" Peak Usage: " + peak_usage.toString());
    Message.logErr(" Current Usage: " + current_usage.toString());
    Assert.fail("Peak Usage used memory smaller than Current Usage used memory");
}
I meant call getUsage() before calling getPeakUsage(), not current_usage.getUsed() before peak_usage.getUsed(). Sorry for the confusion.
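To make the intended ordering concrete, a sketch of a check that snapshots getUsage() before getPeakUsage() might look like this (class and variable names are illustrative, not the test's actual code):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class UsageBeforePeak {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // Snapshot current usage FIRST, then peak usage, so the peak
            // snapshot is at least as recent as the usage snapshot.
            MemoryUsage currentUsage = pool.getUsage();
            MemoryUsage peakUsage = pool.getPeakUsage();
            if (currentUsage == null || peakUsage == null) {
                continue;  // pool no longer valid
            }
            if (peakUsage.getUsed() < currentUsage.getUsed()) {
                System.err.println(pool.getName()
                        + ": peak " + peakUsage.getUsed()
                        + " < current " + currentUsage.getUsed());
            }
        }
    }
}
```

As later comments in this thread show, even this ordering does not fully close the window on HotSpot, because the peak is only recorded at certain points rather than maintained as a true high-water mark.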
After more investigation (using a hotspot jdk8 Mac, which shows this failure more often than other platforms, about 1 in 10 test runs), it seems that peak_usage is not updated often enough to guarantee that it will always be equal to or greater than the maximum value that current usage has ever had.
Adding some additional code to the test case demonstrates this:
Memory Pool: Code Cache
Memory Type: Non-heap memory
Part of Memory System: Yes
174: Getting current usage:
Current Memory Usage: init = 2555904(2496K) used = 26777216(26149K) committed = 27262976(26624K) max = 251658240(245760K)
178: Getting peak usage:
Peak Memory Usage: init = 2555904(2496K) used = 26774912(26147K) committed = 27262976(26624K) max = 251658240(245760K)
185: Checking peak usage:
Trace: peakUsageUsed = 26774912, currentUsageUsed = 26777216
Trace: peakUsageUsed = 26774912, currentUsageUsed = 26777216
Trace: peakUsageUsed less than currentUsageUsed, getting values again
Trace: new values peakUsageUsed_2 = 26774912, currentUsageUsed_2 = 26772096
The test case was modified such that if the returned peak usage value was less than the returned current usage value, the values were retrieved once more. Upon retrieving the values a second time, the peak usage value was unchanged, but the current usage value had fallen back below the peak usage value.
The openjdk test case https://github.com/openjdk/jdk/blob/05a764f4ffb8030d6b768f2d362c388e5aabd92d/test/hotspot/jtreg/compiler/codecache/jmx/PeakUsageTest.java#L77 performs a similar test, suggesting that the behavior observed here is not expected.
current_usage.getUsed() should be called before peak_usage.getUsed() ? when will this issue will be fixed ?
Please see the additional comment in https://github.com/AdoptOpenJDK/openjdk-systemtest/issues/274#issuecomment-760313329
I meant call getUsage() before calling getPeakUsage(), not current_usage.getUsed() before peak_usage.getUsed(). Sorry for the confusion.
I am unable to see any error in the test case code, and have not been able to identify the presumed timing window in the openjdk code - see comment https://github.com/AdoptOpenJDK/openjdk-systemtest/issues/274#issuecomment-772473352.
I have not seen this test fail in an openj9 jdk for a long time, only hotspot jdks.
@caozg2021 - I assume from your interest you are also seeing this issue. Is that from running this test case or are you detecting it in a production system?
@lumpfish i test my compiled jdk (hotspot) running this test case using openjdk-systemtest,not a production system.
What jdk version / platform did you build? I haven't been able to reproduce this on my local laptop - I think it might be easier to debug if I could.
@lumpfish The JDK is Corretto 8.275.01.1 Linux x64 (https://github.com/corretto/corretto-8/releases). I tested in Docker (CentOS Linux release 7.8.2003 (Core)). My test script is:
a=0
int=0
while (( $int <= 0 ))
do
    perl $SYSTEMTEST_HOME/stf/stf.core/scripts/stf.pl -test-root="$SYSTEMTEST_HOME/stf;$SYSTEMTEST_HOME/openjdk-systemtest" -systemtest-prereqs="$SYSTEMTEST_HOME/../systemtest_prereqs" -test=TestJlmRemoteMemoryNoAuth
    if [ $? -ne 0 ]
    then
        let "int=1"
    fi
    let "a=a+1"
    echo "loop count " $a
done
I ran it for some hours to reproduce this issue.
I added some debug info in CodeHeapPool::used_in_bytes to print _next_segment and _freelist_segments.
The debug info is like this:
    tty->print("Code Cache printInfo %d, %d\n", _next_segment, _freelist_segments);
I also printed the CodeBlob's name in CodeCache::free, enabled logging in NMethodSweeper::sweep_code_cache, and then reproduced the issue. The debug output is below:
Code Cache printInfo 358516, 1984
Code Cache printInfo 358516, 1984
Code Cache printInfo 358516, 1984
### Sweep at 0 out of 4252. Invocations left: 16
Code Cache printInfo 358557, 1984
UsageMemory Code Cache init 2555904, used 22820672, commit 23003136, max 251658240
### Sweep at 265 out of 4253. Invocations left: 15
CodeCache::free nmethod
### Sweep at 530 out of 4253. Invocations left: 14
CodeCache::free nmethod
### Sweep at 795 out of 4253. Invocations left: 13
CodeCache::free nmethod
Code Cache printInfo 358598, 2060
PeakMemory Code Cache init 2555904, used 22818432, commit 23003136, max 251658240
Code Cache printInfo 358598, 2060
Code Cache printInfo 358598, 2060
Code Cache printInfo 358598, 2060
I think this issue is caused by the CodeCache sweep. First we get the CodeCache pool's usage memory when _next_segment is 358557 and _freelist_segments is 1984, but at that point the MemoryPool's _peak_usage is not updated. Then the sweeper sweeps the CodeCache and _freelist_segments increases. After that we get the CodeCache pool's peak memory, when _next_segment is 358598 and _freelist_segments is 2060. So the CodeCache's peak memory usage is less than the current memory usage.
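A small sketch can replay the arithmetic from the log above (the numbers are taken from the trace; the "sweep" here is simulated, and segment counts are used directly rather than multiplied by the segment size):

```java
public class SweepRaceDemo {
    public static void main(String[] args) {
        // First read (getUsage): counter values from the debug log above.
        long nextSegment = 358557;
        long freelistSegments = 1984;
        long usedAtUsageRead = nextSegment - freelistSegments;   // 356573 segments

        // The sweeper then frees nmethods: _next_segment grows a little,
        // but _freelist_segments grows more (values from the log).
        nextSegment = 358598;
        freelistSegments = 2060;

        // Second read (getPeakUsage): the peak is recomputed from the live
        // counters at this point, not taken from a recorded high-water mark.
        long usedAtPeakRead = nextSegment - freelistSegments;    // 356538 segments

        System.out.println("used at getUsage():     " + usedAtUsageRead);
        System.out.println("used at getPeakUsage(): " + usedAtPeakRead);
        // The "peak" comes out lower than the earlier current usage.
        System.out.println("peak < current: " + (usedAtPeakRead < usedAtUsageRead));
    }
}
```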
I had been investigating along these lines, running locally, but I never thought to look at the Code Cache sweep.
The CodeCache usage is related to _next_segment - _freelist_segments. In my debug log, _next_segment always increases, but _freelist_segments increases or decreases dynamically. @lumpfish, did you modify all the memory pools' get_memory_usage, and how would you test whether this modification has a performance hit?
In my experiment I only modified MemoryUsage CodeHeapPool::get_memory_usage() - that is the only memory pool I've seen the error in - but I can't see a logical reason why that would be the only memory pool affected. Perhaps it is simply more dynamic in its behavior?
I added a call:
    update_peak_memory_usage(initial_size(), used, committed, maxSize);
before MemoryUsage CodeHeapPool::get_memory_usage() returns its MemoryUsage object, and created MemoryPool::update_peak_memory_usage(size_t initial_size, size_t used, size_t committed, size_t max_size) with the same logic as record_peak_memory_usage() but without the call to get_memory_usage(), using the passed-in parameters instead.
It appears to stop the problem occurring - but now peak_usage will be updated more frequently. As a rough performance assessment I was going to add counters for the number of times peak_usage is updated with and without the fix. I haven't got round to it yet - busy on other things at the moment.
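The shape of that fix - folding the freshly computed values into the recorded peak on every read - can be sketched in Java as follows (all names here are illustrative; this is not HotSpot's actual MemoryPool code):

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative pool that records its peak whenever usage is read, so a
// reader can never observe usage above the reported peak.
public class PeakTrackingPool {
    private final AtomicLong peakUsed = new AtomicLong();
    private long used;  // stands in for the pool's live counters

    synchronized void setUsed(long u) { used = u; }

    synchronized long getUsage() {
        // Fold the freshly computed value into the peak before returning it.
        peakUsed.accumulateAndGet(used, Math::max);
        return used;
    }

    long getPeakUsage() { return peakUsed.get(); }

    public static void main(String[] args) {
        PeakTrackingPool pool = new PeakTrackingPool();
        pool.setUsed(100);
        long current = pool.getUsage();      // records peak = 100
        pool.setUsed(60);                    // usage drops (e.g. a sweep)
        long peak = pool.getPeakUsage();     // still 100, not re-sampled
        System.out.println("current=" + current + " peak=" + peak);
        // prints current=100 peak=100
    }
}
```

With this pattern the peak is a true high-water mark of every value a reader has ever been shown, at the cost of updating the peak more often, which is exactly the performance question raised above.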
I tried reproducing the issue using the openjdk head stream (openjdk/jdk repository), but could not. Have you seen it on a release other than jdk8?
My working JDK is jdk8; I have only run the system test on jdk8.
I've created a branch with my attempt at a fix: https://github.com/AdoptOpenJDK/openjdk-jdk8u/compare/master...lumpfish:fix_peak_memory_usage
The change currently includes code to turn the behaviour on and off and print some diagnostics using environment variables:
Running the test with and without UPDATE_PEAK_USAGE_IN_GET_MEMORY_USAGE set shows that with the change _peak_usage is updated approximately 1.5 to 2.5 times more frequently, depending on the memory pool. The test runs for approximately 5 minutes.
Output at the end of the test without the change (i.e. without the environment variable UPDATE_PEAK_USAGE_IN_GET_MEMORY_USAGE set):
_peak_usage_update_count for memoryPool Metaspace now 13700
_peak_usage_update_count for memoryPool Code Cache now 20600
_peak_usage_update_count for memoryPool PS Eden Space now 12800
_peak_usage_update_count for memoryPool PS Survivor Space now 12800
_peak_usage_update_count for memoryPool PS Old Gen now 12800
_peak_usage_update_count for memoryPool Compressed Class Space now 12900
Output at the end of the test with the change (i.e. with the environment variable UPDATE_PEAK_USAGE_IN_GET_MEMORY_USAGE set):
_peak_usage_update_count for memoryPool PS Eden Space now 32800
_peak_usage_update_count for memoryPool PS Survivor Space now 32800
_peak_usage_update_count for memoryPool PS Old Gen now 26400
_peak_usage_update_count for memoryPool Compressed Class Space now 26200
_peak_usage_update_count for memoryPool Code Cache now 34100
_peak_usage_update_count for memoryPool Metaspace now 27100
I have created a standalone client which can attach to any java process (started with the required command line options) and monitor the peak usage in that process.
Using this client the issue can also be seen when running the java db-shootout benchmark from the renaissance benchmarks project https://github.com/renaissance-benchmarks/renaissance. (Presumably other benchmarks would also demonstrate the issue.)
When running the benchmark for 5 minutes, peak usage being lower than current usage is reported about one in eight runs.
This is the client test case: PeakUsageRemoteTest.java.txt. (The .txt extension is required for GitHub to accept the file. Remove the .txt extension, leaving the .java extension, after downloading.)
To allow the client to be attached to a running process, that process must have the following options added to its command line:
-Dcom.sun.management.jmxremote.port=1234 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
To reproduce using the db-shootout benchmark, place the renaissance jar in /home/user/Downloads and run:
java -Dcom.sun.management.jmxremote.port=1234 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -jar /home/user/Downloads/renaissance-gpl-0.11.0.jar db-shootout -t 300
cd /home/user && java -cp . PeakUsageRemoteTest
The test case monitors the peak usage every second 240 times (so should complete before the benchmark). If the benchmark ends while the client test case is still running the client ends abnormally with various exceptions (ConnectException, SocketException or others).
If at any point peak usage is lower than current usage the test case outputs a message:
Code Cache : *** Peak usage (8827328) smaller than current usage (8834368) ***
showing the name of the memory pool with the error (Code Cache), and the usage values returned. I've only ever seen the error reported for 'Code Cache'.
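For reference, the attach-and-poll side of such a client can be sketched with the standard JMX remote API (an illustrative sketch, not the actual PeakUsageRemoteTest source; when run with no arguments it falls back to the in-process beans so it can be tried standalone):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.List;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class PeakUsageClientSketch {
    public static void main(String[] args) throws Exception {
        List<MemoryPoolMXBean> pools;
        if (args.length > 0) {
            // e.g. args[0] = "localhost:1234" for a target started with the
            // com.sun.management.jmxremote.* options shown above.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + args[0] + "/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            pools = ManagementFactory.getPlatformMXBeans(conn, MemoryPoolMXBean.class);
        } else {
            pools = ManagementFactory.getMemoryPoolMXBeans();  // local fallback
        }
        for (MemoryPoolMXBean pool : pools) {
            MemoryUsage usage = pool.getUsage();
            MemoryUsage peakUsage = pool.getPeakUsage();
            if (usage == null || peakUsage == null) {
                continue;
            }
            long current = usage.getUsed();
            long peak = peakUsage.getUsed();
            if (peak < current) {
                System.out.println(pool.getName() + " : *** Peak usage (" + peak
                        + ") smaller than current usage (" + current + ") ***");
            }
        }
    }
}
```

A real monitoring client would wrap the loop body in a once-per-second polling loop, as described above.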
Since the error occurs infrequently this shell script will run both the benchmark and the client repeatedly and should ensure that the benchmark never ends before the client test case.
a=0
rc=0
client_secs=240
benchmark_secs=300
while [ $rc -eq 0 ]
do
    a=$((a + 1))
    echo "Iteration" $a
    # Run benchmark in background
    java -Dcom.sun.management.jmxremote.port=1234 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -jar /home/user/Downloads/renaissance-gpl-0.11.0.jar db-shootout -t $benchmark_secs >/dev/null 2>&1 &
    bpid=$!
    # Run client in foreground for less time than the benchmark
    java -cp . PeakUsageRemoteTest
    # If the test failed stop the loop
    if [ $? -ne 0 ]
    then
        rc=1
    fi
    # wait for the benchmark to finish
    echo "PeakUsageRemoteTest finished, waiting for background process $bpid to finish"
    wait $bpid
    echo "Background process $bpid finished"
    echo "Iteration" $a "finished"
done
@andrew-m-leonard - FYI, https://github.com/adoptium/aqa-systemtest/issues/274#issuecomment-806646077
I've now seen this on a jdk11 hotspot aarch64_linux nightly build for a different memory pool. So not unique to jdk8 or to the 'Code Cache' memory pool. Failing job link: https://ci.adoptopenjdk.net/job/Test_openjdk11_hs_sanity.system_aarch64_linux/477/consoleFull
20:42:40 CL2 j> 2021/03/30 19:42:39.276 Starting to write data
20:43:04 CL2 stderr j> 2021/03/30 19:43:02.041 Peak usage smaller than current usage here:
20:43:04 CL2 stderr j> 2021/03/30 19:43:02.042 Memory Pool: G1 Old Gen
20:43:04 CL2 stderr j> 2021/03/30 19:43:02.045 Memory Type: HEAP
20:43:04 CL2 stderr j> 2021/03/30 19:43:02.045 Peak Usage: init = 106954752(104448K) used = 82918016(80974K) committed = 119537664(116736K) max = 268435456(262144K)
20:43:04 CL2 stderr j> 2021/03/30 19:43:02.046 Current Usage: init = 106954752(104448K) used = 83638704(81678K) committed = 119537664(116736K) max = 268435456(262144K)
20:43:04 CL2 stderr Exception in thread "main" java.lang.AssertionError: Peak Usage used memory smaller than Current Usage used memory
The test fails randomly with jdk8 at a high rate, and according to the test history reports it is neither platform nor machine related. While the test's validity is still under investigation I will exclude it for jdk8 Temurin.
This also happened with TestJlmRemoteMemoryAuth on alpine-linux.
A similar JDK test was failing and was excluded temporarily, but it was fixed here: https://github.com/openjdk/jdk/commit/c0cfad4f3d40852f8b9d1944c07727ddd3278265
In that fix the peak usage is acquired before the current usage, which is also done in TestJlmRemoteMemoryAuth; however, the JDK test also calls bean.resetPeakUsage() before acquiring the peak usage, which TestJlmRemoteMemoryAuth does not seem to do. Maybe doing so would bypass the issue where the peak usage isn't up-to-date, because the reset forces it to be up-to-date.
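If the test adopted that approach, the pattern would be roughly as follows (a sketch only; resetPeakUsage() sets the peak to the current usage at that moment, so this narrows, but may not completely eliminate, the window, since usage can still change between the reset and the reads):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class ResetPeakSketch {
    public static void main(String[] args) {
        for (MemoryPoolMXBean bean : ManagementFactory.getMemoryPoolMXBeans()) {
            if (!bean.isValid()) {
                continue;
            }
            // Resetting sets the peak to the current usage, so the peak read
            // immediately afterwards is guaranteed to be up to date.
            bean.resetPeakUsage();
            long peak = bean.getPeakUsage().getUsed();
            long current = bean.getUsage().getUsed();
            System.out.println(bean.getName() + ": peak=" + peak + " current=" + current);
        }
    }
}
```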
The TestJlmRemoteMemoryNoAuth test intermittently fails with "Peak Usage used memory smaller than Current Usage used memory". We should revisit the unstable test to check its validity.
Example:
https://ci.adoptopenjdk.net/view/Test_system/job/openjdk12_hs_systemtest_x86-64_windows/71/tapResults/