adoptium / infrastructure


OOM on AIX jdk_net_1 && jdk_net_0 tests (extended.openjdk) #3523

Open RadekCap opened 7 months ago

RadekCap commented 7 months ago

Test Info
Test Name: jdk_net_1
Test Duration: 1 hr 45 min 17 sec
Machine: test-osuosl-aix72-ppc64-1
TRSS link for the test output: https://trss.adoptium.net/output/test?id=65d35cd943ff67006e58d3c3

Build Info
Build Name: Test_openjdk11_hs_extended.openjdk_ppc64_aix_testList_0
Jenkins Build start time: Feb 19 2024, 03:20 am
Jenkins Build URL: https://ci.adoptium.net/job/Test_openjdk11_hs_extended.openjdk_ppc64_aix_testList_0/110/
TRSS link for the build: https://trss.adoptium.net/allTestsInfo?buildId=65d359fb43ff67006e589120

Java Version
openjdk version "11.0.23-beta" 2024-04-16
OpenJDK Runtime Environment Temurin-11.0.23+3-202402190059 (build 11.0.23-beta+3-ea)
OpenJDK 64-Bit Server VM Temurin-11.0.23+3-202402190059 (build 11.0.23-beta+3-ea, mixed mode)

This test has failed 19 times since Apr 19 2023, 08:57 pm.
Java Version when the issue was first seen:
openjdk version "11.0.19" 2023-04-18
OpenJDK Runtime Environment Temurin-11.0.19+7 (build 11.0.19+7)
OpenJDK 64-Bit Server VM Temurin-11.0.19+7 (build 11.0.19+7, mixed mode)
Jenkins Build URL: https://ci.adoptium.net/job/Test_openjdk11_hs_extended.openjdk_ppc64_aix_testList_0/76/

The test failed on machine test-osuosl-aix72-ppc64-1 3 times
The test failed on machine test-osuosl-aix72-ppc64-4 3 times
The test failed on machine test-osuosl-aix72-ppc64-5 3 times
The test failed on machine test-osuosl-aix72-ppc64-3 5 times
The test failed on machine test-osuosl-aix72-ppc64-2 4 times
The test failed on machine test-osuosl-aix72-ppc64-6 1 time
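As a quick sanity check, the per-machine counts above can be tallied against the reported total of 19. The sketch below only restates the numbers from this report. The variable names are illustrative and not part of any Adoptium tooling.

```python
# Per-machine failure counts copied from the report above.
failures_by_machine = {
    "test-osuosl-aix72-ppc64-1": 3,
    "test-osuosl-aix72-ppc64-2": 4,
    "test-osuosl-aix72-ppc64-3": 5,
    "test-osuosl-aix72-ppc64-4": 3,
    "test-osuosl-aix72-ppc64-5": 3,
    "test-osuosl-aix72-ppc64-6": 1,
}

# Sum the counts and find the machine with the most failures.
total = sum(failures_by_machine.values())
worst = max(failures_by_machine, key=failures_by_machine.get)

print(f"total failures: {total}")
print(f"most failures: {worst} ({failures_by_machine[worst]})")
```

The counts add up to the reported total and are spread across all six machines, which may point to a shared configuration or capacity limit rather than a single bad host.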



From jdk_net, 2 testcases failing:

RadekCap commented 7 months ago

The deep history indicates it's a pure failure: https://trss.adoptium.net/deepHistory?testId=662075f3879917006ea74ab6

RadekCap commented 7 months ago

Both end in network connection failures:

INFO: ERROR: java.io.IOException: A connection with a remote socket was reset by that socket.

and

INFO: MISC: Closing: PlainHttpConnection: HttpConnection: java.nio.channels.SocketChannel[connected local=/127.0.0.1:61438 remote=localhost/127.0.0.1:61106]
TestServer: Connection writer stopping
Apr 17, 2024 5:25:47 PM jdk.internal.net.http.PlainHttpConnection close
INFO: MISC: Closing: PlainHttpConnection: HttpConnection: java.nio.channels.SocketChannel[connected local=/127.0.0.1:61437 remote=localhost/127.0.0.1:61106]

I'm attaching jtr files. SpecialHeadersTest.jtr.txt

SpecialHeadersTest.jtr.txt

RadekCap commented 7 months ago

Updated the names in the title, as jdk_net_0 has the same failures.

smlambert commented 7 months ago

The two test cases are failing with OutOfMemoryError, which could be a limitation of the machines we have on the public Jenkins server. I'm trying a run on the temurin-compliance Jenkins server to see if the same issue occurs (for those with access to that private server, the link is TCGrinder/4238). That run passes on jck-skytap-aix72-ppc64-4:
Grinder_20240419103418_JDK11_AIX.tap.txt

I will transfer this issue to the infrastructure repository to see if there is a way to ensure we have the same capacity / config on the public AIX machines as on the one attached to the TC Jenkins server.
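One way to compare capacity / config between the public machines and the TC machine is to dump the per-process resource limits on each and diff the output. The following is a minimal sketch using Python's standard `resource` module. The list of limits is an assumption about what might be relevant here, and not every platform (AIX included) necessarily defines all of them, so each one is looked up defensively. It does not cover AIX-specific settings outside the POSIX rlimit interface.

```python
import resource

# Assumed-relevant limits; names missing on the current platform are skipped.
LIMIT_NAMES = [
    "RLIMIT_DATA",
    "RLIMIT_STACK",
    "RLIMIT_NOFILE",
    "RLIMIT_AS",
    "RLIMIT_NPROC",
]


def fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def collect_limits():
    """Return {limit name: (soft, hard)} for the limits this platform defines."""
    limits = {}
    for name in LIMIT_NAMES:
        const = getattr(resource, name, None)
        if const is None:
            continue  # this platform does not expose the limit
        soft, hard = resource.getrlimit(const)
        limits[name] = (fmt(soft), fmt(hard))
    return limits


if __name__ == "__main__":
    for name, (soft, hard) in sorted(collect_limits().items()):
        print(f"{name}: soft={soft} hard={hard}")
```

Running this as the Jenkins agent user on both a public machine and the TC machine, then diffing the two outputs, would show whether the limits actually differ.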

sxa commented 5 days ago

I've changed this issue title to be a generic limits issue for AIX. @andrew-m-leonard is this the same as what you saw at some point in the last week? It mentions java.lang.OutOfMemoryError: unable to create native thread: which sounds similar to what you were seeing.

Also noting that, per https://github.com/adoptium/infrastructure/issues/3065#issuecomment-2493619252, there is an error, Execution failed: main threw exception: java.lang.OutOfMemoryError: Unable to allocate 1073741824 bytes, occurring in java/nio/channels/FileChannel/LargeGatheringWrite.java. (I'm expecting to close that issue when https://github.com/adoptium/aqa-tests/pull/5771 is merged, which will mean this issue can be used to track that too.)

Based on the earlier comment, I'm also trying that on the TC server with grinders 4639-4641. (Edit: all failed with TEST RESULT: Failed. Execution failed: 'main' threw exception: java.io.IOException: No space left on device, since the test tries to write ~2GiB to the /tmp location (ref: https://github.com/adoptium/infrastructure/issues/3129) and the TC machines don't have enough available.)
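A pre-flight check for free space in the temporary directory would make this kind of failure visible before the test runs. The sketch below is a hedged illustration: it treats the "~2GiB" figure as exactly 2 GiB, and it uses `tempfile.gettempdir()` as a stand-in for the location the test actually writes to, which is an assumption. The helper name `has_room` is hypothetical.

```python
import shutil
import tempfile

GIB = 1024 ** 3  # 1073741824 bytes, the size of the failed allocation quoted above

# Assumption: "~2GiB" from the comment is rounded to exactly 2 GiB.
REQUIRED_BYTES = 2 * GIB


def has_room(path, required_bytes=REQUIRED_BYTES):
    """Return (enough_space, free_bytes) for the filesystem holding path."""
    free = shutil.disk_usage(path).free
    return free >= required_bytes, free


if __name__ == "__main__":
    tmp = tempfile.gettempdir()
    ok, free = has_room(tmp)
    status = "enough" if ok else "NOT enough"
    print(f"{tmp}: {free / GIB:.2f} GiB free, {status} for a 2 GiB write")
```

The same check could be run against /tmp on each TC machine to confirm which ones lack the space.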

andrew-m-leonard commented 2 days ago

> I've changed this issue title to be a generic limits issue for AIX. @andrew-m-leonard is this the same as what you saw at some point in the last week? It mentions java.lang.OutOfMemoryError: unable to create native thread: which sounds similar to what you were seeing.

> Also noting that, per #3065 (comment), there is an error, Execution failed: main threw exception: java.lang.OutOfMemoryError: Unable to allocate 1073741824 bytes, occurring in java/nio/channels/FileChannel/LargeGatheringWrite.java. (I'm expecting to close that issue when adoptium/aqa-tests#5771 is merged, which will mean this issue can be used to track that too.)

> Based on the earlier comment, I'm also trying that on the TC server with grinders 4639-4641. (Edit: all failed with TEST RESULT: Failed. Execution failed: 'main' threw exception: java.io.IOException: No space left on device, since the test tries to write ~2GiB to the /tmp location (ref: #3129) and the TC machines don't have enough available.)

The only issue I saw last week, I think, was the timeout after 5 minutes when scheduling the nodes (I think!)