adoptium / infrastructure

This repo contains all information about machine maintenance.
Apache License 2.0

Tracking issue: Networking Issues #3190

Open adamfarley opened 11 months ago

adamfarley commented 11 months ago

Summary

This issue is for storing details of networking issues seen during triage or otherwise.

Details

Each entry should include the date, host, error message, and a URL.

Problems recorded here may also be tracked in their own issues elsewhere, but this issue is primarily for unpredictable, temporary problems.
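The four fields requested above can be captured in a small record. This is a minimal Python sketch, not something the repository provides: the `TriageEntry` class and its method names are hypothetical, and the sample values are copied from an entry later in this thread.

```python
from dataclasses import dataclass


@dataclass
class TriageEntry:
    """One networking-issue report (hypothetical helper, not part of the repo)."""
    date: str   # e.g. "7 Jun 2024"
    host: str   # Jenkins agent name
    error: str  # relevant console log line(s), copied verbatim
    url: str    # link to the failing job's console

    def missing_fields(self) -> list[str]:
        """Return the names of any fields that were left empty."""
        return [name for name, value in vars(self).items() if not value.strip()]

    def render(self) -> str:
        """Format the entry as a plain-text comment body."""
        return (
            f"Date: {self.date}\n"
            f"Host: {self.host}\n"
            f"Error:\n\n{self.error}\n\n"
            f"URL: {self.url}"
        )


entry = TriageEntry(
    date="7 Jun 2024",
    host="build-siteox-solaris10u11-sparcv9-1",
    error=(
        "11:43:15  Exception: org.jenkinsci.plugins.workflow.support.steps."
        "AgentOfflineException: Unable to create live FilePath for "
        "build-siteox-solaris10u11-sparcv9-1; build-siteox-solaris10u11-sparcv9-1 "
        "was marked offline: Connection was broken"
    ),
    url="https://ci.adoptium.net/job/Test_openjdk8_hs_sanity.functional_sparcv9_solaris_testList_0/2/console",
)
print(entry.render())
```

Calling `missing_fields()` before posting is one way to catch an entry that lacks a URL or host.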

Examples:

sxa commented 10 months ago

GitHub access issue on the mac installer creation job: https://ci.adoptium.net/job/build-scripts/job/release/job/create_installer_mac/10953/console (No machine selected at this point)

ERROR: Error cloning remote repo 'origin'
hudson.plugins.git.GitException: Command "git fetch --tags --force --progress -- https://github.com/adoptium/installer.git +refs/heads/*:refs/remotes/origin/*" returned status code 128:
stdout: 
stderr: fatal: unable to access 'https://github.com/adoptium/installer.git/': Failed to connect to github.com port 443: Operation timed out

adamfarley commented 8 months ago

https://ci.adoptium.net/job/Test_openjdk17_hs_sanity.openjdk_aarch64_linux/373/consoleFull - test-docker-centos8-armv8-1
https://ci.adoptium.net/job/Test_openjdk17_hs_extended.openjdk_aarch64_linux_testList_1/42/console - test-docker-ubuntu1804-armv8l-4

Exception: org.jenkinsci.plugins.workflow.support.steps.AgentOfflineException: Unable to create live FilePath for test-docker-ubuntu1804-armv8l-4; test-docker-ubuntu1804-armv8l-4 was marked offline: Connection was broken

The failures happened about an hour apart, but the wording is about the same.
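One way to check that two failures really share the same wording is to mask the parts that vary between agents before comparing. The sketch below is an illustration only: the `normalize` and `group_by_wording` helpers are hypothetical, and the regular expressions are assumptions based solely on the agent names and messages quoted in this thread.

```python
import re
from collections import defaultdict

# Assumed patterns: agent names here start with test-, build- or dockerhost-.
# This would also match strings such as "build-scripts", so feed it error
# lines rather than URLs.
HOST = re.compile(r"\b(?:test|build|dockerhost)-[A-Za-z0-9_]+(?:-[A-Za-z0-9_]+)*")
CHANNEL_ID = re.compile(r"hudson\.remoting\.Channel@[0-9a-f]+")
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2}\s+")


def normalize(line: str) -> str:
    """Strip the console timestamp and mask channel ids and host names."""
    line = TIMESTAMP.sub("", line.strip())
    line = CHANNEL_ID.sub("hudson.remoting.Channel@<id>", line)
    return HOST.sub("<host>", line)


def group_by_wording(lines):
    """Group raw log lines by their normalized wording."""
    groups = defaultdict(list)
    for line in lines:
        groups[normalize(line)].append(line)
    return dict(groups)


sample = [
    "Exception: org.jenkinsci.plugins.workflow.support.steps.AgentOfflineException: "
    "Unable to create live FilePath for test-docker-ubuntu1804-armv8l-4; "
    "test-docker-ubuntu1804-armv8l-4 was marked offline: Connection was broken",
    "11:43:15  Exception: org.jenkinsci.plugins.workflow.support.steps.AgentOfflineException: "
    "Unable to create live FilePath for build-siteox-solaris10u11-sparcv9-1; "
    "build-siteox-solaris10u11-sparcv9-1 was marked offline: Connection was broken",
]
for wording, matches in group_by_wording(sample).items():
    print(len(matches), wording)
```

Both sample lines reduce to the same normalized string, so they land in a single group even though the hosts differ.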

adamfarley commented 8 months ago

Two tests failed because they couldn't download renaissance.jar for performance runs, as part of (or just after) the liberty setup.

URL seems fine on my machine, so assuming it's a temporary upstream networking/server issue until future failures prove otherwise.

https://ci.adoptium.net/job/Test_openjdk11_hs_sanity.perf_x86-64_linux/884/console
https://ci.adoptium.net/job/Test_openjdk11_hs_extended.perf_x86-64_linux/161/console

Both ran on test-equinix_esxi-ubuntu2204-x64-1

adamfarley commented 8 months ago

https://ci.adoptium.net/job/Test_openjdk21_hs_extended.openjdk_x86-64_mac_testList_1/31/console

Cannot contact test-orka-macos14-x64-96vzx: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel

smlambert commented 8 months ago

https://github.com/adoptium/infrastructure/issues/3190#issuecomment-1841094904 - should be resolved by https://github.com/adoptium/aqa-tests/pull/4903 once merged.

sxa commented 8 months ago

> Two tests failed because they couldn't download renaissance.jar for performance runs, as part of (or just after) the liberty setup. URL seems fine on my machine, so assuming it's a temporary upstream networking/server issue until future failures prove otherwise.

Those are both disk space issues, not networking ones.

18:03:21  tee: /home/jenkins/workspace/Test_openjdk11_hs_sanity.perf_x86-64_linux/aqa-tests/TKG/../TKG/output_compilation/compilation.log: No space left on device
18:07:00      [retry] Attempt [2]: error occurred; retrying...tee: /home/jenkins/workspace/Test_openjdk11_hs_extended.perf_x86-64_linux/aqa-tests/TKG/../TKG/output_compilation/compilation.log: No space left on device

adamfarley commented 8 months ago

As noted here, JDK8u is missing a large number of published binaries from the temurin8-binaries repo.

I think that is because the jdk8 build pipeline on the 11th timed out while uploading the binaries. I'm not sure whether this was caused by a hang or just slow uploads, as the upload job seems to lack regular timestamps.

Currently my plan is to ignore this unless it becomes a pattern.

adamfarley commented 1 month ago

Date: 7 Jun 2024
Host: build-siteox-solaris10u11-sparcv9-1
Error:

11:43:15  Exception: org.jenkinsci.plugins.workflow.support.steps.AgentOfflineException: Unable to create live FilePath for build-siteox-solaris10u11-sparcv9-1; build-siteox-solaris10u11-sparcv9-1 was marked offline: Connection was broken

URL: https://ci.adoptium.net/job/Test_openjdk8_hs_sanity.functional_sparcv9_solaris_testList_0/2/console

adamfarley commented 1 month ago

Four issues that look the same:

Date: 26 Jun 2024
Hosts: test-docker-ubi9-armv8l-1, test-docker-ubuntu2204-armv8-2, test-docker-ubuntu2310-armv8l-1, test-docker-ubuntu2204-armv8l-2
Error:

01:57:59  STF 00:57:58.849 - +------ Step 3 - Wait for processes to complete
01:57:59  STF 00:57:58.849 - | Wait for processes to meet expectations
01:57:59  STF 00:57:58.849 - |   Processes: [LT1, CL1]
01:57:59  STF 00:57:58.849 - |
01:57:59  STF 00:57:58.849 - Monitoring processes: CL1 LT1
01:58:02  CL1 j> 2024/06/27 00:58:00.679 ServerURL=service:jmx:rmi:///jndi/rmi://localhost:1234/jmxrmi
01:58:02  CL1 j> 2024/06/27 00:58:01.673 Attempting to connect
01:58:03  CL1 j> 2024/06/27 00:58:03.213 Monitored VM not ready at Jun 27, 2024, 12:58:03 AM (attempt 1, elapsed 1269ms).
01:58:03  CL1 j> 2024/06/27 00:58:03.215 Waiting 5 secs and trying again...
01:58:09  CL1 j> 2024/06/27 00:58:08.215 Attempting to connect
01:58:10  CL1 j> 2024/06/27 00:58:09.412 Connection established!
01:58:11  CL1 j> 2024/06/27 00:58:11.088 Starting to write data
02:00:18  Cannot contact test-docker-ubi9-armv8l-1: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@1ee8196a:test-docker-ubi9-armv8l-1": Remote call on test-docker-ubi9-armv8l-1 failed. The channel is closing down or has closed down

URLs:

adamfarley commented 1 month ago

Date: 26 Jun 2024
Host: test-docker-ubuntu2004-armv7l-5
Error:

02:00:16  Cannot contact test-docker-ubuntu2004-armv7l-5: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@50e98578:test-docker-ubuntu2004-armv7l-5": Remote call on test-docker-ubuntu2004-armv7l-5 failed. The channel is closing down or has closed down

URL: https://ci.adoptium.net/job/Test_openjdk11_hs_sanity.openjdk_arm_linux_testList_0/4/console

adamfarley commented 1 month ago

Date: 26 Jun 2024
Host: test-orka-macos14-x64-z7lvx
Error:

21:40:32  Exception: org.jenkinsci.plugins.workflow.support.steps.AgentOfflineException: Unable to create live FilePath for test-orka-macos14-x64-z7lvx; test-orka-macos14-x64-z7lvx was marked offline: Connection was broken

URL: https://ci.adoptium.net/job/Test_openjdk21_hs_extended.openjdk_x86-64_mac_testList_1/13/console

adamfarley commented 1 month ago

Date: 26 Jun 2024
Hosts: test-docker-ubuntu2004-armv8l-1, test-docker-sles15-armv8l-1, test-docker-ubuntu2404-armv8-1, test-docker-debian12-armv8l-1, test-docker-fedora39-armv8l-1
Error:

02:00:24  Cannot contact test-docker-ubuntu2004-armv8l-1: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@601a06d8:test-docker-ubuntu2004-armv8l-1": Remote call on test-docker-ubuntu2004-armv8l-1 failed. The channel is closing down or has closed down

URLs:

adamfarley commented 1 month ago

Date: 25 Jun 2024
Host: test-docker-debian12-armv7l-1
Error:

02:00:25  Cannot contact test-docker-debian12-armv7l-1: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@353dc87:test-docker-debian12-armv7l-1": Remote call on test-docker-debian12-armv7l-1 failed. The channel is closing down or has closed down

URL: https://ci.adoptium.net/job/Test_openjdk17_hs_extended.openjdk_arm_linux_testList_0/14/console

adamfarley commented 3 weeks ago

Date: 2024/08/02
Host: test-orka-macos14-x64-4ncp7
Error: 04:37:43 Cannot contact test-orka-macos14-x64-4ncp7: java.lang.InterruptedException
URL: https://ci.adoptium.net/job/Test_openjdk23_hs_extended.openjdk_x86-64_mac/18/

Similar failures:

Date: 2024/08/01
Host: test-orka-macos14-x64-97f7w
Error: 22:18:18 Cannot contact test-orka-macos14-x64-97f7w: java.lang.InterruptedException
URL: https://ci.adoptium.net/job/Test_openjdk24_hs_extended.openjdk_x86-64_mac_testList_3/3/

adamfarley commented 2 weeks ago

Date: 2024/08/08
Host: build-marist-rhel8-s390x-1
Error:

21:44:52  Downloading GA release of boot JDK version 23 failed.
21:44:52  Attempting to download EA release of boot JDK version 23 from https://api.adoptium.net/v3/binary/latest/23/ea/linux/s390x/jdk/hotspot/normal/adoptium
21:44:52    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
21:44:52                                   Dload  Upload   Total   Spent    Left  Speed
21:44:52  
21:44:52    0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
21:44:52    0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
21:44:52  
21:44:52    0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
21:44:52    0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
21:44:52  
21:44:52    5  192M    5 10.0M    0     0  13.9M      0  0:00:13 --:--:--  0:00:13 13.9M
21:44:52  curl: (18) transfer closed with 191282016 bytes remaining to read

URL: https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk/job/jdk-linux-s390x-temurin/290/

Date: 15 Aug 2024
Host: dockerhost-skytap-ubuntu2004-ppc64le-1
Error:

20:01:25  Attempting to download EA release of boot JDK version 23 from https://api.adoptium.net/v3/binary/latest/23/ea/linux/ppc64le/jdk/hotspot/normal/adoptium
20:01:25    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
...
20:01:28    9  204M    9 20.0M    0     0  9106k      0  0:00:22  0:00:02  0:00:20 13.9M
20:01:28  curl: (18) transfer closed with 192960112 bytes remaining to read

URL: https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk/job/jdk-linux-ppc64le-temurin/300/console
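Both logs above end with curl exit code 18, which curl documents as a partial file transfer. As a rough aid for spotting this pattern in other console logs, the following Python sketch pulls the exit code and the number of bytes still outstanding from such lines. The `find_curl_errors` helper is hypothetical, and the "bytes remaining" pattern is an assumption taken from the exact message text in these two logs.

```python
import re

# Matches lines such as: "curl: (18) transfer closed with N bytes remaining to read"
CURL_ERROR = re.compile(r"curl: \((\d+)\) (.*)")
BYTES_REMAINING = re.compile(r"transfer closed with (\d+) bytes remaining to read")


def find_curl_errors(log_lines):
    """Return (exit_code, bytes_remaining or None) for each curl error line."""
    results = []
    for line in log_lines:
        error = CURL_ERROR.search(line)
        if error is None:
            continue
        remaining = BYTES_REMAINING.search(error.group(2))
        results.append(
            (int(error.group(1)), int(remaining.group(1)) if remaining else None)
        )
    return results


log = [
    "21:44:52  curl: (18) transfer closed with 191282016 bytes remaining to read",
    "20:01:28  curl: (18) transfer closed with 192960112 bytes remaining to read",
]
for code, remaining in find_curl_errors(log):
    print(f"curl exit {code}: {remaining} bytes were never received")
```

A large remaining byte count relative to the download size suggests the connection dropped early in the transfer rather than near the end.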

adamfarley commented 1 week ago

Date: 8 Aug 2024
Host: dockerhost-skytap-ubuntu2204-x64-1
URLs:

The jobs before and after these show no sign of this issue. Will ignore for now, and raise a new issue if it occurs again.