adoptium / infrastructure

This repo contains all information about machine maintenance.
Apache License 2.0

System unavailable: Jenkins failing to initiate new jobs correctly. #3552

Closed sxa closed 2 months ago

sxa commented 2 months ago

Here is an example from an attempt to run a Grinder job (#9858, although which job is run does not seem to matter):

There have been no recent plugin updates.

Full log details are in the next two collapsed sections for reference:

Here is the full log from that Grinder job showing the full exception trace:

```
Started by user [Stewart X Addison](https://ci.adoptium.net/user/sxa)
Checking out git ${ADOPTOPENJDK_REPO} into /home/jenkins/.jenkins/workspace/Grinder@script/7d272c0688f17ab4e5b2f6ce77a7dc9cf4df33ff05c3a95eddd38682ef795b79 to read openjdk-tests/buildenv/jenkins/openjdk_tests
The recommended git tool is: git
No credentials specified
Cloning the remote Git repository
Using shallow clone with depth 1
Cloning repository https://github.com/adoptium/aqa-tests.git
 > git init /home/jenkins/.jenkins/workspace/Grinder@script/7d272c0688f17ab4e5b2f6ce77a7dc9cf4df33ff05c3a95eddd38682ef795b79/openjdk-tests # timeout=10
ERROR: Error cloning remote repo 'origin'
hudson.plugins.git.GitException: Could not init /home/jenkins/.jenkins/workspace/Grinder@script/7d272c0688f17ab4e5b2f6ce77a7dc9cf4df33ff05c3a95eddd38682ef795b79/openjdk-tests
    at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$5.execute(CliGitAPIImpl.java:1073)
    at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$2.execute(CliGitAPIImpl.java:819)
    at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1222)
    at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1305)
    at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:129)
    at org.jenkinsci.plugins.workflow.cps.CpsScmFlowDefinition.create(CpsScmFlowDefinition.java:165)
    at org.jenkinsci.plugins.workflow.cps.CpsScmFlowDefinition.create(CpsScmFlowDefinition.java:71)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.run(WorkflowRun.java:311)
    at hudson.model.ResourceController.execute(ResourceController.java:101)
    at hudson.model.Executor.run(Executor.java:442)
Caused by: hudson.plugins.git.GitException: Error performing git command: git init /home/jenkins/.jenkins/workspace/Grinder@script/7d272c0688f17ab4e5b2f6ce77a7dc9cf4df33ff05c3a95eddd38682ef795b79/openjdk-tests
    at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2858)
    at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2762)
    at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2757)
    at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommand(CliGitAPIImpl.java:2051)
    at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$5.execute(CliGitAPIImpl.java:1071)
    ... 9 more
Caused by: java.io.IOException: Cannot run program "git" (in directory "/home/jenkins/.jenkins/workspace/Grinder@script/7d272c0688f17ab4e5b2f6ce77a7dc9cf4df33ff05c3a95eddd38682ef795b79/openjdk-tests"): error=0, Failed to exec spawn helper: pid: 2568427, exit value: 1
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1143)
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1073)
    at hudson.Proc$LocalProc.<init>(Proc.java:252)
    at hudson.Proc$LocalProc.<init>(Proc.java:221)
    at hudson.Launcher$LocalLauncher.launch(Launcher.java:994)
    at hudson.Launcher$ProcStarter.start(Launcher.java:506)
    at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2835)
    ... 13 more
Caused by: java.io.IOException: error=0, Failed to exec spawn helper: pid: 2568427, exit value: 1
    at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
    at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:314)
    at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:244)
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1110)
    ... 19 more
ERROR: Error cloning remote repo 'origin'
ERROR: Maximum checkout retry attempts reached, aborting
Finished: FAILURE
```

Entry in jenkins server log from the above job with the full `WARNING` line - `2024-05-04 13:34:05.581+0000 [id=4373048] WARNING o.j.p.w.flow.FlowExecutionList#unregister: Owner[Grinder/9858:Grinder #9858] was not in the list to begin with: [Owner[build-scripts/utils/betaTrigger_21ea/49:build-scripts/utils/betaTrigger_21ea #49], Owner[build-scripts/openjdk21-pipeline/272:build-scripts/openjdk21-pipeline #272], Owner[build-scripts/jobs/jdk21u/jdk21u-linux-riscv64-temurin/33:build-scripts/jobs/jdk21u/jdk21u-linux-riscv64-temurin #33], Owner[build-scripts/utils/pipeline_jobs_generator_jdk21u/190:build-scripts/utils/pipeline_jobs_generator_jdk21u #190], Owner[AQA_Test_Pipeline/243:AQA_Test_Pipeline #243], Owner[Test_openjdk21_hs_extended.perf_x86-64_linux/51:Test_openjdk21_hs_extended.perf_x86-64_linux #51], Owner[Test_openjdk22_hs_extended.system_x86-64_linux/46:Test_openjdk22_hs_extended.system_x86-64_linux #46], Owner[Test_openjdk21_hs_extended.openjdk_x86-64_linux/52:Test_openjdk21_hs_extended.openjdk_x86-64_linux #52], Owner[Test_openjdk21_hs_extended.system_x86-64_linux/171:Test_openjdk21_hs_extended.system_x86-64_linux #171], Owner[Test_openjdk21_hs_sanity.system_x86-64_linux/173:Test_openjdk21_hs_sanity.system_x86-64_linux #173], Owner[Test_openjdk21_hs_sanity.perf_x86-64_linux/169:Test_openjdk21_hs_sanity.perf_x86-64_linux #169], Owner[Test_openjdk11_hs_sanity.functional_x86-64_linux/530:Test_openjdk11_hs_sanity.functional_x86-64_linux #530], Owner[Test_openjdk11_hs_special.functional_x86-64_linux/191:Test_openjdk11_hs_special.functional_x86-64_linux #191], Owner[Test_openjdk8_hs_extended.system_x86-64_linux/1176:Test_openjdk8_hs_extended.system_x86-64_linux #1176], Owner[Test_openjdk8_hs_sanity.system_x86-64_linux/1179:Test_openjdk8_hs_sanity.system_x86-64_linux #1179], Owner[Test_openjdk21_hs_extended.functional_x86-64_linux/161:Test_openjdk21_hs_extended.functional_x86-64_linux #161], 
Owner[Test_openjdk8_hs_extended.functional_x86-64_linux/613:Test_openjdk8_hs_extended.functional_x86-64_linux #613], Owner[Test_openjdk21_hs_special.functional_x86-64_linux/49:Test_openjdk21_hs_special.functional_x86-64_linux #49], Owner[Test_openjdk21_hs_sanity.functional_x86-64_linux/163:Test_openjdk21_hs_sanity.functional_x86-64_linux #163], Owner[Test_openjdk21_hs_sanity.openjdk_x86-64_linux/189:Test_openjdk21_hs_sanity.openjdk_x86-64_linux #189], Owner[Test_openjdk8_hs_sanity.openjdk_x86-64_linux/1199:Test_openjdk8_hs_sanity.openjdk_x86-64_linux #1199], Owner[Test_openjdk11_hs_sanity.system_x86-64_linux/917:Test_openjdk11_hs_sanity.system_x86-64_linux #917], Owner[Test_openjdk8_hs_extended.perf_x86-64_linux/177:Test_openjdk8_hs_extended.perf_x86-64_linux #177], Owner[Test_openjdk11_hs_extended.system_x86-64_linux/897:Test_openjdk11_hs_extended.system_x86-64_linux #897], Owner[Test_openjdk8_hs_sanity.functional_x86-64_linux/614:Test_openjdk8_hs_sanity.functional_x86-64_linux #614], Owner[Test_openjdk8_hs_sanity.perf_x86-64_linux/1179:Test_openjdk8_hs_sanity.perf_x86-64_linux #1179], Owner[Test_openjdk11_hs_extended.openjdk_x86-64_linux/186:Test_openjdk11_hs_extended.openjdk_x86-64_linux #186], Owner[Test_openjdk11_hs_extended.functional_x86-64_linux/493:Test_openjdk11_hs_extended.functional_x86-64_linux #493], Owner[Test_openjdk11_hs_sanity.openjdk_x86-64_linux/967:Test_openjdk11_hs_sanity.openjdk_x86-64_linux #967], Owner[Test_openjdk11_hs_sanity.perf_x86-64_linux/916:Test_openjdk11_hs_sanity.perf_x86-64_linux #916], Owner[Test_openjdk8_hs_special.functional_x86-64_linux/709:Test_openjdk8_hs_special.functional_x86-64_linux #709], Owner[Test_openjdk8_hs_extended.openjdk_x86-64_linux/182:Test_openjdk8_hs_extended.openjdk_x86-64_linux #182], Owner[Test_openjdk11_hs_extended.perf_x86-64_linux/184:Test_openjdk11_hs_extended.perf_x86-64_linux #184], Owner[build-scripts/release-openjdk17-pipeline/65:build-scripts/release-openjdk17-pipeline #65], 
Owner[build-scripts/jobs/release/jobs/jdk17u/jdk17u-release-linux-riscv64-temurin/1:build-scripts/jobs/release/jobs/jdk17u/jdk17u-release-linux-riscv64-temurin #1], Owner[Test_openjdk17_hs_extended.openjdk_riscv64_linux/17:Test_openjdk17_hs_extended.openjdk_riscv64_linux #17], Owner[Test_openjdk17_hs_extended.openjdk_riscv64_linux_testList_2/4:Test_openjdk17_hs_extended.openjdk_riscv64_linux_testList_2 #4]]`
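
The exception chain in the Grinder log above is long, and the useful part is the innermost `Caused by:` entry. As a rough sketch (not part of the original investigation), the following Python helper collects the `Caused by:` lines from a console log and reports the last one. It assumes the log is available as plain text with one entry per line; the function name `cause_chain` and the `SAMPLE` text (a shortened excerpt of the log above, with the workspace path abbreviated to `<workspace>`) are illustrative only.

```python
def cause_chain(log_text):
    """Return the messages of all 'Caused by:' lines, outermost first."""
    prefix = "Caused by: "
    chain = []
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            chain.append(line[len(prefix):])
    return chain


# Shortened excerpt of the Grinder log; <workspace> is a placeholder, not a real path.
SAMPLE = """\
hudson.plugins.git.GitException: Could not init <workspace>/openjdk-tests
    at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$5.execute(CliGitAPIImpl.java:1073)
Caused by: hudson.plugins.git.GitException: Error performing git command: git init <workspace>/openjdk-tests
    at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2858)
Caused by: java.io.IOException: error=0, Failed to exec spawn helper: pid: 2568427, exit value: 1
    at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
"""

chain = cause_chain(SAMPLE)
# The last entry is the innermost (root) cause.
print(chain[-1] if chain else "no 'Caused by:' entries found")
```

For this log the innermost cause is the `Failed to exec spawn helper` error raised from `ProcessImpl.forkAndExec`, which points at the JVM's ability to start child processes rather than at git or the workspace directory.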

Running manually gives no problems:

$ id
uid=1000(jenkins) gid=1000(jenkins) groups=1000(jenkins)
$ ls -ld /home/jenkins/.jenkins/workspace/Grinder@script/7d272c0688f17ab4e5b2f6ce77a7dc9cf4df33ff05c3a95eddd38682ef795b79/openjdk-tests
drwxr-xr-x 2 jenkins jenkins 4096 May  4 15:23 /home/jenkins/.jenkins/workspace/Grinder@script/7d272c0688f17ab4e5b2f6ce77a7dc9cf4df33ff05c3a95eddd38682ef795b79/openjdk-tests
$ ls -al /home/jenkins/.jenkins/workspace/Grinder@script/7d272c0688f17ab4e5b2f6ce77a7dc9cf4df33ff05c3a95eddd38682ef795b79/openjdk-tests
total 8
drwxr-xr-x 2 jenkins jenkins 4096 May  5 11:57 .
drwxr-xr-x 3 jenkins jenkins 4096 May  5 11:57 ..
$ 

Running `git init /home/jenkins/.jenkins/workspace/Grinder@script/7d272c0688f17ab4e5b2f6ce77a7dc9cf4df33ff05c3a95eddd38682ef795b79/openjdk-tests` manually does not show a problem.

There is no obvious performance problem based on the logs from the last week: [image]

sxa commented 2 months ago

The system is relatively idle other than jobs to support https://github.com/adoptium/infrastructure/issues/3501#issuecomment-2093272575 and one Playbook check job, which was run in part to verify whether non-pipeline jobs were affected (they are not). I'm therefore going to trigger a Jenkins restart (time: 10:22 UTC).

sxa commented 2 months ago

Looks to be happier after the restart - Grinder 9861 kicked off without issues:

```
Started by user [Stewart X Addison](https://ci.adoptium.net/user/sxa)
Checking out git ${ADOPTOPENJDK_REPO} into /home/jenkins/.jenkins/workspace/Grinder@script/7d272c0688f17ab4e5b2f6ce77a7dc9cf4df33ff05c3a95eddd38682ef795b79 to read openjdk-tests/buildenv/jenkins/openjdk_tests
The recommended git tool is: git
No credentials specified
 > git rev-parse --resolve-git-dir /home/jenkins/.jenkins/workspace/Grinder@script/7d272c0688f17ab4e5b2f6ce77a7dc9cf4df33ff05c3a95eddd38682ef795b79/openjdk-tests/.git # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/adoptium/aqa-tests.git # timeout=10
Cleaning workspace
 > git rev-parse --verify HEAD # timeout=10
No valid HEAD. Skipping the resetting
 > git clean -fdx # timeout=10
Pruning obsolete local branches
Using shallow fetch with depth 1
Fetching upstream changes from https://github.com/adoptium/aqa-tests.git
 > git --version # timeout=10
 > git --version # 'git version 2.35.1'
 > git fetch --tags --force --progress --prune --depth=1 -- https://github.com/adoptium/aqa-tests.git +refs/heads/*:refs/remotes/origin/* # timeout=60
 > git rev-parse origin/master^{commit} # timeout=10
JENKINS-19022: warning: possible memory leak due to Git plugin usage; see: https://plugins.jenkins.io/git/#remove-git-plugin-buildsbybranch-builddata-script
Checking out Revision f0319c150c6ec8d6b92659321370dc7f0ccb762f (origin/master)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f f0319c150c6ec8d6b92659321370dc7f0ccb762f # timeout=10
Commit message: "Exclude TestHandshake in JDK17 and JDK21 (#5279)"
 > git rev-list --no-walk f0319c150c6ec8d6b92659321370dc7f0ccb762f # timeout=10
[Pipeline] Start of Pipeline
[Pipeline] timestamps
[Pipeline] {
[Pipeline] echo
SPEC: linux_x86-64
[Pipeline] echo
LABEL: ci.role.test&&hw.arch.x86&&sw.os.linux
[Pipeline] stage
[Pipeline] { (Queue)
[Pipeline] nodesByLabel
Found a total of 12 nodes with the 'ci.role.test&&hw.arch.x86&&sw.os.linux' label
[Pipeline] echo
dynamicAgents: [azure, fyre]
[Pipeline] node
Running on [test-docker-debian12-x64-3](https://ci.adoptium.net/computer/test%2Ddocker%2Ddebian12%2Dx64%2D3/) in /home/jenkins/workspace/Grinder
[...]
```

On the basis of this I'm going to close this issue. Noting that we have an update cycle planned for this Thursday so hopefully it will behave until then.

sxa commented 1 month ago

Noting this may have been due to an update to the Temurin JDK that happened a few days ago https://issues.jenkins.io/browse/JENKINS-72665?focusedId=445724&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-445724 (May 4th at 0501)
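
One way to test this hypothesis after the fact, sketched here under stated assumptions rather than taken from the original investigation: if the JDK files on disk were replaced while the Jenkins JVM kept running, the JDK's spawn helper binary would have a modification time later than the JVM's start time. The helper below only does that comparison. The function name `modified_after` is hypothetical, and in practice you would pass the path of the helper inside the JDK that Jenkins runs on (assumed here to be `lib/jspawnhelper` under the JDK home on Linux) and the JVM start time obtained from whatever source you trust. The demo uses a temporary file and made-up timestamps so it can run anywhere.

```python
import os
import tempfile


def modified_after(path, started_at):
    """Return True if the file at `path` was modified after `started_at` (epoch seconds)."""
    return os.stat(path).st_mtime > started_at


# Demo with a stand-in file and invented timestamps (not real Jenkins data).
with tempfile.NamedTemporaryFile() as helper:
    jvm_start = 1_000_000.0  # hypothetical JVM start time, epoch seconds

    # File changed one minute after the JVM started: consistent with an in-place update.
    os.utime(helper.name, (jvm_start + 60, jvm_start + 60))
    print(modified_after(helper.name, jvm_start))  # True

    # File changed one minute before the JVM started: no in-place update since start.
    os.utime(helper.name, (jvm_start - 60, jvm_start - 60))
    print(modified_after(helper.name, jvm_start))  # False
```

A `True` result would only be consistent with the JDK-update explanation, not proof of it; it would also fit the observation that a Jenkins restart cleared the problem, since the restarted JVM loads the updated files.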