adoptium / infrastructure

This repo contains all information about machine maintenance.
Apache License 2.0

System unavailable: build-alibaba-win2012r2-x64-[12] #1818

Closed sxa closed 3 years ago

sxa commented 3 years ago

This will prevent alibaba windows builds working as they are currently tied to these machines.

Willsparker commented 3 years ago

-1 is back as I've come to it, I'll look at -2 :-)

Willsparker commented 3 years ago

Rather interestingly, neither of the machines has the Jenkins agent installed as a service. They appear to have been running the agent in a Cygwin terminal window. I'll install it on both.

Willsparker commented 3 years ago

I can't install it on either machine, due to the lack of IcedTea-Web. I can install that, but it seems the machines have had a stripped-down version of the playbook run on them, and I'm unsure of the reason for that (asked about it here: https://adoptopenjdk.slack.com/archives/C53GHCXL4/p1610356948467700 ). In the meantime, I'll get the Jenkins agent running in a Cygwin terminal again, so they're at least usable.

Willsparker commented 3 years ago

@Haroon-Khel said he's installing the missing packages on the machines (ref: https://adoptopenjdk.slack.com/archives/C53GHCXL4/p1610356948467700 ).

Haroon-Khel commented 3 years ago

Missing packages have been installed on both alibaba machines, except for the OpenSSL packages. Both machines hit the following error:

TASK [Install OpenSSL-1.1.1i 64-bit (VS2013)] ******************************************************************************************************************************************
task path: /Users/hkhel/AdoptOpenJDK/openjdk-infrastructure/ansible/playbooks/AdoptOpenJDK_Windows_Playbook/roles/OpenSSL/tasks/main.yml:73
fatal: [8.208.87.18]: FAILED! => {"changed": true, "cmd": "set PATH=C:\\Strawberry\\perl\\bin;C:\\openjdk\\nasm-2.13.03;%PATH% && .\\vcvarsall.bat AMD64 && cd C:\\temp\\OpenSSL-1.1.1i && perl C:\\temp\\OpenSSL-1.1.1i\\Configure VC-WIN64A --prefix=C:\\openjdk\\OpenSSL-1.1.1i-x86_64-VS2013 && nmake install > C:\\temp\\openssl64-VS2013.log && nmake -f makefile clean", "delta": "0:00:03.448210", "end": "2021-01-11 12:34:31.186987", "msg": "non-zero return code", "rc": 1, "start": "2021-01-11 12:34:27.738777", "stderr": "'nmake' is not recognized as an internal or external command,\r\noperable program or batch file.\r\n", "stderr_lines": ["'nmake' is not recognized as an internal or external command,", "operable program or batch file."], "stdout": "The specified configuration type is missing.  The tools for the\r\nconfiguration might not be installed.\r\nConfiguring OpenSSL version 1.1.1i (0x1010109fL) for VC-WIN64A\r\nUsing os-specific seed configuration\r\nCreating configdata.pm\r\n

Looking into it
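
Since the failure boils down to 'nmake' not being recognized, one quick check is to see which of the build tools actually resolve on the PATH. A minimal sketch in Python, assuming it is run in the same environment as the failing task; `missing_tools` is a hypothetical helper and not part of the playbook:

```python
import shutil


def missing_tools(tools, path=None):
    """Return the names in `tools` that cannot be resolved on `path`.

    `path` is a PATH-style string; None means the current process PATH.
    """
    return [tool for tool in tools if shutil.which(tool, path=path) is None]


if __name__ == "__main__":
    # nmake and perl are the two tools the OpenSSL task relies on.
    print(missing_tools(["nmake", "perl"]))
```

If nmake shows up as missing even after vcvarsall.bat has supposedly run, that would point at the Visual Studio environment script rather than at OpenSSL itself.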

sxa commented 3 years ago

We're having some issues on these machines after (a) running the rest of the playbooks and (b) switching the Jenkins agent to run as the jenkins user. While most of them have now been resolved, I'm still getting the following issue (even after a reboot) on -1, which I haven't yet been able to fully diagnose ... Still working on it but any crazy ideas welcome :-)

17:00:23  Running gradle with /cygdrive/c/openjdk/jdk-11 at /cygdrive/c/workspace/openjdk-build/workspace/.gradle
17:00:23  Exception in thread "main" java.io.FileNotFoundException: \cygdrive\c\workspace\openjdk-build\workspace\.gradle\wrapper\dists\gradle-6.5-bin\6nifqtx7604sqp1q6g8wikw7p\gradle-6.5-bin.zip.lck (Access is denied)

Haroon-Khel commented 3 years ago

OpenSSL 64-bit VS2013 also isn't installed on either -1 or -2, because vcvarsall.bat is not available in the VS2013 folders. Reinstalling VS2013 didn't seem to solve this.
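
To confirm whether vcvarsall.bat exists anywhere under the Visual Studio install, a recursive search is usually quicker than browsing the folders by hand. A rough sketch, assuming Python is available on the machine; `find_file` is a hypothetical helper, and the install root shown in the usage note is the usual VS2013 default rather than something verified on these machines:

```python
from pathlib import Path


def find_file(roots, filename="vcvarsall.bat"):
    """Return every file called `filename` below the given root directories."""
    matches = []
    for root in roots:
        root = Path(root)
        if root.is_dir():  # silently skip roots that do not exist
            matches.extend(sorted(root.rglob(filename)))
    return matches
```

For example, `find_file([r"C:\Program Files (x86)\Microsoft Visual Studio 12.0"])` returning an empty list would confirm that the reinstall did not lay the script down.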

Willsparker commented 3 years ago

@sxa Have you tried running it with a different JDK (or reinstalled JDK11) ? Presuming you've already looked at all the permissions of the folders and everything.

Haroon-Khel commented 3 years ago

Latest failure: https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk11u/job/jdk11u-windows-x64-dragonwell/38/console Still the same error, but running the same build command in a Cygwin shell, as the jenkins user, on build-alibaba-win2012r2-x64-1 in an RDP session doesn't seem to hit this error.

Haroon-Khel commented 3 years ago

I changed the variable CYGWIN_WORKSPACE to C:\Users\Jenkins\workspace (it was C:\Jenkins\workspace before). This may have done the trick https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk11u/job/jdk11u-windows-x64-hotspot/884/console (the hotspot builds were failing for the same reason too)

Haroon-Khel commented 3 years ago

https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk11u/job/jdk11u-windows-x64-dragonwell/40/console A dragonwell build on alibaba -1 passed, but failed at the installer stage. I think the variable change helped to circumvent the gradle error

Haroon-Khel commented 3 years ago

Running the dragonwell jdk8 job on alibaba-1, https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-windows-x64-dragonwell/46/console, jenkins seems to have a problem with clearing the C:\Users\Jenkins\workspace workspace

Haroon-Khel commented 3 years ago

Reran the jdk11 dragonwell job on alibaba-1 ( https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk11u/job/jdk11u-windows-x64-dragonwell/41/console ) and hit the same error. Oddly, this wasn't a problem yesterday when I ran both jdk11 hotspot and dragonwell jobs on the same machine one after the other.

Haroon-Khel commented 3 years ago

I deleted the C:\Users\Jenkins\workspace directory, then reran the jdk11 hotspot, jdk11 dragonwell and jdk8 dragonwell jobs one after the other. Jenkins didn't complain about not being able to delete workspaces. The CYGWIN_WORKSPACE variable is still C:\Users\Jenkins\workspace for alibaba-1.

Haroon-Khel commented 3 years ago

Regarding the 2013 compiler on alibaba-2, jdk 8 hotspot can build fine. jdk 8 dragonwell exits with this error

Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Note: Some input files use unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
27 errors
make[2]: *** [CompileJavaClasses.gmk:336: /cygdrive/c/cygwin/home/jenkins/openjdk-build/workspace/build/src/build/windows-x86_64-normal-server-release/jdk/classes/_the.BUILD_JDK_batch] Error 1
make[1]: *** [BuildJdk.gmk:64: classes-only] Error 2
make: *** [/home/jenkins/openjdk-build/workspace/build/src//make/Main.gmk:117: jdk-only] Error 2

sxa commented 3 years ago

Looking at https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-windows-x64-dragonwell/53/consoleFull I think that might be the same error occurring on one of the other build machines, so it could well be a problem in the codebase at the moment as opposed to a problem with that machine, so at least for now I wouldn't worry too much about that error.

Haroon-Khel commented 3 years ago

Just an update: the alibaba machines have been identified as having the same problem as https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1662, in which the leftover _the.. file prevents Jenkins from deleting the workspace before running its job. This has affected other Windows boxes, hence the PR https://github.com/AdoptOpenJDK/openjdk-build/pull/2204, so I have put in a similar PR: https://github.com/AdoptOpenJDK/openjdk-build/pull/2400.

Related issue https://github.com/AdoptOpenJDK/openjdk-build/issues/2205
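
To see which leftover files might be blocking the workspace cleanup, the workspace can be scanned for file names starting with `_the.`. This is only a sketch: the prefix is an assumption based on the file names that appear in the build logs (for example `_the.BUILD_JDK_batch`), and `leftover_the_files` is a hypothetical helper, not something from the build scripts.

```python
import os


def leftover_the_files(workspace, prefix="_the."):
    """Return the paths of all files under `workspace` whose name starts with `prefix`."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(workspace):
        for name in filenames:
            if name.startswith(prefix):
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```

Running it against the CYGWIN_WORKSPACE directory after a failed job would list the candidates to remove before Jenkins next tries to wipe the workspace.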

Haroon-Khel commented 3 years ago

I have also changed the CYGWIN_WORKSPACE variable on both alibaba machines to C:\Jenkins\temp since C:\Jenkins\workspace results in the gradle error

17:00:23  Running gradle with /cygdrive/c/openjdk/jdk-11 at /cygdrive/c/workspace/openjdk-build/workspace/.gradle
17:00:23  Exception in thread "main" java.io.FileNotFoundException: \cygdrive\c\workspace\openjdk-build\workspace\.gradle\wrapper\dists\gradle-6.5-bin\6nifqtx7604sqp1q6g8wikw7p\gradle-6.5-bin.zip.lck (Access is denied)

sxa commented 3 years ago

Gut feel at this point is that it's a path length issue, so I suspect any directory with 9 characters like "workspace" would have the issue. We could switch to C:\workspace, which would be more consistent with what we have on the other machines.

Willsparker commented 3 years ago

If you change the workspace variable to C:\workspace , I think the build PR will be unnecessary, as it should be covered by https://github.com/AdoptOpenJDK/openjdk-build/blob/102237341c7f0737f0dd4dc57fcc7e9e3ffe3bd5/pipelines/build/common/openjdk_build_pipeline.groovy#L915 .

Haroon-Khel commented 3 years ago

To reiterate my comments in the build PR, changing the workspace to C:\workspace\openjdk-build (the rm command removes workspaces in C:\workspace\openjdk-build, not C:\workspace\) caused the gradle error to appear again. It's possible that your gut feeling is right @sxa, since C:\Jenkins\temp as a workspace seemed to work fine (I assume it's fine if the total path exceeds 9 characters, so long as each directory doesn't?)

sxa commented 3 years ago

Yep it's the total length that'll make a difference, each individual component shouldn't matter too much (there may be a limit but I haven't hit it before in any normal scenario)
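
To put numbers on the path-length theory, the total length of a deep build path can be compared across candidate workspace roots. A minimal sketch, assuming the relevant limit is the classic Windows MAX_PATH of 260 characters (which includes the terminating NUL); whether that limit is really what these builds hit has not been confirmed here, and the helper names are hypothetical:

```python
from pathlib import PureWindowsPath

MAX_PATH = 260  # classic Windows limit, counting the terminating NUL


def path_length(workspace_root, relative):
    """Length in characters of `relative` joined onto `workspace_root`."""
    return len(str(PureWindowsPath(workspace_root) / relative))


def exceeds_limit(workspace_root, relative, limit=MAX_PATH):
    """True if the joined path would not fit within `limit` (NUL included)."""
    return path_length(workspace_root, relative) >= limit


if __name__ == "__main__":
    # Hypothetical deep path; substitute the longest path a real build produces.
    deep = r"workspace\build\src\build\windows-x86_64-normal-server-release\jdk\gensrc"
    for root in (r"C:\Jenkins\temp", r"C:\workspace\openjdk-build"):
        print(root, path_length(root, deep), exceeds_limit(root, deep))
```

Only the total matters in this check, which matches the point above: a shorter root such as C:\Jenkins\temp buys headroom for every path beneath it.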

sxa commented 3 years ago

@Haroon-Khel If you set the workspace to just C:\workspace does it work ok? I think openjdk-build will get created as a subdirectory of the workspace dir, so setting the workspace to c:\workspace\openjdk-build definitely wouldn't have the desired effect of taking advantage of the line suggested by Will.

Haroon-Khel commented 3 years ago

Setting it to C:\workspace didn't work. It hit the same 'could not delete workspace' error (https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-windows-x64-dragonwell/79/console). I even tried C:/workspace and it hit the same error.

Also, I don't think the openjdk-build directory is created on its own as a subdirectory (but I may be wrong); by setting the workspace to C:\workspace, the subdirectories that are created are C:\workspace\workspace\build ... which therefore won't be deleted by the existing rm command, which runs rm -rf C:/workspace/openjdk-build/workspace/build/src/build/*/jdk/gensrc. It's only if I set the workspace to C:\workspace\openjdk-build\ that I get C:\workspace\openjdk-build\workspace\build ... (unless I am terribly mistaken).

(EDIT: I might be slightly mistaken, but I am certain the openjdk-build subdirectory does not get created automatically, since I can't find it in any of the workspaces used aside from when I used C:\workspace\openjdk-build as the workspace)
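
The mismatch between the rm pattern and the directories that actually get created can be illustrated with a quick pattern check. This is a sketch only: it uses Python's `fnmatch`, whose `*` also matches `/` (unlike a shell glob), which makes no difference for the two paths tested here; `covered_by_rm` is a hypothetical helper and the configuration directory name is taken from the build logs in this issue.

```python
from fnmatch import fnmatchcase

# Pattern used by the existing rm command in the pipeline.
RM_PATTERN = "C:/workspace/openjdk-build/workspace/build/src/build/*/jdk/gensrc"


def covered_by_rm(path, pattern=RM_PATTERN):
    """True if `path` (with either slash style) matches the rm pattern."""
    return fnmatchcase(path.replace("\\", "/"), pattern)


if __name__ == "__main__":
    conf = "windows-x86_64-normal-server-release"
    # Workspace set to C:\workspace\openjdk-build: covered by the rm.
    print(covered_by_rm(rf"C:\workspace\openjdk-build\workspace\build\src\build\{conf}\jdk\gensrc"))
    # Workspace set to C:\workspace: not covered, so leftovers stay behind.
    print(covered_by_rm(rf"C:\workspace\workspace\build\src\build\{conf}\jdk\gensrc"))
```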

Another note: I can't find a single Windows machine in Jenkins with the CYGWIN_WORKSPACE variable set to C:\workspace, so I am not sure why https://github.com/AdoptOpenJDK/openjdk-build/blob/102237341c7f0737f0dd4dc57fcc7e9e3ffe3bd5/pipelines/build/common/openjdk_build_pipeline.groovy#L915 was put there in the first place, unless those machines have since been decommissioned.

Haroon-Khel commented 3 years ago

https://github.com/AdoptOpenJDK/openjdk-build/issues/1855#issue-636830242 Ahh, Softlayer machines which we don't have anymore

sxa commented 3 years ago

> AdoptOpenJDK/openjdk-build#1855 (comment) Ahh, Softlayer machines which we don't have anymore

They were replaced with the -ibmcloud- ones.

sxa commented 3 years ago

> Setting it to C:\workspace didn't work. It hit the same 'could not delete workspace' error (https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-windows-x64-dragonwell/79/console). I even tried C:/workspace and it hit the same error.

We need to understand why - looks like you've changed it back to be the openjdk-build subdirectory. I'm going to reset it and try again.

sxa commented 3 years ago

> (EDIT: I might be slightly mistaken, but I am certain the openjdk-build subdirectory does not get created automatically, since I can't find it in any of the workspaces used aside from when I used C:\workspace\openjdk-build as the workspace)

Worth remembering that setting it to something that long causes other problems, so we know that won't work.

Unclear why the default directory used for deletion included the openjdk-build if that's the case though. Suspect we may want to adjust the default so it excludes the openjdk-build, but it might be safer for now to just add a new delete as you've done in the PR.

I'm running a test just now, and I've queued up another one using c:\Jenkins\temp as per your PR and made it use your branch with the extra rm. You have access to run that job if needed.

sxa commented 3 years ago

The last run mentioned in the previous comment looks ok, as does a subsequent one on the machine from the same c:/Jenkins/temp directory using your PR.

Haroon-Khel commented 3 years ago

The build-ibmcloud machines use E:/jenkins/tmp as their CYGWIN_WORKSPACE (hence the need for https://github.com/AdoptOpenJDK/openjdk-build/blob/102237341c7f0737f0dd4dc57fcc7e9e3ffe3bd5/pipelines/build/common/openjdk_build_pipeline.groovy#L919), so I don't think that rm command is of any use anymore.

sxa commented 3 years ago

Slightly confused by the last comment - you don't think which rm command is of use?

Haroon-Khel commented 3 years ago

Sorry, the command which deletes the C:\workspace\openjdk-build\ ... directory, since I think this was specific to the softlayer machines

sxa commented 3 years ago

Yes agreed your new line could probably replace that one

Haroon-Khel commented 3 years ago

Can this issue be closed now? Or do we want to use this issue to discuss the mysterious _the.. file?

sxa commented 3 years ago

> Can this issue be closed now? Or do we want to use this issue to discuss the mysterious _the.. file?

We haven't yet enabled all the tags to build etc. overnight, so it is still "unavailable". I've run a couple of test jobs though (https://ci.adoptopenjdk.net/view/Test_system/job/Test_openjdk15_j9_sanity.system_x86-64_windows/ 152 and 153 - one on each machine) and that looks ok, so I think I'll re-enable all the tags and let it run tonight on the JDK16 builds and see what happens. I'm also running https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-windows-x64-hotspot/907 on alibaba-1 to test VS2013, but it looks like it doesn't have a valid JDK7 boot dir configured.

Haroon-Khel commented 3 years ago

> but it looks like it doesn't have a valid JDK7 boot dir configured

I'm not sure why this is; just yesterday I ran a jdk8 dragonwell job which made it past that stage on alibaba-1: https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-windows-x64-dragonwell/78/consoleFull

sxa commented 3 years ago

Dragonwell 8 does not use a JDK7 boot JDK. Hotspot does. We need this machine to be able to build all variants including HotSpot

Haroon-Khel commented 3 years ago

Ok, that would explain it

Haroon-Khel commented 3 years ago

I've changed the JDK7_BOOT_DIR on alibaba-1 to /cygdrive/c/openjdk/jdk-7 (the - was missing) and done the same on alibaba-2.
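
A typo like the missing hyphen is easy to catch with a small sanity check on the configured boot directory. A rough sketch, assuming a usable boot JDK is a directory containing bin/java or bin/java.exe; `boot_jdk_problems` is a hypothetical helper, not part of the build scripts:

```python
from pathlib import Path


def boot_jdk_problems(boot_dir):
    """Return a list of problems found with a boot JDK directory (empty if it looks fine)."""
    problems = []
    root = Path(boot_dir)
    if not root.is_dir():
        problems.append(f"{boot_dir} is not a directory")
    elif not any((root / "bin" / name).is_file() for name in ("java", "java.exe")):
        problems.append(f"{boot_dir} has no bin/java or bin/java.exe")
    return problems
```

Running this against the value of JDK7_BOOT_DIR (translated to a path the Python interpreter understands) before a build would flag a wrong directory name up front instead of partway through a job.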

sxa commented 3 years ago

There was another problem showing up whereby the builds would fail if the Jenkins agent was running as a service rather than being started from a Cygwin shell. The difference is that the default PATH on the system had the Windows Git client first, whereas the Cygwin shell had its own one first. Adjusting the system PATH to have C:\cygwin\bin first resolved that problem, so the machines now have the agent running correctly as a service.
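
The PATH ordering problem can be checked by listing, in lookup order, every PATH entry that provides a git executable; the first one listed is the one that wins. A minimal sketch with a hypothetical helper name, assuming the executable is called git or git.exe:

```python
import os


def providers_in_order(path_value, names=("git", "git.exe"), sep=os.pathsep):
    """Return the PATH entries containing any of `names`, in lookup order."""
    hits = []
    for entry in path_value.split(sep):
        if entry and any(os.path.isfile(os.path.join(entry, n)) for n in names):
            hits.append(entry)
    return hits


if __name__ == "__main__":
    # The first entry printed is the git that a build started from this
    # environment would pick up.
    print(providers_in_order(os.environ.get("PATH", "")))
```

Comparing the output from the service's environment with that of an interactive Cygwin shell should show the two orderings described above.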

sxa commented 3 years ago

@Haroon-Khel Ref the first of the job links above, it looks like alibaba-1 doesn't have cmake on it - I thought you'd run most of the playbooks on it - is that not the case? https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk11u/job/jdk11u-windows-x64-openj9-windowsXL/635/console

sxa commented 3 years ago

JDK17/HS didn't run properly either :-(

23:22:58  Building targets 'product-images legacy-jre-image test-image' in configuration 'windows-x86_64-server-release'
23:22:58  Compiling 8 files for BUILD_TOOLS_LANGTOOLS
23:22:58  error: file not found: \cygdrive\c\jenkins\temp\workspace\build\src\build\windows-x86_64-server-release\buildtools\langtools_tools_classes\_the.BUILD_TOOLS_LANGTOOLS_batch.filelist
23:22:58  make[3]: *** [ToolsLangtools.gmk:37: /cygdrive/c/jenkins/temp/workspace/build/src/build/windows-x86_64-server-release/buildtools/langtools_tools_classes/_the.BUILD_TOOLS_LANGTOOLS_batch] Error 3
23:22:58  make[2]: *** [make/Main.gmk:74: buildtools-langtools] Error 2
23:22:58  make[2]: *** Waiting for unfinished jobs....
23:23:01

Haroon-Khel commented 3 years ago

> @Haroon-Khel Ref the first of the job links above, it looks like alibaba-1 doesn't have cmake on it - I thought you'd run most of the playbooks on it - is that not the case? https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk11u/job/jdk11u-windows-x64-openj9-windowsXL/635/console

I'm certain I did. Will look into it.

Haroon-Khel commented 3 years ago

I ran the cmake role on -1. Though Ansible said it installed cmake, I couldn't find it on the machine, nor did it update the PATH. So I manually installed cmake in Program Files\CMake and added it to the PATH. I was unable to install it manually in cygwin/bin because the installer did not have sufficient privileges (even though I was running it as the Administrator user).

It should be noted that ansible checks for an already installed cmake in cygwin64\bin, while these machines have only a cygwin\bin directory. Would it suffice simply to rename the directory to cygwin64? Or must cygwin be reinstalled completely for this?

Haroon-Khel commented 3 years ago

Rerunning a jdk11 openj9 job on -1. If it succeeds, I'll install cmake this way onto -2: https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk11u/job/jdk11u-windows-x64-openj9/907/console

Haroon-Khel commented 3 years ago

The job made it past the cmake check, but failed with this:

12:25:48  Compiling 13 properties into resource bundles for jdk.javadoc
12:25:49  /usr/bin/bash: /cygdrive/c/Program: No such file or directory
12:25:49  make[3]: *** [/cygdrive/c/Jenkins/temp/workspace/build/src/closed/OpenJ9.gmk:414: /cygdrive/c/Jenkins/temp/workspace/build/src/build/windows-x86_64-normal-server-release/vm/cmake.stamp] Error 127
12:25:49  make[2]: *** [/cygdrive/c/Jenkins/temp/workspace/build/src/closed/custom/Main.gmk:51: j9vm-build] Error 2
12:25:49  make[2]: *** Waiting for unfinished jobs....
12:25:49  Compiling 19 properties into resource bundles for jdk.compiler
12:25:49  Compiling 12 properties into resource bundles for jdk.jdeps
12:25:57  
12:25:57  ERROR: Build failed for targets 'product-images legacy-jre-image test-image debug-image' in configuration 'windows-x86_64-normal-server-release' (exit code 2) 
12:25:57  
12:25:57  No indication of failed target found.
12:25:57  Hint: Try searching the build log for '] Error'.
12:25:57  Hint: See doc/building.html#troubleshooting for assistance.
12:25:57  
12:25:57  make[1]: *** [/cygdrive/c/Jenkins/temp/workspace/build/src/make/Init.gmk:305: main] Error 2
12:25:57  make: *** [/cygdrive/c/Jenkins/temp/workspace/build/src/make/Init.gmk:186: product-images] Error 2

Haroon-Khel commented 3 years ago

The 12:25:49 /usr/bin/bash: /cygdrive/c/Program: No such file or directory line suggests that it's having trouble with the spaces in Windows directory names.
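
The way an unquoted path with a space gets cut at `/cygdrive/c/Program` can be reproduced with POSIX-style word splitting. A small sketch using Python's `shlex`, which follows similar splitting rules to bash for this simple case; the full cmake path below is an assumption for illustration (only the Program Files\CMake install directory is known from this issue):

```python
import shlex

# Hypothetical location of the cmake binary under the manual install directory.
CMAKE = "/cygdrive/c/Program Files/CMake/bin/cmake"


def first_word(command_line):
    """Return the first word a POSIX shell would see in `command_line`."""
    return shlex.split(command_line)[0]


unquoted = CMAKE + " --version"
quoted = shlex.quote(CMAKE) + " --version"

print(first_word(unquoted))  # /cygdrive/c/Program
print(first_word(quoted))    # /cygdrive/c/Program Files/CMake/bin/cmake
```

The unquoted form yields exactly the path bash complained about, so either quoting the path where the makefile invokes it or pointing at a location without spaces (such as the short name) should avoid the split.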

Willsparker commented 3 years ago

Program Files (or the x86 equivalent) doesn't have a Short Name, would be my guess. See #1250 #1598 #1672 (and their referenced issues)

Haroon-Khel commented 3 years ago

02/02/2021  08:07 PM    <DIR>          PROGRA~1     Program Files
01/14/2021  06:44 PM    <DIR>          PROGRA~2     Program Files (x86)

They do seem to have Shortnames enabled