adoptium / infrastructure

This repo contains all information about machine maintenance.
Apache License 2.0

New Machine requirement: Windows dockerBuild containers #3286

Closed sxa closed 1 month ago

sxa commented 10 months ago

I need to request a new machine:

Please explain what this machine is needed for: Running builds in an isolated way where we can achieve SLSA build level 3 compliance on Windows along with the other primary platforms. Ideally we'll be able to create windows-on-windows container images which we share and then download and run the builds in.

As background info:

So the tasks required would be:

Once this level of analysis and expertise is gained it will likely make windows installer testing, or any other such activities simpler and give us more options moving forward.

Related for historic reference:

RadekCap commented 7 months ago

Please, assign this task to me. Thank you.

sxa commented 3 months ago

Of the three options listed on the Microsoft website:

sxa commented 3 months ago

OK First phase done ...

(Also, for my own notes, to debug powershell scripts use Set-PSDebug -Trace 2)

sxa commented 3 months ago

Playbook execution notes:

sxa commented 3 months ago

Ansible can be run on the host and pointed at the container if you install Cygwin, which has ansible as one of its installable packages (you probably want to include git too if it's a clean install on the host system). If you use localhost/127.0.0.1 in your hosts file, specify -e git_sha=12345 or something appropriate, otherwise the execution will trip up over https://github.com/adoptium/infrastructure/blob/4aa7788325c224484f99aa1ae000f117e9b081d7/ansible/playbooks/AdoptOpenJDK_Windows_Playbook/roles/logs/tasks/main.yml#L14 WSL could probably be used too, but that requires a system where the virtualization extension instructions are available, which is not the case on all systems.
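A minimal sketch of such an invocation, assuming it is run from a Cygwin shell at the top of an infrastructure checkout. Only the -e git_sha=... argument comes from the note above; the main.yml playbook name, the inventory path and the build_playbook_cmd helper are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: prints (does not run) an ansible-playbook command line
# that supplies git_sha explicitly, as needed when the inventory points at
# localhost/127.0.0.1.
PLAYBOOK="ansible/playbooks/AdoptOpenJDK_Windows_Playbook/main.yml"  # assumed file name

build_playbook_cmd() {
    inventory="$1"   # path to the hosts file (assumed layout)
    git_sha="$2"     # any suitable value, e.g. the SHA of the checkout
    printf 'ansible-playbook -i %s -e git_sha=%s %s\n' \
        "$inventory" "$git_sha" "$PLAYBOOK"
}

# Example: print the command for a local inventory file called "hosts"
build_playbook_cmd hosts 12345
```

Printing the command rather than running it makes it easy to review before executing it against the container.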

sxa commented 2 months ago

Latest attempt is with: --skip-tags adoptopenjdk,reboot,MSVS_2013,MSVS_2017,NTP_TIME (Note: MSVS_2013 is skipped because I didn't have the installer on the machine, and 2017 did not work. Dragonwell could also be added to skip that install, which is not required for Temurin.) Playbook changes to make it complete:

After ansible run is complete, run the commands shown in this article

docker ps
docker stop <image>
docker commit <image> win2022_build_image

After which it can be started again and used

sxa commented 2 months ago

docker commit didn't work on my image:

Error response from daemon: re-exec error: exit status 1: output: mkdir \\?\C:\Windows\SystemTemp\hcs376450290\Files: Access is denied

This is specific to the new image which has had the playbook run on it and does not occur when attempting to commit an image with only basic changes applied.

EDIT: This seems to be the temporary location where it is storing the entire image before it is committed and the machine ran out of space.

Noting that outside that directory most of the docker data is stored in C:\ProgramData\docker

EDIT 2: The docker commit command on the second machine which had adequate space used around 95GB of space in C:\windows\SystemTemp to perform the commit (excluded VS2013 and 2017) and took about 40 minutes at 40-50Mb/sec showing on resource monitor, followed by about 10 minutes of using another 15GB on C: then moving data back to the docker directory at a faster rate (Maybe ~100Mb/sec)

It did, however, hit an error:

Error response from daemon: re-exec error: exit status 1: output: hcsshim::ImportLayer failed in Win32: Access is denied. (0x5)

(Probably hit a zero disk space condition on C:, as DOCKER_TMPDIR apparently hasn't worked to relocate that directory since docker 25)
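Given the space figures above, a pre-flight check before running docker commit may save a failed 40-minute attempt. This is a sketch only, assuming a Cygwin shell on the host: the 110 GiB default is an assumption derived from the roughly 95GB plus 15GB observed above, not a documented limit, and the function names are made up for illustration.

```shell
#!/bin/sh
# Pre-flight check before `docker commit`: is there enough free space on the
# drive that holds C:\Windows\SystemTemp?
REQUIRED_GB="${REQUIRED_GB:-110}"   # assumed threshold, adjust to taste

# free_kb DIR: free space in KiB on the filesystem containing DIR
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# enough_space AVAILABLE_KB REQUIRED_GB: succeeds if the space is sufficient
enough_space() {
    [ "$1" -ge $(( $2 * 1024 * 1024 )) ]
}

# check_commit_space DIR: report whether a commit is likely to fit
check_commit_space() {
    avail="$(free_kb "$1")"
    if enough_space "$avail" "$REQUIRED_GB"; then
        echo "OK: ${avail} KiB free on $1"
    else
        echo "WARNING: only ${avail} KiB free on $1, want ${REQUIRED_GB} GiB" >&2
        return 1
    fi
}
```

Typical use would be something like check_commit_space /cygdrive/c followed by the docker commit only if the check succeeds.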

sxa commented 2 months ago

This is unfortunate. The builds aren't working because automatic shortname generation (fsutil behavior set disable8dot3 0) does not appear to work within the container, yet it is mandatory for the openjdk build process. Directories can have a shortname created manually with fsutil file setshortname "Long name" shortname but that is not ideal to do for each possible path.

EDIT: Noting that https://github.com/adoptium/infrastructure/blob/master/ansible/playbooks/AdoptOpenJDK_Windows_Playbook/roles/shortNames/tasks/main.yml already has some explicit short name creation.
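If several short names do need creating by hand, the fsutil commands can be generated from a list rather than typed one at a time. A sketch under assumptions: the directory/shortname pairs below are hypothetical examples, and the script only prints the commands so they can be reviewed before being run in an elevated shell inside the container.

```shell
#!/bin/sh
# Print one `fsutil file setshortname` command per "long name|short name"
# pair read from stdin. Nothing is executed here.
shortname_cmds() {
    while IFS='|' read -r long short; do
        [ -n "$long" ] || continue          # skip blank lines
        printf 'fsutil file setshortname "%s" %s\n' "$long" "$short"
    done
}

# Hypothetical example pairs
shortname_cmds <<'EOF'
C:\Program Files|PROGRA~1
C:\Program Files (x86)|PROGRA~2
EOF
```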

sxa commented 2 months ago

Manually created a few of the shortnames that the configure step was objecting to and I have a JDK21u build complete in a container, so this seems feasible 👍🏻

sxa commented 2 months ago

Noting that we should look at doing this with the MS build tools installer which is suitable for use by Open Source projects. The jdk21u builds currently use:

10:04:20  * C Compiler:     Version 19.37.32822 (at /cygdrive/c/progra~1/micros~3/2022/commun~1/vc/tools/msvc/1437~1.328/bin/hostx64/x64/cl.exe)
10:04:20  * C++ Compiler:   Version 19.37.32822 (at /cygdrive/c/progra~1/micros~3/2022/commun~1/vc/tools/msvc/1437~1.328/bin/hostx64/x64/cl.exe)

Other references (this numbering is more confusing than I realised - I thought we only had the '2022' vs '19.xx' versioning differences to worry about before today...)

sxa commented 2 months ago

Struggling with the GPG role at the moment which is called during the ANT role (I'm getting gnupg as a requirement which supplies gpg2 instead of gpg). Also Wix has to be skipped as I don't have ansible.builtin.runs available.

Other than that a two-phase dockerfile is looking quite promising. The first sets up WinRM (will only be invoked locally) and installs cygwin with git and ansible, then triggers a reboot to ensure the cygwin path takes effect.

The second runs the playbooks as normal, although for now I have it running in multiple layers to speed up testing, so that the caching of each layer takes effect independently:

  1. --skip-tags adoptopenjdk,reboot,ANT,NTP_TIME,Wix,MSVS_2013,MSVS_2017,MSVS_2019,MSVS_2022
  2. -t ANT
  3. -t MSVS_2019
  4. -t MSVS_2022

This is currently using the playbook branch at https://github.com/sxa/infrastructure/tree/sxa_allhosts which makes a few changes to support this execution.
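The four layers listed above can be expressed as one playbook invocation per layer, each of which would become its own cached step in the dockerfile. A sketch only: the tag lists are taken from the list above, while the playbook file name and the layer_args helper are assumptions.

```shell
#!/bin/sh
# Print the ansible-playbook command for each image layer.
PLAYBOOK="${PLAYBOOK:-main.yml}"   # assumed playbook file name

# layer_args N: tag selection for layer N, as listed in the comment above
layer_args() {
    case "$1" in
        1) echo "--skip-tags adoptopenjdk,reboot,ANT,NTP_TIME,Wix,MSVS_2013,MSVS_2017,MSVS_2019,MSVS_2022" ;;
        2) echo "-t ANT" ;;
        3) echo "-t MSVS_2019" ;;
        4) echo "-t MSVS_2022" ;;
        *) echo "unknown layer: $1" >&2; return 1 ;;
    esac
}

for layer in 1 2 3 4; do
    echo "ansible-playbook $(layer_args "$layer") $PLAYBOOK"
done
```

Keeping the large Visual Studio installs in their own layers means a change to an earlier, smaller step does not necessarily force those installs to be repeated from scratch in every test cycle.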

sxa commented 2 months ago

The above approach seemed to work yesterday now that the machine is rebooted after adding cygwin to the PATH, and I had a system which was able to successfully build jdk21u using two dockerfiles (the first to configure WinRM, the second to run the playbooks using the individual layers from the previous comment). Next steps as follows:

Noting that the image without VS2013 or 2017 is 99GB in size.

sxa commented 2 months ago

Now fixed the path setting so that it only requires one dockerfile so we have something consistent with what we have on Linux now 👍🏻

It still currently requires a username/password for the authentication, but the password can be passed into the dockerfile with --build-arg PW=SomeAcceptablePassword on the docker build command.
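One way to keep the password off the command line is to give docker the argument name only, in which case the value is taken from the environment of the docker client. The sketch below assumes the dockerfile declares a matching ARG PW; the default Dockerfile name and image tag are placeholders, and the helper only prints the command. Note that build arguments can still show up in the image history.

```shell
#!/bin/sh
# Print a docker build command that passes PW as a build argument without
# putting its value on the command line. PW must be exported beforehand.
build_cmd() {
    if [ -z "${PW:-}" ]; then
        echo "PW must be set in the environment first" >&2
        return 1
    fi
    printf 'docker build --build-arg PW -f %s -t %s .\n' \
        "${1:-Dockerfile}" "${2:-win2022_build_image}"
}
```

For example, after exporting PW, build_cmd Dockerfile.win2022v2.txt prints the command to run from the directory containing that dockerfile.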

I haven't got it picking up the git_sha properly yet so that is currently hard-coded. Everything else is good enough to be able to run a jdk21u build on, but it's missing the compilers for some earlier versions (will need those on the host and mapped in via Vendor_Files, similar to what we do with AWX). Also we'll want the jenkins_user role (currently skipped via adoptopenjdk) unless we're happy with the processes running as an administrator within the container (need to check how well user mapping works in these containers).

Otherwise, here is the dockerfile Dockerfile.win2022v2.txt which uses the playbook changes from https://github.com/sxa/infrastructure/tree/windows_docker_fixes

sxa commented 2 months ago

VS2013 install appears to complete OK (Based on the logs in C:\Windows\SystemTemp - more detailed logs are in C:\Temp) but the playbook doesn't terminate that role so it never continues.

Sizes:

| Version | Path | Total file size on file system |
| --- | --- | --- |
| VS2022 | C:\Program Files\Microsoft Visual Studio\2022 | 19.7G |
| VS2019 | C:\Program Files\Microsoft Visual Studio\2019 | 12.5G |
| VS2017? | C:\Program Files\Microsoft Visual Studio 14.0 | 2.3G |
| n/a | C:\Program Files (x86)\Windows Kits | 14G (+7GB with VS2017) |
| n/a | C:\Program Files (x86)\Microsoft SDKs | 5.8G |

NOTE: Running the playbooks from the dockerfile with all the Visual Studio installations excluded produces a docker image which is 15.4G in size

NOTE 2: If the machine runs out of disk space on C: during a commit phase, there will be hcs* directories left under C:\Windows\SystemTemp which should be removed manually.
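The manual cleanup from NOTE 2 could be scripted roughly as follows. This is a sketch, assuming a Cygwin shell on the host and that no commit is in progress; the default path is simply the Cygwin spelling of C:\Windows\SystemTemp, and the function name is made up. It lists candidates by default and deletes only when asked.

```shell
#!/bin/sh
# List leftover hcs* directories under a temp directory, and remove them only
# when --delete is given as the second argument.
clean_hcs_dirs() {
    dir="${1:-/cygdrive/c/Windows/SystemTemp}"
    mode="${2:-}"
    for d in "$dir"/hcs*; do
        [ -d "$d" ] || continue            # no match, or not a directory
        if [ "$mode" = "--delete" ]; then
            rm -rf -- "$d" && echo "removed $d"
        else
            echo "would remove $d"
        fi
    done
}
```

Running clean_hcs_dirs with no arguments gives a dry run; adding the directory and --delete performs the removal.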

sxa commented 2 months ago

Steps to set up:

From there you can run this to start the container:

Then go through the normal build process:

sxa commented 2 months ago

Based on https://github.com/adoptium/temurin-build/issues/2922#issuecomment-2269480488 we may be able to switch to using Visual Studio 2022 for everything which would significantly reduce the windows installation requirements. The dockerfile is currently set up to only install VS2022 and not the other versions.

sxa commented 1 month ago

Next bullet on the list is to: Integrate this into the build pipelines

Initial attempts using a jenkins workspace directory with a drive on F: failed because the jenkins docker failed to map it into F: in the container as there was only a C: drive. Switched the workspace directory to C:\jenkins-workspace and we hit path limits:

10:03:48  configure: error: Your base path is too long. It is 112 characters long, but only 100 is supported

Now moved to using C:\ws for the directory and it seems to be progressing well:

Machine: dockerhost-azure-win2022-x64-1 (temporary, called sxa-win2022-3 in the Azure console)
Build job: https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk21u/job/jdk21u-windows-x64-docker/
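A candidate workspace location can be checked against the limit from the configure error before a build is started. A sketch only: the 100-character figure comes from that error message, exactly which path configure measures is not stated here, and the example paths are illustrative.

```shell
#!/bin/sh
# Check prospective base paths against the 100-character limit reported by
# configure.
MAX_BASE_PATH=100

# path_ok PATH: succeeds if PATH is within the limit
path_ok() {
    [ "${#1}" -le "$MAX_BASE_PATH" ]
}

for p in \
    "/cygdrive/c/ws" \
    "/cygdrive/c/jenkins-workspace/workspace/build-scripts/jobs/jdk21u/jdk21u-windows-x64-docker/workspace/build/src"
do
    if path_ok "$p"; then
        echo "${#p} chars, ok: $p"
    else
        echo "${#p} chars, too long: $p"
    fi
done
```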

Noting that I have had errors like this and while I have not identified the exact cause, clearing out the build-scripts directory in the workspace resolves it:

10:24:33  Checked out HEAD commit SHA:
[Pipeline] sh
10:24:34  sh: c:/jw/workspace/build-scripts/jobs/jdk21u/jdk21u-windows-x64-docker@tmp/durable-34ace7f2/script.sh.copy: No such file or directory

The build (both jdk8u and jdk21u) then failed later on with another path length issue. I have therefore shortened the name of the job to windbld (Windows Docker Build) and the build has run through to completion. This will need further investigation but it's a good position at which to end the week :-) I've had to make some changes in the build repository to make this work, most specifically using git config --global safe.directory /cygdrive/c/jw/workspace/build-scripts/jobs/jdk21u/windbld in openjdk_build_pipeline.groovy to avoid errors such as the one in https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk21u/job/windbld/35/console:

10:14:10  + git clean -fdx
10:14:10  fatal: detected dubious ownership in repository at '/cygdrive/c/jw/workspace/build-scripts/jobs/jdk21u/jdk21u-windows-x64-docker'
10:14:10  To add an exception for this directory, call:
10:14:10  
10:14:10    git config --global --add safe.directory /cygdrive/c/jw/workspace/build-scripts/jobs/jdk21u/jdk21u-windows-x64-docker
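The exception that git asks for can be added idempotently, so repeated pipeline runs do not keep appending the same entry. A sketch under assumptions: the function name is made up, and this is not necessarily how openjdk_build_pipeline.groovy does it. Note that git config --global reads and writes the .gitconfig under the current HOME.

```shell
#!/bin/sh
# Add a directory to git's global safe.directory list unless it is already
# there.
add_safe_directory() {
    dir="$1"
    if git config --global --get-all safe.directory | grep -Fxq -- "$dir"; then
        echo "already trusted: $dir"
    else
        git config --global --add safe.directory "$dir"
        echo "added: $dir"
    fi
}
```

For the job above this would be called with /cygdrive/c/jw/workspace/build-scripts/jobs/jdk21u/windbld as its argument.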

Successful builds in jenkins with windbld job name:

sxa commented 1 month ago

Job https://ci.adoptium.net/job/win2022_docker_image_updater/label=dockerhost-azure-win2022-x64-1/ is being prototyped to create the docker image. It is a stripped down copy of the rhel7/s390x one and will save to win2022_notrhel_image on the host for now, and as per earlier comments it does not include the infrastructure SHA.

sxa commented 1 month ago

Summary

With the initial feasibility done, I'm going to leave this closed and create follow-on items for the subsequent tasks and the outstanding items on the list:

Jenkins job refs:

sxa commented 1 month ago

Note: The HOME environment variable set when the jenkins agent is started is significant, as it affects where git picks up the .gitconfig from during the pipeline checkout on the host. On the current machine I'm using for testing this is set in the startjenkins.sh script before the agent is started.

sxa commented 1 month ago

Above PR should fix the issue with long file names - I'm doing some extra tests to verify with my current job and have also initiated https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk21u/job/jdk21u-windows-x64-temurin/138/console to test with the full job name. It should be good with the PR in place as it's using the same logic for overriding the default workspace location as we use in the non-docker situation on Windows.

Note that as part of this I have switched from using the C:\jw directory for the top level jenkins home on the docker host machine to C:\workspace for consistency with the non-docker case.

sxa commented 1 month ago

For my own reference - the build times on the docker machine (Not as powerful as the main build machines - it's 2 core / 8GiB) are:

Version Time for 2-core docker build Typical time on Azure 4-core machine
jdk8u 52m 31m
jdk11u 2h14 1h31
jdk17u 2h20 1h27
jdk21u 2h32 1h29
jdk24 1h45
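The relative cost of the 2-core docker host can be worked out from the rows above with a short script. This is only a sketch: the to_minutes helper is made up for illustration, and the jdk24 row is left out because it lists a single time.

```shell
#!/bin/sh
# to_minutes: convert durations written like "52m" or "2h14" into minutes.
to_minutes() {
    case "$1" in
        *h*)
            h="${1%%h*}"          # part before the "h"
            m="${1#*h}"           # part after the "h"
            m="${m%m}"            # drop an optional trailing "m"
            m="${m#0}"            # drop one leading zero so 08 is not read as octal
            echo $(( h * 60 + ${m:-0} ))
            ;;
        *m) echo "${1%m}" ;;
        *)  echo "$1" ;;
    esac
}

# Rows copied from the table: version, 2-core docker time, 4-core Azure time
while read -r ver docker azure; do
    d="$(to_minutes "$docker")"
    a="$(to_minutes "$azure")"
    echo "$ver: ${d}m vs ${a}m ($(( d * 100 / a ))% of the 4-core time)"
done <<'EOF'
jdk8u 52m 31m
jdk11u 2h14 1h31
jdk17u 2h20 1h27
jdk21u 2h32 1h29
EOF
```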

sxa commented 1 month ago

First build using the main pipelines on the dockerhost machine: https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk21u/job/jdk21u-windows-x64-temurin/151/

"NODE_LABEL": "dockerhost-azure-win2022-x64-1",
"DOCKER_IMAGE": "notrhel_build_image",

USER_REMOTE_CONFIGS:

{
    "branch": "docker_windows_shortpath",
    "remotes": {
        "url": "https://github.com/sxa/ci-jenkins-pipelines.git"
    }
}

DEFAULTS_JSON:

        "pipeline_branch": "docker_windows_shortpath",
        "pipeline_url": "https://github.com/sxa/ci-jenkins-pipelines.git",

sxa commented 1 week ago

It's been quite a lot of work but the sign_Verification job now has a working run after a refactor of the code that does the signing and assembly within the pipelines. Ref: https://github.com/adoptium/infrastructure/issues/3709#issuecomment-2373386390 A bit of cleaning up, and then verifying that it can create reproducible builds, will mean this can go in as a PR.

sxa commented 1 week ago

--create-sbom wasn't working as ant is not in the PATH on the machine. For now I've added that to the path of the environment variables in the jenkins machine definition, but that's probably something we want to cover in the container image setup.
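Until the image setup covers this, the PATH adjustment could be made in whatever script starts the agent. A minimal sketch: ANT_BIN is a hypothetical install location rather than the path used on the real machine, and add_to_path is a made-up helper.

```shell
#!/bin/sh
# Append a directory to PATH only if it is not already present.
ANT_BIN="${ANT_BIN:-/cygdrive/c/apache-ant/bin}"   # hypothetical location

add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
}

add_to_path "$ANT_BIN"
export PATH

# Warn if ant still cannot be resolved after the change
command -v ant >/dev/null 2>&1 || echo "ant still not found; check ANT_BIN" >&2
```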