adoptium / infrastructure

This repo contains all information about machine maintenance.
Apache License 2.0

windocker: `script.sh.copy: No such file or directory` #3714

Open sxa opened 2 months ago

sxa commented 2 months ago

This is seen periodically in the windbld jobs, possibly just after error conditions on previous runs, but that is not certain. It is often resolved by removing the C:\jw\workspace\build-scripts directory, although I have seen situations where I've done that, run another build which has failed, cleared it again and then it works, so it's unclear whether we're experiencing some delay before the cleanup has the desired effect. The root cause of this error is currently unknown.

Noting that to run a test multiple times without taking up a full build cycle, you can set "JAVA_TO_BUILD": "jdkXXu" in the job. This starts the job but aborts it with an error about the java version after the point at which this failure occurs.

sxa commented 2 months ago

Noting that ls -l on the host shows the owner of the files under build-scripts\job\jdk21u\windbld@tmp\durable* (including script.sh.copy) as the user that jenkins is running as on the host. When the same ls is run in a container, the owner shows as Unknown+User:Unknown+Group. Files created within the container (such as the workspace directory under windbld) show as ContainerUser:ContainerUser when viewed from inside the container. Confusingly, those also show as the same user that jenkins is running as when looked at on the host.

sxa commented 2 months ago

The attempt to use .gitconfig in C:\jw isn't working. If I issue a git config --global -l from within the workflow I get a failure:

12:22:29  + git config --global -l
12:22:29  fatal: unable to read config file '/cygdrive/c/jw/.gitconfig': No such file or directory

If I issue that immediately after adding a safe.directory parameter with git config then it shows the correct value so it's using a git configuration from elsewhere at that point.

If I move it out of the way then it fails earlier in the pipeline:

12:47:36  [CHECKOUT] Checking out User Pipelines https://github.com/sxa/ci-jenkins-pipelines.git : windows_docker_support
[Pipeline] checkout
12:47:36  The recommended git tool is: git
12:47:36  No credentials specified
12:47:36  Warning: JENKINS-30600: special launcher org.jenkinsci.plugins.docker.workflow.WithContainerStep$Decorator$1@1f132a55; decorates hudson.plugins.cygpath.CygpathLauncherDecorator$1@cef8bfd will be ignored (a typical symptom is the Git executable not being run inside a designated container)
12:47:36  Cloning the remote Git repository
12:47:36  ERROR: Error cloning remote repo 'origin'
12:47:36  hudson.plugins.git.GitException: Command "git fetch --tags --force --progress -- https://github.com/sxa/ci-jenkins-pipelines.git +refs/heads/*:refs/remotes/origin/*" returned status code 128:

So for now it seems that this gitconfig file, and whatever it's using when I explicitly add in the safe.directory setting, are both required, so I'll leave both in place. Note that before I set the safe.directory options, I set the HOME variable to the jw directory via a sh -c set command.
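For reference, a minimal sketch of how that check could be reproduced in a small pipeline. The stage name, the use of export rather than the sh -c set approach, and the use of $PWD as the safe.directory value are all assumptions made for illustration; the image name is the one used by the standalone test pipeline later in this issue, and the HOME path is taken from the log above.

```groovy
// Sketch only: not the actual build pipeline.
pipeline {
    agent { docker { image 'notrhel_build_image' } }
    stages {
        // Hypothetical stage name
        stage('Check git config') {
            steps {
                // Each sh step starts a fresh shell, so HOME is exported
                // in the same step that runs the git commands.
                sh '''
                    export HOME=/cygdrive/c/jw
                    git config --global --add safe.directory "$PWD"
                    git config --global -l
                '''
            }
        }
    }
}
```

If the listing shows the safe.directory entry but a later step still fails to read /cygdrive/c/jw/.gitconfig, that would be consistent with git picking up its global configuration from more than one location.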

sxa commented 2 months ago

Ref: https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk21u/job/windbld/

Note that after a successful (ish) jdk8u build (169) I had two consecutive failures trying to kick off jdk21u (170,171), but then the third one (172) passed that step without requiring the workspace to be moved out of the way.

sxa commented 2 months ago

Based on some investigations in https://github.com/adoptium/infrastructure/issues/3723 I tried changing the ownership of the @tmp directory so that it was definitely owned by ContainerUser but that didn't make a difference. The first time running after I explicitly removed the @tmp directory the job started to run through successfully. We will see if that is repeatable.

sxa commented 2 months ago

Answer: No. After jdk21u completed (subject to https://github.com/adoptium/infrastructure/issues/3709) in windbld run 242, jobs 244 and 245 failed, but the following 246 passed - all were run after clearing out the @tmp and cyclonedx-lib directories.

Run 247 afterwards then went straight through without problems (again after removing those two directories).

So we still have inconsistencies. I'm thinking it would be nice to get a simple pipeline which starts a container and is able to demonstrate this, since our multi-thousand line monolith isn't ideal for problem reproduction/raising upstream.

sxa commented 2 months ago

I've just tested this using a standalone jenkins pipeline:

pipeline {
    agent any
    stages {
        stage('Test Docker on Windows') {
            agent { docker { image 'notrhel_build_image' } }
            steps {
                println('Attempting to run commands in docker container')
                sh(script: 'cmd /c echo Hello')
                sh(script: 'hostname')
                sh(script: 'ls -l c:/')
                sh(script: 'ls -l c:/workspace')
                sh(script: 'ls -l c:/workspace/workspace')
                sh(script: 'ls -l c:/workspace/workspace/windtest')
            }
        }
    }
}

Running a sequence of jobs, the error occurred after a varying number of steps had passed: 5,1,6,0,0,1,1,0,0,0,0 (the run showing 6 passed all of them!!)

sxa commented 2 months ago

Running the same jobs with bat() instead of sh() appears to pass reliably. Intriguing ...
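A sketch of what the bat() variant of that test pipeline might look like. The exact commands used in the bat() runs are not recorded here, so the dir substitutions for ls -l and the use of echo for the message are assumptions.

```groovy
// Sketch only: the same shape as the sh() test pipeline, using bat() steps.
pipeline {
    agent any
    stages {
        stage('Test Docker on Windows') {
            agent { docker { image 'notrhel_build_image' } }
            steps {
                echo 'Attempting to run commands in docker container'
                bat(script: 'echo Hello')
                bat(script: 'hostname')
                // Assumed Windows equivalents of the ls -l calls
                bat(script: 'dir c:\\')
                bat(script: 'dir c:\\workspace')
            }
        }
    }
}
```

Since bat() runs the script through the Windows command interpreter rather than the Cygwin shell, a pass here would point towards the problem being in how the sh step's wrapper script is handled in the container, not in the container startup itself.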

sxa commented 1 month ago

Noting also that having git bash in the path first (before the Cygwin one) makes no difference - the error still occurs.

sxa commented 1 month ago

Memo to self: We have some functions executed in Windows pipelines that are run on either Windows or UNIX systems depending on the pipeline - specifically the writeMetadata function https://github.com/adoptium/ci-jenkins-pipelines/blob/4bfdbb67722dd7e96b256511ac6586e749650524/pipelines/build/common/openjdk_build_pipeline.groovy#L1280

I'm going to leave this with sh in these cases for now, and switch attention to another PR.

sxa commented 1 month ago

Memo to self: We have some functions executed in Windows pipelines that are run on either Windows or UNIX systems depending on the pipeline

Now sorted that case using an isUnix() test, which actually tests for "Not Windows" on the machine it's running on. That is much better than my previous check, which tested whether we were doing a windows build in a docker pipeline. (I hadn't spotted that built-in function previously.)
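As an illustration of that pattern, a minimal scripted-pipeline sketch. The runOnAgent helper and its parameters are hypothetical names, not what writeMetadata actually uses; isUnix(), sh and bat are the standard Pipeline steps.

```groovy
// Hypothetical helper: picks the step that matches the agent's OS.
// isUnix() returns true when the enclosing node is not running Windows.
def runOnAgent(String unixCommand, String windowsCommand) {
    if (isUnix()) {
        sh(script: unixCommand)
    } else {
        bat(script: windowsCommand)
    }
}

node {
    // isUnix() needs a node context, so the helper is called inside node {}
    runOnAgent('hostname', 'hostname')
}
```

Keying off the agent's OS rather than the build configuration means the same function behaves correctly whether the pipeline happens to run it on a Windows or a UNIX machine.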