adoptium / infrastructure

This repo contains all information about machine maintenance.
Apache License 2.0

Ansible request for cuda lib(s) in the docker build images #3597

Open AdamBrousseau opened 2 weeks ago

AdamBrousseau commented 2 weeks ago

Details:

OpenJ9 compiles require an Nvidia CUDA lib (or a few) on Linux (ppc64le and x64) in order to compile with CUDA support. There is a mechanism in the Adopt pipelines that adds to the Adopt build image by rebuilding it with the libs copied from the nvidia container [1][2][3]. I believe this is a different requirement from the tests needing the CUDA toolkit install [4] (related: #3581). I suspect that when the PBs were set up and the build scripts were originally written, the two requirements were thought to be one and the same, or there was enough confusion that we ended up adding a skip-tag to the cuda role when we build the docker build images [5] in order to minimize the image size.

My proposal is that another PB is created that adds just the few lib(s) we need for compile machines/containers, so that it doesn't need to be skipped for the docker builds. This would let us drop the workaround in the build pipelines. It would also let us use the build images in our other set of OpenJ9 pipeline builds without having to build in this extra mechanism to add the lib(s) on the fly. At the moment we maintain one of our own containers with the lib(s) built in, but we'd like to switch over to Adopt's container.

cc @keithc-ca

Slack with @sxa on this topic https://adoptium.slack.com/archives/C09NW3L2J/p1717421779597569

[1] https://github.com/ibmruntimes/ci-jenkins-pipelines/blob/191f1ffe1fdc96b94a15035e1fd5361ce7659ce7/pipelines/jobs/configurations/jdk11u_pipeline_config.groovy#L25

[2] https://github.com/ibmruntimes/ci-jenkins-pipelines/blob/ibm/pipelines/build/dockerFiles/cuda.dockerfile

[3] https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk11u/job/jdk11u-linux-x64-openj9/1410/consoleFull

13:00:28  + docker build -t build-image --build-arg image=adoptopenjdk/centos6_build_image -f pipelines/build/dockerFiles/cuda.dockerfile .
13:00:28  #1 [internal] load build definition from cuda.dockerfile
13:00:28  #1 sha256:0a0eb7e469a0c517623acabc5e231ee077b40648782661db525e22b550a01850
13:00:28  #1 transferring dockerfile: 469B done
13:00:28  #1 DONE 0.0s
13:00:28  
13:00:28  #3 [internal] load metadata for docker.io/adoptopenjdk/centos6_build_image:latest
13:00:28  #3 sha256:e8101b55035147fccdd44222b10f6fa3b709e62699296c30c6d94702b66fae29
13:00:28  #3 DONE 0.0s
13:00:28  
13:00:28  #2 [internal] load metadata for docker.io/nvidia/cuda:9.0-devel-ubuntu16.04
13:00:28  #2 sha256:e57ed62e41cf45db9af2c6b41cb181b618b777c5b492d0ce0331f40cbf47633d
13:00:28  #2 DONE 0.0s
13:00:28  
13:00:28  #4 [internal] load .dockerignore
13:00:28  #4 sha256:ca2b5769b850bc70000aee595f4b4f843ee5da9de00d2d176d35b227c62f4f42
13:00:28  #4 transferring context: 2B done
13:00:28  #4 DONE 0.0s
13:00:28  
13:00:28  #9 [stage-0 1/4] FROM docker.io/adoptopenjdk/centos6_build_image:latest
13:00:28  #9 sha256:3db1d566653c6d887b0b20b567d6b23bfe339ea0df110d0d557fddb1ef937994
13:00:28  #9 CACHED
13:00:28  
13:00:28  #7 FROM docker.io/nvidia/cuda:9.0-devel-ubuntu16.04
13:00:28  #7 sha256:286cda7a99a95805fd7694229d9c65777e83cad7730791ba3ea32a7991f89efd
13:00:28  #7 CACHED
13:00:28  
13:00:28  #8 [stage-0 2/4] RUN mkdir -p /usr/local/cuda-9.0/nvvm
13:00:28  #8 sha256:1a21c9fa453a13fe0ac554eb26721b832f363d9943a7f908cc07fa29e5ae2f6a
13:03:21  #8 DONE 170.5s
13:03:21  
13:03:21  #6 [stage-0 3/4] COPY --from=nvidia/cuda:9.0-devel-ubuntu16.04 /usr/local/cuda-9.0/include /usr/local/cuda-9.0/include
13:03:21  #6 sha256:0b72122c224181e435d90b1f1de8bed3df71a1cf77cce2faeb57474abeece190
13:03:21  #6 DONE 0.2s
13:03:21  
13:03:21  #5 [stage-0 4/4] COPY --from=nvidia/cuda:9.0-devel-ubuntu16.04 /usr/local/cuda-9.0/nvvm/include /usr/local/cuda-9.0/nvvm/include
13:03:21  #5 sha256:1930658618775c009b98dd89ce1d724a6c4267db9e139c6ec617744e545787d0
13:03:21  #5 DONE 0.0s
13:03:21  
13:03:21  #10 exporting to image
13:03:21  #10 sha256:bc9feab4fe4df18c7ae2feaa7b1e9798da77512cb963c00a3c11a634f918d8ca
13:03:21  #10 exporting layers
13:03:21  #10 exporting layers 0.3s done
13:03:21  #10 writing image sha256:75cc0c2bd8d7f2132bc552719ac3297cf419e29d3ec8d0c7095f8e23993dc0c6 done
13:03:21  #10 naming to docker.io/library/build-image:latest done
13:03:21  #10 DONE 0.3s
[Pipeline] }
[Pipeline] // withEnv
[Pipeline] isUnix
[Pipeline] withEnv
[Pipeline] {
[Pipeline] sh
13:03:22  + docker inspect -f . build-image
13:03:22  .
[Pipeline] }
[Pipeline] // withEnv
[Pipeline] withDockerContainer
13:03:22  dockerhost-equinix-ubuntu2204-x64-1 does not seem to be running inside a container
13:03:23  $ docker run -t -d -u 1000:1000 -w /home/jenkins/workspace/build-scripts/jobs/jdk11u/jdk11u-linux-x64-openj9 -v /home/jenkins/workspace/build-scripts/jobs/jdk11u/jdk11u-linux-x64-openj9:/home/jenkins/workspace/build-scripts/jobs/jdk11u/jdk11u-linux-x64-openj9:rw,z -v /home/jenkins/workspace/build-scripts/jobs/jdk11u/jdk11u-linux-x64-openj9@tmp:/home/jenkins/workspace/build-scripts/jobs/jdk11u/jdk11u-linux-x64-openj9@tmp:rw,z -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** build-image cat
13:03:48  $ docker top 2985a9ff451031d13d6809bf49fe05ec96090fcfa8d2d2e070b186b2bb9cfeac -eo pid,comm

[4] https://github.com/adoptium/infrastructure/blob/master/ansible/playbooks/AdoptOpenJDK_Unix_Playbook/roles/NVidia_Cuda_Toolkit/tasks/main.yml

[5] https://github.com/adoptium/infrastructure/blob/c96f2d57b511e888cd465e01a7433199b776ab73/ansible/docker/Dockerfile.CentOS7#L15

keithc-ca commented 2 weeks ago

OpenJ9 builds require no libraries (.so files), only a subset of CUDA header files (see the COPY --from=nvidia/cuda:9.0-devel-ubuntu16.04 ... line above).

Test machines only need the CUDA driver (libcuda.so and any requisite kernel module) and the runtime library (libcudartNN.so).
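
For readers following along, the header-only requirement can be read off the build steps in the console log above. What follows is a minimal sketch reconstructed from those four logged steps, not a copy of the actual cuda.dockerfile in [2], which may differ. The `image` build argument name is taken from the `--build-arg image=...` flag in the log; the form of the `ARG`/`FROM` lines is an assumption.

```dockerfile
# Sketch reconstructed from the logged build steps; the real cuda.dockerfile may differ.

# Base build image, supplied with: --build-arg image=adoptopenjdk/centos6_build_image
ARG image
FROM ${image}

# Create the parent directory for the nvvm headers.
RUN mkdir -p /usr/local/cuda-9.0/nvvm

# Copy only header files from the Nvidia image; no .so libraries are copied.
COPY --from=nvidia/cuda:9.0-devel-ubuntu16.04 /usr/local/cuda-9.0/include /usr/local/cuda-9.0/include
COPY --from=nvidia/cuda:9.0-devel-ubuntu16.04 /usr/local/cuda-9.0/nvvm/include /usr/local/cuda-9.0/nvvm/include
```

Only the two `include` directories end up in the build image, which matches the point that compile machines need headers rather than the driver or runtime library.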

AdamBrousseau commented 2 weeks ago

Right, sorry. Header files not libs.

AdamBrousseau commented 2 weeks ago

Line in the pipeline where the docker image gets built https://github.com/ibmruntimes/ci-jenkins-pipelines/blob/d439f31275a0da6510ca946e91ea0738742df368/pipelines/build/common/openjdk_build_pipeline.groovy#L2428C44-L2428C188