Open santhoshvly opened 1 week ago
This is due to enforcement of the `apt-key add -` deprecation. The trust databases need to be separated into their own keyring files and referenced by path (via `signed-by`) in each repo's sources.list entry. I have an implementation complete in my rapids branch; I could integrate it into master.
https://github.com/cjac/initialization-actions/blob/rapids-20240806/gpu/install_gpu_driver.sh#L1077
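The rewrite described above can be sketched roughly as follows. This is a minimal illustration, not the code from the linked branch: the `add_signed_by` helper name and the MySQL keyring path are assumptions chosen to mirror the `sed` rewrite that appears later in this thread.

```shell
#!/bin/bash
# Sketch of the signed-by migration: rewrite a "deb https://..." sources
# line so apt verifies the repo against a dedicated keyring file instead
# of the deprecated shared trust database. Helper name is hypothetical.
add_signed_by() {
  local line="$1" keyring="$2"
  # Insert [signed-by=<keyring>] between "deb" and the URL.
  printf '%s\n' "${line}" | sed -e "s:^deb https:deb [signed-by=${keyring}] https:"
}

add_signed_by 'deb https://repo.mysql.com/apt/debian/ bookworm mysql-8.0' \
              /usr/share/keyrings/mysql.gpg
# → deb [signed-by=/usr/share/keyrings/mysql.gpg] https://repo.mysql.com/apt/debian/ bookworm mysql-8.0
```

In a real init action the keyring itself would first be written with `gpg --dearmor` into `/usr/share/keyrings/`, and `/etc/apt/trusted.gpg` removed afterward.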
Santosh, did you say you've tried this workaround and that it's unblocked you?
Please review and test #1240
@cjac Yes, I tried the workaround script you mentioned, but it is still breaking with a similar error on Dataproc 2.2:
```
-----END PGP PUBLIC KEY BLOCK-----'
sed -i -e 's:deb https:deb [signed-by=/usr/share/keyrings/mysql.gpg] https:g' /etc/apt/sources.list.d/mysql.list
rm -rf /etc/apt/trusted.gpg
main
is_debian
++ os_id
++ cut -d= -f2
++ grep '^ID=' /etc/os-release
++ xargs
[[ debian == \d\e\b\i\a\n ]]
remove_old_backports
is_debian12
is_debian
++ os_id
++ xargs
++ cut -d= -f2
++ grep '^ID=' /etc/os-release
[[ debian == \d\e\b\i\a\n ]]
++ os_version
++ xargs
++ cut -d= -f2
++ grep '^VERSION_ID=' /etc/os-release
[[ 12 == \1\2* ]]
return
is_debian
++ os_id
++ xargs
++ cut -d= -f2
++ grep '^ID=' /etc/os-release
[[ debian == \d\e\b\i\a\n ]]
export DEBIAN_FRONTEND=noninteractive
DEBIAN_FRONTEND=noninteractive
execute_with_retries 'apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64'
local -r 'cmd=apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64'
(( i = 0 ))
(( i < 3 ))
eval 'apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64'
++ apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64
E: Error, pkgProblemResolver::Resolve generated breaks, this may be caused by held packages.
sleep 5
(( i++ ))
(( i < 3 ))
eval 'apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64'
++ apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64
E: Error, pkgProblemResolver::Resolve generated breaks, this may be caused by held packages.
sleep 5
(( i++ ))
(( i < 3 ))
eval 'apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64'
++ apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64
E: Error, pkgProblemResolver::Resolve generated breaks, this may be caused by held packages.
sleep 5
(( i++ ))
(( i < 3 ))
return 1
```
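For readers following the trace: the retry wrapper visible above behaves roughly like this. This is a minimal sketch reconstructed from the trace (loop of three attempts, 5-second sleeps); the actual `execute_with_retries` in `install_gpu_driver.sh` may differ in detail.

```shell
#!/bin/bash
# Minimal sketch of the execute_with_retries pattern seen in the trace:
# run a command up to three times, sleeping 5 seconds after each failure.
execute_with_retries() {
  local -r cmd="$1"
  for ((i = 0; i < 3; i++)); do
    if eval "${cmd}"; then return 0; fi
    sleep 5
  done
  return 1
}

# A command that never succeeds exhausts all three attempts:
execute_with_retries 'false' || echo "all retries failed"
```

In the log above, all three `apt-get install` attempts hit the same `pkgProblemResolver::Resolve generated breaks` error, so the wrapper returns 1 and the init action aborts.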
@cjac I have disabled Secure Boot in Dataproc. Is that okay, or should we enable it for this workaround?
To use Secure Boot, you'll need to build a custom image. Instructions here:
https://github.com/GoogleCloudDataproc/custom-images/tree/master/examples/secure-boot
You do not need Secure Boot enabled for the workaround to function. I think you may just be missing an `apt-get update` after the sources.list files are cleaned up and the trust keys are written to /usr/share/keyrings.
The package cache update command is included in #1240 as commit 234515d674b73ce8f191184c950535975fc5acaf.
@cjac I tried with that, but it is still breaking with the same error.
I forgot that I'm pinned to 2.2.20-debian12
I'll try to make it work with the latest from the 2.2 line.
Okay, thank you. I am getting the error on 2.2.32-debian12.
This might do it:

```shell
if is_debian ; then
  clean_up_sources_lists
  apt-get update
  export DEBIAN_FRONTEND="noninteractive"
  echo "Begin full upgrade"
  date
  apt-get --yes -qq -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" full-upgrade
  date
  echo "End full upgrade"
  # Retry any packages that apt kept back, allowing held packages to change
  pkgs="$(apt-get -y full-upgrade 2>&1 | grep -A9 'The following packages have been kept back:' | grep '^ ')"
  apt-get install -y --allow-change-held-packages -qq ${pkgs}
fi
```
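To illustrate what that last pipeline extracts, here is the same grep chain run against a fabricated sample of `apt-get full-upgrade` output (the package names are made up for demonstration):

```shell
#!/bin/bash
# Fabricated apt-get full-upgrade output showing the "kept back" section.
sample='Reading package lists...
The following packages have been kept back:
  linux-headers-cloud-amd64 linux-image-cloud-amd64
0 upgraded, 0 newly installed, 0 to remove and 2 not upgraded.'

# Same pipeline as in the snippet above: take up to 9 lines after the
# "kept back" header, then keep only the indented package-name lines.
pkgs="$(printf '%s\n' "${sample}" | grep -A9 'The following packages have been kept back:' | grep '^ ')"
echo "${pkgs}"
# →   linux-headers-cloud-amd64 linux-image-cloud-amd64
```

Those names are then fed to `apt-get install --allow-change-held-packages`, which is what clears the `pkgProblemResolver::Resolve generated breaks` failure seen earlier.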
yes, that last iteration does seem to get the installer working for me on 2.2 latest
@cjac Thank you. I tried with the above changes, but cluster creation still failed. It didn't give the previous package installation error and looks good in the init script logs; the last few lines of the install_gpu_driver.sh output are below:
```
update-alternatives: using /usr/lib/mesa-diverted to provide /usr/lib/glx (glx) in auto mode
Processing triggers for initramfs-tools (0.142+deb12u1) ...
update-initramfs: Generating /boot/initrd.img-6.1.0-25-cloud-amd64
Processing triggers for libc-bin (2.36-9+deb12u8) ...
Processing triggers for man-db (2.11.2-2) ...
Processing triggers for glx-alternative-mesa (1.2.2) ...
update-alternatives: updating alternative /usr/lib/mesa-diverted because link group glx has changed slave links
Setting up glx-alternative-nvidia (1.2.2) ...
Processing triggers for glx-alternative-nvidia (1.2.2) ...
Setting up nvidia-alternative (550.54.14-1) ...
Processing triggers for nvidia-alternative (550.54.14-1) ...
update-alternatives: using /usr/lib/nvidia/current to provide /usr/lib/nvidia/nvidia (nvidia) in auto mode
Setting up nvidia-kernel-support (550.54.14-1) ...
Setting up libnvidia-ml1:amd64 (550.54.14-1) ...
Setting up nvidia-smi (550.54.14-1) ...
Processing triggers for nvidia-alternative (550.54.14-1) ...
update-alternatives: updating alternative /usr/lib/nvidia/current because link group nvidia has changed slave links
Setting up nvidia-kernel-open-dkms (550.54.14-1) ...
Loading new nvidia-current-550.54.14 DKMS files...
Building for 6.1.0-25-cloud-amd64
Building initial module for 6.1.0-25-cloud-amd64
```
I am seeing the following error in the Dataproc logs:

```
DEFAULT 2024-09-27T02:58:49.624652770Z Setting up xserver-xorg-video-nvidia (560.35.03-1) ...
DEFAULT 2024-09-27T02:58:49.807012159Z Redundant argument in sprintf at /usr/share/perl5/Debconf/Element/Noninteractive/Error.pm line 54,
```
I think this error caused the cluster creation failure.
@cjac We have been unable to create a Dataproc GPU cluster since the Dataproc 2.1/2.2 upgrade. Please let me know if there is any workaround to proceed with cluster creation.
I did publish another version since we last spoke. Can you please review the code at https://github.com/GoogleCloudDataproc/initialization-actions/pull/1240/files? The tests passed on the last commit but took two hours and one minute to complete. This latest update should reduce the runtime significantly.
I received those messages as well, but they should just be warnings. Does the new change get things working?
@cjac I tried the latest script, but the Dataproc initialization action is breaking with a timeout error and the cluster is not starting:
```
name: "gs://syn-development-kub/syn-cluster-config/install_gpu_driver.sh"
type: INIT_ACTION
state: FAILED
start_time { seconds: 1727708007 nanos: 938000000 }
end_time { seconds: 1727708408 nanos: 209000000 }
error_detail: "Initialization action timed out. Failed action 'gs://syn-development-kub/syn-cluster-config/install_gpu_driver.sh', see output in: gs://syn-development-kub/google-cloud-dataproc-metainfo/20d0767a-6c0a-4eea-a0de-6ba1cc16207a/dataproc-22-gpu-test-691fd61a-a3ec9b72-w-0/dataproc-initialization-script-0_output"
error_code: TASK_FAILED
```
I couldn't find any error details in the init script output. I am attaching the init script output for your reference. google-cloud-dataproc-metainfo_20d0767a-6c0a-4eea-a0de-6ba1cc16207a_dataproc-22-gpu-test-691fd61a-a3ec9b72-w-0_dataproc-initialization-script-0_output.txt
Can you increase your timeout by 5-10 minutes? I do have a fix that's in the works for the base image, and once it gets published, we should be able to skip the full upgrade in the init action.
Here is a recent cluster build I did in my repro lab. It took 14m47.946s:

```
Fri Sep 27 04:49:21 PM PDT 2024
+ gcloud dataproc clusters create cluster-1718310842 --master-boot-disk-type pd-ssd --worker-boot-disk-type pd-ssd --secondary-worker-boot-disk-type pd-ssd --num-masters=1 --num-workers=2 --master-boot-disk-size 100 --worker-boot-disk-size 100 --secondary-worker-boot-disk-size 50 --master-machine-type n1-standard-16 --worker-machine-type n1-standard-16 --master-accelerator type=nvidia-tesla-t4 --worker-accelerator type=nvidia-tesla-t4 --region us-west4 --zone us-west4-a --subnet subnet-cluster-1718310842 --no-address --service-account=sa-cluster-1718310842@cjac-2021-00.iam.gserviceaccount.com --tags=tag-cluster-1718310842 --bucket cjac-dataproc-repro-1718310842 --enable-component-gateway --metadata install-gpu-agent=true --metadata gpu-driver-provider=NVIDIA --metadata public_secret_name=efi-db-pub-key-042 --metadata private_secret_name=efi-db-priv-key-042 --metadata secret_project=cjac-2021-00 --metadata secret_version=1 --metadata modulus_md5sum=d41d8cd98f00b204e9800998ecf8427e --metadata dask-runtime=yarn --metadata bigtable-instance=cjac-bigtable0 --metadata rapids-runtime=SPARK --initialization-actions gs://cjac-dataproc-repro-1718310842/dataproc-initialization-actions/gpu/install_gpu_driver.sh,gs://cjac-dataproc-repro-1718310842/dataproc-initialization-actions/dask/dask.sh,gs://cjac-dataproc-repro-1718310842/dataproc-initialization-actions/rapids/rapids.sh --initialization-action-timeout=90m --metadata bigtable-instance=cjac-bigtable0 --no-shielded-secure-boot --image-version 2.2 --max-idle=8h --scopes https://www.googleapis.com/auth/cloud-platform,sql-admin
Waiting on operation [projects/cjac-2021-00/regions/us-west4/operations/094ca004-2e9f-32f6-94e1-53c8f6799624].
Waiting for cluster creation operation...
WARNING: Consider using Auto Zone rather than selecting a zone manually. See https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/auto-zone
Waiting for cluster creation operation...done.
Created [https://dataproc.googleapis.com/v1/projects/cjac-2021-00/regions/us-west4/clusters/cluster-1718310842] Cluster placed in zone [us-west4-a].

real    14m47.946s
user    0m4.854s
sys     0m0.426s
+ date
Fri Sep 27 05:04:09 PM PDT 2024
```
I see that I hard-coded a regional bucket path into the code. This will slow things down when running outside of us-west4; I'll fix that next.
@cjac Adding the timeout fixed the error and the cluster was created. We are able to run GPU workloads on the cluster. Thank you so much for the support!
Glad I could help!
I'll work on getting these changes integrated into the base image.
Hi,
I am trying to attach GPUs to a Dataproc 2.2 cluster, but it is breaking and cluster creation is failing. Secure Boot is disabled and I am using the latest install_gpu_driver.sh from this repository. I am getting the following error during cluster initialization:

```
++ tr '[:upper:]' '[:lower:]'
++ lsb_release -is
return 1
```

Please let me know if I am missing anything, or whether there is a workaround to proceed further.