Open olbapjose opened 7 months ago
I am in the same situation with 1.5-debian10.
Thank you for the report. We are addressing this issue with the highest priority.
The fix https://github.com/GoogleCloudDataproc/initialization-actions/pull/1161 for the GPU init action has been verified. We are already working on the same patch for the other init actions that are failing with the same error.
For an urgent fix, customers/developers can clone the init action, add the same lines of code as in the fix to their copy, and use that copy for cluster creation. Please note that we do not encourage customers to use a cloned init script: the copy will not receive updates, so it has to be cloned again every time the init actions repository changes. So unless it is urgent, please wait for the other fixes to go in :)
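A minimal sketch of the cloning workaround described above. It assumes the public regional bucket layout gs://goog-dataproc-initialization-actions-<region>/ and a destination bucket you own. The function name stage_init_action, the init-actions/ prefix, and the GSUTIL override are hypothetical and not part of the repository.

```shell
# Sketch: copy an upstream init action into your own bucket so that the
# copy can be patched by hand with the lines from the fix.
# ASSUMPTION: upstream scripts live under
#   gs://goog-dataproc-initialization-actions-<region>/<path>
# GSUTIL can be overridden (for example with "echo") for a dry run.
stage_init_action() {
  local region="$1"
  local script_path="$2"
  local dest_bucket="$3"
  local src="gs://goog-dataproc-initialization-actions-${region}/${script_path}"
  local dest="gs://${dest_bucket}/init-actions/${script_path}"
  # Copy the script; tool output goes to stderr so stdout carries only the path.
  ${GSUTIL:-gsutil} cp "${src}" "${dest}" >&2 || return 1
  # Print the destination so it can be passed to --initialization-actions.
  echo "${dest}"
}
```

For example, stage_init_action us-central1 gpu/install_gpu_driver.sh my-bucket would print the path of the staged copy (my-bucket is a placeholder). The copy still has to be edited and re-uploaded to include the fix.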
@akhanna213 I just tried the latest version of install_gpu_driver.sh: I went through the process of creating a Dataproc cluster through the UI, set it to use that latest version of the driver script, and I am still running into initialization issues.
@akhanna213 @cjac I have run the command and it is still failing. Could you please provide an update? It is very important for us to have this up and running. I am using --image-version 2.0-debian10, which I know is a bit old, but I don't think it is related to the issue, correct?
Thanks
Hi @ahmedetefy @olbapjose, could you confirm whether the error message is still the same? We rolled out the fix a while back.
@akhanna213 Please see the image below and the attachment, which is the output file mentioned in the error.
google-cloud-dataproc-metainfo_initialization-script-0_output.txt
Long story short, the error says 'Unable to update packages lists.'
@akhanna213 Yes, I can confirm the error is still there.
Reproducing the error is quite straightforward:
gcloud dataproc clusters create cluster-e485 \
  --enable-component-gateway \
  --bucket <bucket_name> \
  --region <your-region> \
  --single-node \
  --master-machine-type n1-standard-8 \
  --master-boot-disk-type pd-balanced \
  --master-boot-disk-size 500 \
  --master-accelerator type=nvidia-tesla-t4 \
  --image-version <any 2.1 or above image version> \
  --optional-components JUPYTER \
  --initialization-actions '< gcs_path to latest install GPU driver script >' \
  --project <project_name>
I have also had issues with 2.0-ubuntu18 (even though it sometimes succeeds in installing the GPU drivers).
Here are the error logs, in case they help:
E: Repository 'https://packages.cloud.google.com/apt google-cloud-logging-bionic-all InRelease' changed its 'Codename' value from 'google-cloud-logging-stretch-all' to 'google-cloud-logging-bionic-all'
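Since several different apt failures are quoted in this thread, a small triage helper can tell them apart in the init action output file. This is only a sketch: classify_apt_failure and its labels are hypothetical, and the patterns are copied from the error messages reported here rather than from any official list.

```shell
# Sketch: label the apt failures quoted in this thread when they show up in
# an init action output file. The helper name and the labels are hypothetical.
classify_apt_failure() {
  local log_file="$1"
  if grep -q "changed its 'Codename' value" "$log_file"; then
    # Release metadata of a repository changed (see the error above).
    echo "releaseinfo-changed"
  elif grep -q "404  *Not Found" "$log_file"; then
    # A package or index file is no longer at the URL apt expects.
    echo "package-not-found"
  elif grep -q "Unable to update packages lists" "$log_file"; then
    # Generic summary message, checked last because it can accompany the others.
    echo "update-failed"
  else
    echo "unknown"
  fi
}
```

It could be run against a downloaded copy of google-cloud-dataproc-metainfo_initialization-script-0_output.txt to see which case applies before choosing a workaround.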
Hi @ahmedetefy @olbapjose, this looks like a different issue from the one users were facing earlier. Let me check with the team to understand what is causing this breakage. I appreciate your patience on this and will get back to you as soon as possible.
Hi @akhanna213 do you have updates on this? Initially I was able to do a workaround by adding --allow-releaseinfo-change:
function update_apt_get() {
  retry_apt_command "apt-get update --allow-releaseinfo-change"
}
and it worked, but today it is failing again with a different message:
The following NEW packages will be installed:
  gnupg2
0 upgraded, 1 newly installed, 0 to remove and 3 not upgraded.
Need to get 393 kB of archives.
After this operation, 411 kB of additional disk space will be used.
Err:1 http://deb.debian.org/debian buster/main amd64 gnupg2 all 2.2.12-1+deb10u1
  404 Not Found [IP: 151.101.22.132 80]
E: Failed to fetch http://deb.debian.org/debian/pool/main/g/gnupg2/gnupg2_2.2.12-1+deb10u1_all.deb 404 Not Found [IP: 151.101.22.132 80]
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
I will try again with --fix-missing, but it looks like the script is not robust, as it is exposed to several possible points of failure.
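The workaround above calls retry_apt_command, whose body is not shown in this thread. As a rough sketch of what such a wrapper might look like (not the repository's actual implementation), here is a retry helper under the different name retry_command. RETRY_ATTEMPTS and RETRY_DELAY are hypothetical knobs added for illustration.

```shell
# Sketch of a retry wrapper; NOT the repository's retry_apt_command.
# RETRY_ATTEMPTS and RETRY_DELAY are hypothetical settings.
retry_command() {
  local cmd="$1"
  local attempts="${RETRY_ATTEMPTS:-10}"
  local delay="${RETRY_DELAY:-5}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # eval lets the caller pass a full command line as a single string.
    if eval "$cmd"; then
      return 0
    fi
    echo "attempt ${i}/${attempts} failed: ${cmd}" >&2
    # Wait before the next attempt, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

update_apt_get() {
  # Accept changed Release metadata (such as a renamed Codename) instead of aborting.
  retry_command "apt-get update --allow-releaseinfo-change"
}
```

Retrying only helps with transient failures. A package that has been removed from the mirror, as in the 404 above, will fail on every attempt until the package lists are refreshed or the source is changed.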
Very recently, Dataproc clusters started to fail at creation because of an error in the Kafka initialization script, caused by a Debian repository that is no longer available:
https://deb.debian.org/debian buster-backports Release
The error says:
The contents of that file are as follows. Any advice or workaround is more than welcome.
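One possible workaround, sketched under the assumption that the cluster does not need any package from buster-backports: comment out the backports entries in the apt source lists before the init action runs apt-get update. The helper name disable_buster_backports is hypothetical, and the files to patch (typically /etc/apt/sources.list and files under /etc/apt/sources.list.d/) are an assumption about the image.

```shell
# Sketch: comment out buster-backports entries in an apt sources file so that
# apt-get update no longer requests the missing Release file.
# ASSUMPTION: nothing on the cluster needs packages from backports.
disable_buster_backports() {
  local list_file="$1"
  local tmp
  tmp="$(mktemp)" || return 1
  # Prefix every line that mentions buster-backports with "# ",
  # skipping lines that are already commented out.
  sed -e '/^[[:space:]]*#/!s/^\(.*buster-backports.*\)$/# \1/' "$list_file" > "$tmp" || return 1
  # Write back in place so the original file keeps its owner and permissions.
  cat "$tmp" > "$list_file"
  rm -f "$tmp"
}
```

At the top of a cloned init action this could be called as disable_buster_backports /etc/apt/sources.list. If backports packages are required, pointing the entry at archive.debian.org, where retired Debian suites are kept, may be an alternative, but that is untested here.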