GoogleCloudDataproc / initialization-actions

Run in all nodes of your cluster before the cluster starts - lets you customize your cluster
https://cloud.google.com/dataproc/init-actions
Apache License 2.0

initialization actions which use apt-get update fail due to purged oldoldstable backports repository #1157

Open olbapjose opened 7 months ago

olbapjose commented 7 months ago

Very recently, Dataproc clusters started to fail at creation, due to an error in the Kafka initialization script, caused by a Debian repository no longer available:

https://deb.debian.org/debian buster-backports Release

The error says:

The contents of that file are as follows. Any advice or workaround is more than welcome.

image

kishida-yuki commented 7 months ago

I am in the same situation with 1.5-debian10.

cjac commented 7 months ago

Thank you for the report. We are addressing this issue with the highest priority.

akhanna213 commented 7 months ago

The fix https://github.com/GoogleCloudDataproc/initialization-actions/pull/1161 for the GPU init actions has been verified. We are already working on the same fix for the other init actions that fail with the same error.

For an urgent fix, customers/developers can clone the init action, add the same lines of code as in the fix to their copy, and use that copy for cluster creation. Please note that we do not encourage customers to use a cloned init script: it will not receive updates, and it has to be cloned again every time the init actions repository changes. So unless it is urgent, please wait for the other fixes to go in :)
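As a rough illustration of what such a patch in a cloned copy might look like, the sketch below comments out apt source lines that mention buster-backports. It is not the code from the linked pull request: the function name `disable_purged_backports`, the hard-coded suite name, and the choice to comment lines out rather than delete them are all assumptions. It also relies on GNU `sed -i`, as found on Debian images.

```shell
#!/usr/bin/env bash
# Hypothetical helper (NOT taken from PR #1161): disable apt source entries
# that point at the purged buster-backports suite.
disable_purged_backports() {
  local suite="buster-backports"
  local file
  for file in "$@"; do
    # Skip arguments that are not regular files (e.g. an unmatched glob).
    [[ -f "${file}" ]] || continue
    # Prefix every line that mentions the suite and does not already start with '#'.
    sed -i -e "/^[^#].*${suite}/ s/^/# /" "${file}"
  done
}

# Assumed usage in a cloned init action, before its first 'apt-get update':
#   disable_purged_backports /etc/apt/sources.list /etc/apt/sources.list.d/*.list
```

Commenting the entries out keeps the original lines visible in the file, which makes it easier to see afterwards what the script changed on a node.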

ahmedetefy commented 6 months ago

@akhanna213 I just tried the latest version of install_gpu_driver.sh: I created a Dataproc cluster through the UI, pointed it at that latest version of the script, and I am still running into initialization issues.

olbapjose commented 6 months ago

@akhanna213 @cjac I have run the command and it is still failing. Could you please provide an update? It is very important for us to have this up and running. I am using --image-version 2.0-debian10 which I know is a bit old but I don't think it is related to the issue, correct?

Thanks

akhanna213 commented 6 months ago

Hi @ahmedetefy @olbapjose, could you confirm whether the error message is still the same? We rolled out the fix a while back.

olbapjose commented 6 months ago

@akhanna213 Please see the image below and the attachment, which is the output file mentioned in the error.

image

google-cloud-dataproc-metainfo_initialization-script-0_output.txt

Long story short, the error says 'Unable to update packages lists.'

ahmedetefy commented 6 months ago

@akhanna213 Yes I can confirm the error is still there

Reproducing the error is quite straightforward:

gcloud dataproc clusters create cluster-e485 \
  --enable-component-gateway \
  --bucket <bucket_name> \
  --region <your-region> \
  --single-node \
  --master-machine-type n1-standard-8 \
  --master-boot-disk-type pd-balanced \
  --master-boot-disk-size 500 \
  --master-accelerator type=nvidia-tesla-t4 \
  --image-version <any 2.1 or above image version> \
  --optional-components JUPYTER \
  --initialization-actions '< gcs_path to latest install GPU driver script >' \
  --project <project_name>

I have also had issues with 2.0-ubuntu18 (even though it succeeds in installing the GPU drivers sometimes)

Here are the error logs, in case they help:

E: Repository 'https://packages.cloud.google.com/apt google-cloud-logging-bionic-all InRelease' changed its 'Codename' value from 'google-cloud-logging-stretch-all' to 'google-cloud-logging-bionic-all'

akhanna213 commented 6 months ago

Hi @ahmedetefy @olbapjose , this looks like a different issue than what the users were facing earlier. Let me check with the team to understand what is causing this breakage. Appreciate your patience on this, let me get back to you as soon as possible.

olbapjose commented 6 months ago

Hi @akhanna213 do you have updates on this? Initially I was able to do a workaround by adding --allow-releaseinfo-change:

function update_apt_get() {
  retry_apt_command "apt-get update --allow-releaseinfo-change"
}
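The thread does not show `retry_apt_command`, which the init action defines elsewhere. For readers patching their own copy, a stand-in might look like the sketch below. The name `retry_command`, the default of five attempts, the fixed delay, and the `RETRY_ATTEMPTS` / `RETRY_DELAY_SECONDS` variables are assumptions for illustration, not the repository's implementation.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the script's retry helper (not the repository's code).
# Runs the given command string up to RETRY_ATTEMPTS times, sleeping between tries.
retry_command() {
  local cmd="$1"
  local attempts="${RETRY_ATTEMPTS:-5}"
  local delay="${RETRY_DELAY_SECONDS:-5}"
  local i
  for ((i = 1; i <= attempts; i++)); do
    if eval "${cmd}"; then
      return 0
    fi
    echo "Attempt ${i}/${attempts} failed: ${cmd}" >&2
    # Only wait if another attempt is coming.
    if ((i < attempts)); then
      sleep "${delay}"
    fi
  done
  return 1
}

# The same workaround as above, expressed with the stand-in helper.
update_apt_get() {
  retry_command "apt-get update --allow-releaseinfo-change"
}
```

Retrying only helps with transient failures such as a flaky mirror; a repository that has been removed for good will fail on every attempt, so the wrapper does not replace fixing the source list.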

and it worked, but today it is failing again with a different message:

The following NEW packages will be installed:
  gnupg2
0 upgraded, 1 newly installed, 0 to remove and 3 not upgraded.
Need to get 393 kB of archives.
After this operation, 411 kB of additional disk space will be used.
Err:1 http://deb.debian.org/debian buster/main amd64 gnupg2 all 2.2.12-1+deb10u1
  404 Not Found [IP: 151.101.22.132 80]
E: Failed to fetch http://deb.debian.org/debian/pool/main/g/gnupg2/gnupg2_2.2.12-1+deb10u1_all.deb 404 Not Found [IP: 151.101.22.132 80]
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?

I will try again with --fix-missing, but it looks like the script is not robust: it is exposed to several different points of failure.
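The failures reported in this thread each leave a distinct message in the apt output, so a cloned init action could at least log which one it hit. The sketch below is only an illustration: the function name `classify_apt_failure` and the labels it prints are made up, and the patterns are copied from the messages quoted in this thread rather than from any official list of apt errors.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read an apt log on stdin and print a label for the
# first failure pattern from this thread that it contains.
classify_apt_failure() {
  local log
  log="$(cat)"
  if grep -q "buster-backports Release" <<<"${log}"; then
    echo "purged-backports"      # the originally reported failure
  elif grep -q "changed its 'Codename' value" <<<"${log}"; then
    echo "releaseinfo-change"    # the Codename error reported later
  elif grep -Eq "404 +Not Found" <<<"${log}"; then
    echo "fetch-404"             # a package download returned 404
  else
    echo "unknown"
  fi
}

# Assumed usage against the output file mentioned earlier in the thread:
#   classify_apt_failure < google-cloud-dataproc-metainfo_initialization-script-0_output.txt
```

The labels only name the symptom. They say nothing about the root cause, which still has to be worked out per case.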