GoogleCloudDataproc / initialization-actions

Run in all nodes of your cluster before the cluster starts - lets you customize your cluster
https://cloud.google.com/dataproc/init-actions
Apache License 2.0

[oozie] Create and call function to await HDFS Data Nodes #1078

Closed cjac closed 1 year ago

cjac commented 1 year ago

Some users are experiencing errors when init actions begin running before HDFS has come online. This patch waits before the first call to HDFS in order to avoid this race condition.

This patch is intended as the fix for #1077.
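
For readers following along, here is a minimal sketch of what a wait of this kind could look like. It is not the code from this patch: the function names `live_datanode_count` and `await_hdfs_datanodes`, the attempt count, and the polling interval are all hypothetical, and it assumes that `hdfs dfsadmin -report` prints a line of the form `Live datanodes (N):`.

```bash
#!/bin/bash

# Hypothetical helper (not from the patch): print the number of live data
# nodes reported by the NameNode, or 0 if the report is unavailable.
# Assumes the report contains a line such as "Live datanodes (3):".
live_datanode_count() {
  local report count
  report="$(hdfs dfsadmin -report 2>/dev/null)" || true
  count="$(sed -n 's/^Live datanodes (\([0-9][0-9]*\)):.*/\1/p' <<<"${report}" | head -n 1)"
  echo "${count:-0}"
}

# Hypothetical helper: poll until at least one data node is live.
#   $1 = maximum number of attempts (default 60, an illustrative value)
#   $2 = seconds to sleep between attempts (default 5, an illustrative value)
# Returns 0 once a data node is live, 1 if the attempts run out.
await_hdfs_datanodes() {
  local max_attempts="${1:-60}" interval_seconds="${2:-5}" attempt
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    if (( $(live_datanode_count) > 0 )); then
      return 0
    fi
    sleep "${interval_seconds}"
  done
  return 1
}
```

An init action could call `await_hdfs_datanodes` once, just before its first `hdfs dfs` command. Bounding the number of attempts keeps the script from hanging forever if the data nodes never register.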

cjac commented 1 year ago

/gcbrun

cjac commented 1 year ago

The pause needs to happen before the initialization actions are launched. We can stagger the cluster node startup with a script such as the following.

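#!/bin/bash

set -x

# Only master nodes are staggered; all other roles proceed immediately.
readonly ROLE="$(/usr/share/google/get_metadata_value attributes/dataproc-role)"
if [[ "${ROLE}" != 'Master' ]]; then set +x; exit 0; fi

# Take the master index from a hostname such as <cluster>-m-2.
# Default to 0 when the hostname carries no numeric suffix, so the
# arithmetic below never sees an empty value.
node_number=0
if [[ "${HOSTNAME}" =~ -m-([0-9]+) ]]; then node_number="${BASH_REMATCH[1]}"; fi

# Delay each master by one minute per index: 0s, 60s, 120s, ...
delay_seconds=$(( node_number * 60 ))
sleep "${delay_seconds}"

NOW="$(date +"%F-%T")"
echo "instance #${node_number} (${HOSTNAME}) proceeds at ${NOW}" | tee /var/log/delay-masters.log

set +x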

Once the agent has launched and kicked off the init actions, the patched script should be able to run even if HDFS never comes online.
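
One way to get that behavior, sketched under assumptions rather than taken from the patch, is to guard each HDFS-dependent step behind a time-limited probe and skip the step when the probe fails. The names `HDFS_PROBE`, `hdfs_is_up`, and `run_hdfs_step` are hypothetical, as is the 30-second limit; the sketch relies on `timeout` from coreutils and on `hdfs dfs -test -d /` succeeding when the filesystem root is reachable.

```bash
#!/bin/bash

# Hypothetical probe command: succeeds when the HDFS root is reachable.
HDFS_PROBE=(hdfs dfs -test -d /)

# Hypothetical limit, in seconds, on how long one probe may take.
HDFS_PROBE_TIMEOUT=30

# Returns 0 if the probe succeeds within the time limit, non-zero otherwise.
hdfs_is_up() {
  timeout "${HDFS_PROBE_TIMEOUT}" "${HDFS_PROBE[@]}" >/dev/null 2>&1
}

# Runs the given command only when HDFS answers the probe. When HDFS is
# unavailable, logs a message to stderr and returns 0 so that a caller
# running under `set -e` can carry on with its remaining steps.
run_hdfs_step() {
  if hdfs_is_up; then
    "$@"
  else
    echo "HDFS unavailable; skipping: $*" >&2
  fi
}
```

A call such as `run_hdfs_step hdfs dfs -mkdir -p /user/oozie` would then be skipped, not fatal, on a cluster where HDFS never starts; the path here is only an illustration.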

cjac commented 1 year ago

/gcbrun

cjac commented 1 year ago

/gcbrun

cjac commented 1 year ago

I'm going to close this CL and open a new one. Somehow one of nvliyuan's ephemeral email addresses ended up listed as a contributor. This was likely caused by my rebasing the changes onto a different branch.

cjac commented 1 year ago

Continued in #1089.