Closed cjac closed 1 year ago
/gcbrun
The pause needs to happen before the initialization actions are launched. We can stagger the cluster node startup with a script such as the following.
#!/bin/bash
set -x
readonly ROLE="$(/usr/share/google/get_metadata_value attributes/dataproc-role)"
if [[ "${ROLE}" != 'Master' ]]; then set +x; exit 0; fi
node_number=$(echo ${HOSTNAME} | perl -ne '/-m-(\d+)/; print $1')
delay_seconds=$((${node_number} * 60))
sleep ${delay_seconds}s
NOW=$(date +"%F-%T")
echo "instance #${node_number} (${HOSTNAME}) proceeds at ${NOW}" | tee /var/log/delay-masters.log
set +x
Once the agent has launched and kicks off the init actions, the patched script should be capable of running even if the HDFS store does not come online ever.
/gcbrun
/gcbrun
I'm going to close this CL and open a new one. Somehow we got one of nvliyuan's ephemeral email addresses as a contributor. This was likely due to me rebasing my changes on a different branch.
continued in #1089
Some users are experiencing errors when init actions begin running before dfs has come online. This patch pauses before the first call to dfs in order to avoid this race condition.
This patch is intended as the fix for #1077