GoogleCloudDataproc / bdutil

[DEPRECATED] Script used to manage Hadoop and Spark instances on Google Compute Engine
https://cloud.google.com/dataproc
Apache License 2.0

Stability of script for HDP platform #53

Open stev-0 opened 9 years ago

stev-0 commented 9 years ago

This isn't meant as a criticism, as I realise there are 1,000 possible things that could be going wrong, but this script seems to deploy a cluster successfully in only about 1 in 5 attempts.

The exception seems to be different each time, but the common ones are:

at upload of config scripts:

Uploading   ...20150811-000113-Hq6/install-ambari-components.sh: 3.9 KiB/3.9 KiB
CommandException: 1 files/objects could not be transferred.

when running deploy scripts on master / workers:

Mon, Aug 10, 2015 11:55:27 PM: Exited 1 : gcloud --project=yyyy --quiet --verbosity=info compute ssh hadoop-w-1 --command=sudo su -l -c "cd ${PWD} && ./ambari-setup.sh" 2>>ambari-setup_deploy.stderr 1>>ambari-setup_deploy.stdout --ssh-flag=-tt --ssh-flag=-oServerAliveInterval=60 --ssh-flag=-oServerAliveCountMax=3 --ssh-flag=-oConnectTimeout=30 --zone=europe-west1-b
 Mon, Aug 10, 2015 11:55:28 PM: Fetching on-VM logs from hadoop-w-1
 Warning: Permanently added 'x.y.z.m' (RSA) to the list of known hosts.
...Mon, Aug 10, 2015 11:57:43 PM: Command failed: wait ${SUBPROC} on line 326.

during the ambari-components install:

 Mon, Aug 10, 2015 11:43:54 PM: Step 'deploy-client-nfs-setup,deploy-client-nfs-setup' done...

Mon, Aug 10, 2015 11:43:54 PM: Invoking on master: ./install-ambari-components.sh
../bdutil: line 318: 10548 Segmentation fault sleep '0.5'

These failures are intermittent and therefore hard to reproduce: I run the same script each time and hit a different error.

dennishuo commented 9 years ago

Thanks, every report helps :)

The "Segmentation fault" error is something we've never seen before; do you happen to know if the errors you're hitting are specific to ambari_env.sh, or do they also happen when you try to deploy default bdutil clusters?
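
One way to put numbers on that comparison would be to run each kind of deployment several times and count the failures. Below is a minimal sketch, not part of bdutil: `count_failures` is a hypothetical helper, and the wrapper scripts named in the usage comments are assumptions that you would write yourself (each one deploying a cluster and then tearing it down again).

```shell
#!/bin/sh
# Hypothetical helper (not part of bdutil): run a command a fixed number of
# times and print how many of those runs exited with a non-zero status.
# Usage: count_failures <runs> <command> [args...]
# The body is wrapped in ( ... ) so its variables stay in a subshell.
count_failures() (
  runs="$1"
  shift
  failures=0
  i=1
  while [ "$i" -le "$runs" ]; do
    # Send the command's own stdout to stderr so that stdout of this
    # function carries only the final failure count.
    if ! "$@" 1>&2; then
      failures=$((failures + 1))
    fi
    i=$((i + 1))
  done
  echo "$failures"
)

# Example usage (both script names are assumptions, not files shipped with bdutil):
#   count_failures 5 ./deploy_default_and_delete.sh
#   count_failures 5 ./deploy_ambari_and_delete.sh
```

If the default deployment fails about as often as the Ambari one, the problem is more likely in the shared upload and SSH steps than in anything specific to ambari_env.sh.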