awslabs / deeplearning-emr

Scripts and instructions to facilitate running Deep Learning Tasks on Amazon EMR
Apache License 2.0

Cluster creation fails during bootstrap actions when using a DL AMI #8

Open rootAvish opened 6 years ago

rootAvish commented 6 years ago

Hi,

I was using the command below to create a test EMR cluster:

aws emr create-cluster \
--release-label emr-5.12.0 \
--instance-type p3.2xlarge \
--instance-count 1 \
--ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,KeyName=[redacted],\
EmrManagedMasterSecurityGroup=[redacted],EmrManagedSlaveSecurityGroup=[redacted],SubnetId=[redacted] \
--service-role EMR_DefaultRole \
--custom-ami-id [redacted] \
--log-uri [redacted] \
--name chester-dev-test \
--region us-east-1

The deep learning AMI ID I'm using is the us-east-1 (N. Virginia) AMI ID for https://aws.amazon.com/marketplace/pp/B076T8RSXY. To bring up the cluster I've also tried several other AMIs, including the base deep learning AMI (https://aws.amazon.com/marketplace/pp/B077GFM7L7), every EMR version starting from 5.8.0, and every GPU instance type available.

However, the cluster always fails while executing the Amazon-defined bootstrap actions (not the user-defined ones), and there is no stderr in the bootstrap actions folder. When I looked under provision-node/<node-id>/stderr.gz, though, all the attempts had failed with the same error:

Error: Could not start Service[hadoop-hdfs-namenode]: Execution of '/sbin/start hadoop-hdfs-namenode' returned 1: start: Job failed to start
Wrapped exception:
Execution of '/sbin/start hadoop-hdfs-namenode' returned 1: start: Job failed to start
Error: /Stage[main]/Hadoop::Namenode/Service[hadoop-hdfs-namenode]/ensure: change from stopped to running failed: Could not start Service[hadoop-hdfs-namenode]: Execution of '/sbin/start hadoop-hdfs-namenode' returned 1: start: Job failed to start
Warning: /Stage[main]/Hadoop::Datanode/Package[hadoop-hdfs-datanode]: Skipping because of failed dependencies
Warning: /Stage[main]/Hadoop::Datanode/Hadoop::Create_storage_dir[/mnt/hdfs]/Exec[mkdir /mnt/hdfs]: Skipping because of failed dependencies
Warning: /Stage[main]/Hadoop::Datanode/File[/mnt/hdfs]: Skipping because of failed dependencies
Warning: /Stage[main]/Hadoop::Datanode/File[/etc/default/hadoop-hdfs-datanode]: Skipping because of failed dependencies
Warning: /Stage[main]/Hadoop::Datanode/Service[hadoop-hdfs-datanode]: Skipping because of failed dependencies
Warning: /Stage[main]/Hadoop::Init_hdfs/File[/var/lib/hadoop-hdfs/init-hcfs.json]: Skipping because of failed dependencies
Warning: /Stage[main]/Hadoop::Init_hdfs/Exec[hdfs ready]: Skipping because of failed dependencies
Warning: /Stage[main]/Hadoop::Init_hdfs/Exec[init hdfs]: Skipping because of failed dependencies
Error: Could not start Service[hadoop-yarn-proxyserver]: Execution of '/sbin/start hadoop-yarn-proxyserver' returned 1: start: Job failed to start
Wrapped exception:
Execution of '/sbin/start hadoop-yarn-proxyserver' returned 1: start: Job failed to start
Error: /Stage[main]/Hadoop::Proxyserver/Service[hadoop-yarn-proxyserver]/ensure: change from stopped to running failed: Could not start Service[hadoop-yarn-proxyserver]: Execution of '/sbin/start hadoop-yarn-proxyserver' returned 1: start: Job failed to start
Error: Could not start Service[hadoop-yarn-timelineserver]: Execution of '/sbin/start hadoop-yarn-timelineserver' returned 1: start: Job failed to start
Wrapped exception:
Execution of '/sbin/start hadoop-yarn-timelineserver' returned 1: start: Job failed to start
Error: /Stage[main]/Hadoop::Timelineserver/Service[hadoop-yarn-timelineserver]/ensure: change from stopped to running failed: Could not start Service[hadoop-yarn-timelineserver]: Execution of '/sbin/start hadoop-yarn-timelineserver' returned 1: start: Job failed to start
Error: Could not start Service[hadoop-yarn-resourcemanager]: Execution of '/sbin/start hadoop-yarn-resourcemanager' returned 1: start: Job failed to start
Wrapped exception:
Execution of '/sbin/start hadoop-yarn-resourcemanager' returned 1: start: Job failed to start
Error: /Stage[main]/Hadoop::Resourcemanager/Service[hadoop-yarn-resourcemanager]/ensure: change from stopped to running failed: Could not start Service[hadoop-yarn-resourcemanager]: Execution of '/sbin/start hadoop-yarn-resourcemanager' returned 1: start: Job failed to start
Warning: /Stage[main]/Hadoop::Resourcemanager/Exec[yarn rmadmin -refreshQueues]: Skipping because of failed dependencies
Error: /Stage[main]/Hadoop::Resourcemanager/Exec[yarn rmadmin -refreshQueues]: Failed to call refresh: Command exceeded timeout
Error: /Stage[main]/Hadoop::Resourcemanager/Exec[yarn rmadmin -refreshQueues]: Command exceeded timeout
Wrapped exception:
execution expired
Warning: /Stage[main]/Hadoop::Nodemanager/Package[hadoop-yarn-nodemanager]: Skipping because of failed dependencies
Warning: /Stage[main]/Hadoop::Nodemanager/Hadoop::Create_storage_dir[/mnt/yarn]/Exec[mkdir /mnt/yarn]: Skipping because of failed dependencies
Warning: /Stage[main]/Hadoop::Nodemanager/File[/mnt/yarn]: Skipping because of failed dependencies
Warning: /Stage[main]/Hadoop::Nodemanager/Service[hadoop-yarn-nodemanager]: Skipping because of failed dependencies
Error: Could not start Service[hadoop-mapreduce-historyserver]: Execution of '/sbin/start hadoop-mapreduce-historyserver' returned 1: start: Job failed to start
Wrapped exception:
Execution of '/sbin/start hadoop-mapreduce-historyserver' returned 1: start: Job failed to start
Error: /Stage[main]/Hadoop::Historyserver/Service[hadoop-mapreduce-historyserver]/ensure: change from stopped to running failed: Could not start Service[hadoop-mapreduce-historyserver]: Execution of '/sbin/start hadoop-mapreduce-historyserver' returned 1: start: Job failed to start
2018-03-08 15:20:22,380 ERROR main: Encountered a problem while provisioning
com.amazonaws.emr.node.provisioner.puppet.api.PuppetException: Unable to complete transaction and some changes were applied.
    at com.amazonaws.emr.node.provisioner.puppet.api.ApplyCommand.handleExitcode(ApplyCommand.java:74)
    at com.amazonaws.emr.node.provisioner.puppet.api.ApplyCommand.call(ApplyCommand.java:56)
    at com.amazonaws.emr.node.provisioner.bigtop.BigtopPuppeteer.applyPuppet(BigtopPuppeteer.java:50)
    at com.amazonaws.emr.node.provisioner.bigtop.BigtopDeployer.deploy(BigtopDeployer.java:21)
    at com.amazonaws.emr.node.provisioner.NodeProvisioner.provision(NodeProvisioner.java:25)
    at com.amazonaws.emr.node.provisioner.phase.PhaseWorkflow.work(PhaseWorkflow.java:56)
    at com.amazonaws.emr.node.provisioner.phase.ProvisionHadoopPhase.perform(ProvisionHadoopPhase.java:21)
    at com.amazonaws.emr.node.provisioner.Program.main(Program.java:20)

There are no details about why exactly the historyserver is failing to start. Is there an installation step this guide is missing when using the DL AMIs? This error never occurs when using the default AMI of EMR. I'm using a 60 GB EBS root volume and 35 GB of attached EBS storage, if that matters. I tried the p3.2xlarge and g2.2xlarge instance types with just one master instance and 0 core and 0 task nodes.
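For anyone triaging a similar log, here is a minimal sketch of how the services Puppet could not start might be listed from a provision-node stderr.gz. The sample lines are copied from the log above and written to a temporary gzip file as a stand-in for the real download. The temporary file and the variable names are illustrative assumptions, not part of EMR.

```shell
#!/bin/sh
# Stand-in for a downloaded provision-node/<node-id>/stderr.gz:
# three lines copied from the log in this issue, gzip-compressed.
log=$(mktemp)
cat <<'EOF' | gzip -c > "$log"
Error: Could not start Service[hadoop-hdfs-namenode]: Execution of '/sbin/start hadoop-hdfs-namenode' returned 1: start: Job failed to start
Warning: /Stage[main]/Hadoop::Datanode/File[/mnt/hdfs]: Skipping because of failed dependencies
Error: Could not start Service[hadoop-yarn-resourcemanager]: Execution of '/sbin/start hadoop-yarn-resourcemanager' returned 1: start: Job failed to start
EOF

# Keep only the "Could not start Service[...]" lines and print each
# service name once.
failed=$(gzip -dc "$log" \
  | sed -n 's/^Error: Could not start Service\[\([^]]*\)].*/\1/p' \
  | sort -u)
echo "$failed"

rm -f "$log"
```

This only narrows down which services failed. It does not explain why they failed.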

ElliotSwart commented 6 years ago

Run the following on the Deep Learning AMI log directory:

sudo chmod 777 /var/log

Though if you care, only executable permission (for all users) is required. Then take an image and use that.

EMR copies the log directory and creates files in it, but the permissions set up on the AMI are drwx------ and need to be at least drwx-----x.
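A minimal sketch of the narrower change (execute permission for others only), shown on a scratch directory standing in for /var/log so it can be tried without root. It assumes GNU stat (the `-c '%A'` format), and the scratch directory and variable names are illustrative assumptions.

```shell
#!/bin/sh
# Scratch directory standing in for /var/log on the DL AMI.
demo_dir=$(mktemp -d)

# Reproduce the restrictive mode reported on the AMI.
chmod 700 "$demo_dir"
before=$(stat -c '%A' "$demo_dir")

# Add execute (directory traversal) permission for "other" users only.
chmod o+x "$demo_dir"
after=$(stat -c '%A' "$demo_dir")

echo "before: $before"
echo "after:  $after"

rmdir "$demo_dir"

# On an instance launched from the AMI, the equivalent would presumably be:
#   sudo chmod o+x /var/log
# followed by taking a new image of that instance and passing its ID
# to --custom-ami-id.
```

Whether execute-only is enough for every EMR release has not been verified here. The `chmod 777` variant above is the one reported to work.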