aws-samples / otl-service-launcher

MIT No Attribution

module.emr_cluster - BOOTSTRAP_FAILURE prevents EMR from coming up during stack creation #7

Open adam-imeson opened 3 years ago

adam-imeson commented 3 years ago

We create the stack and get this:

Error: Error waiting for EMR Cluster state to be "WAITING" or "RUNNING": TERMINATING: BOOTSTRAP_FAILURE: On the master instance (i-0869e1ee0103b696d), application provisioning failed

Expected behavior: functioning EMR cluster. Observed behavior: stack creation fails with an error.

Not sure what's going on here yet. Wild-ass guess: an OS-level issue with something defined in https://github.com/Outposts-Test-Lab/otl-service-launcher/blob/main/modules/emr/emr_cluster_configurations.json


Need to add the log_uri argument to the cluster's definition so that I can see what's happening during the bootstrap: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/emr_cluster#log_uri
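A minimal sketch of what that change could look like, assuming the module exposes the cluster as a single `aws_emr_cluster` resource (the resource label and bucket name here are placeholders, not the repo's actual names):

```hcl
# Sketch only: enable S3 logging on the existing EMR cluster resource.
# "my-emr-debug-logs" is a placeholder bucket that must already exist.
resource "aws_emr_cluster" "this" {
  # ... existing arguments unchanged ...

  # EMR writes node, bootstrap-action, and step logs under this prefix,
  # including the master.log.gz and controller.gz files examined below.
  log_uri = "s3://my-emr-debug-logs/otl-service-launcher/"
}
```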


After fixing the duplicate security group issue (#4), the cluster now takes 48 minutes to fail during creation, and throws this error instead of the earlier BOOTSTRAP_FAILURE:

Error: Error waiting for EMR Cluster state to be "WAITING" or "RUNNING": TERMINATING: INTERNAL_ERROR: Failed to start the job flow due to an internal error

Going to go see what the back end has to say about this one.


Debugging found that the KMS policy governing access to the default encryption key wasn't allowing the service role defined for this cluster. I am switching it to use the default EMR_DefaultRole and will see if that fixes the issue.
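For reference, the alternative fix would have been to add the custom service role as a key user in the KMS key policy. A statement along these lines would do it (the account ID and role name are placeholders; this is the standard "key users" action set, not the exact policy from this stack):

```json
{
  "Sid": "AllowEmrServiceRoleUseOfKey",
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::111122223333:role/EMR_DefaultRole"
  },
  "Action": [
    "kms:Encrypt",
    "kms:Decrypt",
    "kms:ReEncrypt*",
    "kms:GenerateDataKey*",
    "kms:DescribeKey"
  ],
  "Resource": "*"
}
```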


Aaaand now we're back here:

Error: Error waiting for EMR Cluster state to be "WAITING" or "RUNNING": TERMINATING: BOOTSTRAP_FAILURE: On the master instance (i-020b7502319583698), application provisioning failed

Turned on logging and will dig into the logs to see what happened.


master.log.gz contents:

2021-03-04 22:13:35,153 INFO i-020b7502319583698: new instance started
2021-03-04 22:13:37,542 INFO i-020b7502319583698: bootstrap action 1 completed
2021-03-04 22:13:37,545 ERROR i-020b7502319583698: failed to start. bootstrap action 2 failed with non-zero exit code.

The above is weird because there's only one bootstrap action listed for the cluster in the Console. It's either a bug or a misconfig, and I don't understand EMR well enough to tell which.


Currently wondering if there's an issue with emr_cluster_configurations.json that is duplicating a bootstrap action. Going to try deleting the spark-env block and see what happens.
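For context, a `spark-env` block in an EMR configurations file generally has this shape (illustrative values only, not the repo's actual contents):

```json
[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3"
        }
      }
    ]
  }
]
```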


This is weird. The controller logs indicate that the bootstrap actions were successful, and there's only one sub-directory in S3 for bootstrap logs. Whatever that second bootstrap action was, it isn't leaving a trace in the Console or in the logging.

controller.gz contents:

2021-03-04T22:13:35.159Z INFO Fetching file 's3://elasticmapreduce/bootstrap-actions/run-if'
2021-03-04T22:13:35.468Z INFO startExec '/emr/instance-controller/lib/bootstrap-actions/1/run-if instance.isMaster=true echo running on master node'
2021-03-04T22:13:35.468Z INFO Environment:
  PATH=/usr/lib64/qt-3.3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
  SECURITY_PROPERTIES=/emr/instance-controller/lib/security.properties
  HISTCONTROL=ignoredups
  LESSOPEN=||/usr/bin/lesspipe.sh %s
  SHELL=/bin/bash
  QTINC=/usr/lib64/qt-3.3/include
  HISTSIZE=1000
  JAVA_HOME=/etc/alternatives/jre
  AWS_DEFAULT_REGION=us-east-1
  USER=hadoop
  LANG=en_US.UTF-8
  MAIL=/var/spool/mail/hadoop
  HOSTNAME=ip-10-132-42-122
  QTDIR=/usr/lib64/qt-3.3
  LOGNAME=hadoop
  PWD=/
  QTLIB=/usr/lib64/qt-3.3/lib
  HOME=/home/hadoop
  SHLVL=1
  _=/etc/alternatives/jre/bin/java
2021-03-04T22:13:35.468Z INFO redirectOutput to /emr/instance-controller/log/bootstrap-actions/1/stdout
2021-03-04T22:13:35.469Z INFO redirectError to /emr/instance-controller/log/bootstrap-actions/1/stderr
2021-03-04T22:13:35.469Z INFO Working dir /emr/instance-controller/lib/bootstrap-actions/1
2021-03-04T22:13:35.541Z INFO ProcessRunner started child process : /emr/instance-controller/lib/bootstrap-actions/...
2021-03-04T22:13:35.541Z INFO Synchronously wait child process to complete : /emr/instance-controller/lib/bootstrap-actions/...
2021-03-04T22:13:37.541Z INFO waitProcessCompletion ended with exit code 0 : /emr/instance-controller/lib/bootstrap-actions/...
2021-03-04T22:13:37.541Z INFO total process run time: 2 seconds
2021-03-04T22:13:37.541Z INFO Execution succeeded
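The controller log above can be checked mechanically for which bootstrap-action directories ever ran, since each action executes from `/emr/instance-controller/lib/bootstrap-actions/<n>/`. A quick sketch using a few of the log lines quoted above (in practice you'd read the decompressed controller.gz instead of a hard-coded string):

```python
import re

# Excerpt of the controller log quoted above; in the real case this
# would come from the decompressed controller.gz file in S3.
controller_log = """\
2021-03-04T22:13:35.468Z INFO startExec '/emr/instance-controller/lib/bootstrap-actions/1/run-if instance.isMaster=true echo running on master node'
2021-03-04T22:13:35.469Z INFO Working dir /emr/instance-controller/lib/bootstrap-actions/1
2021-03-04T22:13:37.541Z INFO Execution succeeded
"""

# Each bootstrap action runs from lib/bootstrap-actions/<index>/,
# so the indices seen here are the actions the controller actually started.
indices = set(re.findall(r"lib/bootstrap-actions/(\d+)", controller_log))
print(sorted(indices))  # → ['1']: only action 1 ran, despite the "action 2" failure
```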

There was no change after taking out the spark-env statement. I did some log diving on a back end system and found evidence pointing toward a possible issue involving EMR's interactions with the new EBS Local Snapshots feature for Outposts. Going to investigate internally and return to this ticket once I've driven that to conclusion.