lossyrob opened this issue 7 years ago
@azavea/operations relevant to #1 and might be useful to track.
For discussion, I think that it would be good to break out the EMR deployment tasks for this into three parts:
Looking at the current state of the `aws_emr_cluster` Terraform resource, I think it now supports most of the features we'd need to make this happen (a path for managing the EMR managed security groups safely, spot pricing, application configurations, etc.). The notable missing features are spot fleet and auto scaling support (the latter may be indirectly possible via application autoscaling; at least `aws_emr_cluster` has support for supplying an autoscaling role).
For the other tasks, I think we can either get them for free out of the tasks above (tear down cluster, create a read-only HBase on S3 cluster), or they can be separate tasks that layer on top of the base EMR work (add dependencies necessary for ORC ingest, support running it via the Step API, install Zeppelin when in read-only mode).
Thanks for the pointers.
We've been using EMR with the latest AWS plugin for terraform - it requires a custom build though until the next release happens. We need the latest because it introduces spot instances.
What would be your recommended way of interacting with EMR in various tasks? We have traditionally used Makefiles for this type of work, e.g. I want to bring up an ingest cluster, set up a proxy to it, have a convenient way to SSH into the master, then terminate it; same for the Zeppelin cluster. To me, Makefiles are a nice way to drive this type of work, where there are repetitive CLI tasks to perform with some information that might need to be parsed from a file. I also know Azavea ops doesn't utilize Makefiles. Do you have a suggested alternative for that type of interaction?
> We've been using EMR with the latest AWS plugin for terraform - it requires a custom build though until the next release happens. We need the latest because it introduces spot instances.
I believe that this was released in the 1.0.0 release of the Terraform AWS provider. Making use of this provider requires a Terraform 0.10.x project.
I created https://github.com/azavea/operations/issues/128 for the EMR Terraform module and added it to our current sprint.
> What would be your recommended way of interacting with EMR in various tasks? We have traditionally used Makefiles for this type of work, e.g. I want to bring up an ingest cluster, set up a proxy to it, have a convenient way to SSH into the master, then terminate it; same for the Zeppelin cluster. To me, Makefiles are a nice way to drive this type of work, where there are repetitive CLI tasks to perform with some information that might need to be parsed from a file. I also know Azavea ops doesn't utilize Makefiles. Do you have a suggested alternative for that type of interaction?
Historically, this level of automation ends up happening in Bash scripts or Python-based CLIs: the former for less sophisticated setups (driving the AWS CLI with variables) and the latter for more sophisticated ones (using Boto, waiting for a success state, adding error handling, etc.). I'm not opposed to Makefiles; it is just my personal opinion that the farther you get from local, file-based interactions, the less you play to `make`'s strengths. It is also inconsistent with Bash in ways that can sometimes lead down rabbit holes.
For this, I feel like anything that can drive Terraform and read from its outputs would be good to use. The core resource emits `id` and `master_public_dns`, which we'll make sure the higher level Terraform module emits as well.
If the EMR resource works the way it is advertised, we should be able to launch and bootstrap an EMR cluster from Terraform. From there, another tool can retrieve the cluster ID and master FQDN to do whatever is necessary. The AWS CLI can help establish a SOCKS tunnel provided it has the cluster ID and private key. Destruction can occur via Terraform as well.
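As a rough illustration of that glue layer, something like the following could read the Terraform outputs and hand them to the AWS CLI. The output names and key path here are assumptions about what the eventual module will expose, not its actual interface:

# Assumes the Terraform module exposes outputs named "emr_cluster_id" and
# "emr_master_public_dns", and that the matching EC2 key pair lives at
# ~/.ssh/osmesa.pem (all of these names are assumptions).
cd deployment/terraform

CLUSTER_ID="$(terraform output emr_cluster_id)"
MASTER_DNS="$(terraform output emr_master_public_dns)"
echo "EMR master: ${MASTER_DNS}"

# Open a SOCKS proxy to the master for the web UIs (YARN, Zeppelin, etc.).
aws emr socks --cluster-id "${CLUSTER_ID}" --key-pair-file ~/.ssh/osmesa.pem &

# Convenience SSH session on the master node.
aws emr ssh --cluster-id "${CLUSTER_ID}" --key-pair-file ~/.ssh/osmesa.pem

Tear-down would then just be a `terraform destroy` from the same directory, as noted above.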
Moving comments I made on another PR to this issue:
@hectcastro thanks, I'll look to that PR for implementation.
Do you have any guidance on separating out various components of the infrastructure, to be brought up and down independently? My current approach is to separate out the Terraform files, e.g.
deployment/terraform/analytics/*.tf
deployment/terraform/ingest/*.tf
deployment/terraform/update/*.tf
... etc.
and then have separate scripts (scripts/analytics.sh, etc.) to deploy specific aspects, as opposed to a single infra.sh.
Also, I want to allow the option of having local tfvars files, so that you don't necessarily have to upload and download them during development. Currently the approach is to have this in the script:
# If a settings bucket is set, pull the shared tfvars from S3;
# otherwise fall back to a local tfvars file.
if [[ -n "${OSMESA_SETTINGS_BUCKET}" ]]; then
    aws s3 cp "s3://${OSMESA_SETTINGS_BUCKET}/terraform/osmesa/terraform.tfvars" \
        "${TERRAFORM_DIR}/${OSMESA_SETTINGS_BUCKET}.tfvars"
    TERRAFORM_SETTINGS="${OSMESA_SETTINGS_BUCKET}.tfvars"
else
    TERRAFORM_SETTINGS="${PWD}/deployment/terraform/terraform.tfvars"
fi
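The selected var file would then presumably be passed to Terraform when planning and applying, along these lines (a sketch, assuming the script has already changed into ${TERRAFORM_DIR}):

terraform plan -var-file="${TERRAFORM_SETTINGS}" -out="osmesa.tfplan"
terraform apply "osmesa.tfplan"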
If all this sounds sane, I'll move forward with it, unless you have suggestions on a better approach. Thanks
Looks like the rasterfoundry deployment uses a single infra script and uses arguments to differentiate components. I'll mimic this.
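A minimal sketch of that single-script approach, assuming each component keeps its Terraform files in its own directory as described above; the argument handling is hypothetical (positional arguments rather than whatever flags rasterfoundry actually uses), and TERRAFORM_SETTINGS is carried over from the earlier snippet:

# Usage (hypothetical): infra.sh <component> <action>
COMPONENT="${1}"        # e.g. analytics | ingest | update
ACTION="${2:-plan}"     # plan | apply | destroy

TERRAFORM_DIR="deployment/terraform/${COMPONENT}"

# Run the requested Terraform action from the component's directory.
pushd "${TERRAFORM_DIR}" > /dev/null
terraform "${ACTION}" -var-file="${TERRAFORM_SETTINGS}"
popd > /dev/null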
Develop the necessary Terraform scripts to bring up an OSMesa stack for the Ingest and Global Analytics - Zeppelin Notebook components, with ingest jobs submitted to the cluster via the Step API (`aws emr add-steps`).
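For reference, submitting an ingest job through the Step API could look roughly like the following; the jar location, main class, and ORC path are placeholders for the actual OSMesa artifacts, and only the `aws emr add-steps` / `aws emr wait` invocations themselves reflect the real CLI:

# Submit a Spark step to the running cluster and capture its step ID
# (jar, class, and S3 paths below are hypothetical examples).
STEP_ID="$(aws emr add-steps \
    --cluster-id "${CLUSTER_ID}" \
    --steps Type=Spark,Name=OSMesaIngest,ActionOnFailure=CONTINUE,Args=[--class,osmesa.ingest.Ingest,s3://osmesa-example/jars/ingest-assembly.jar,s3://osmesa-example/planet-latest.orc] \
    --query 'StepIds[0]' --output text)"

# Block until the step finishes (or fails).
aws emr wait step-complete --cluster-id "${CLUSTER_ID}" --step-id "${STEP_ID}"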