Script how to deploy Palisade on AWS EMR cluster

Deploying Palisade on AWS EMR (in the Ireland region) is expected to be done via a Terraform template that spins up a 4 node EMR cluster with Hadoop, Zookeeper, JupyterHub (to provide an interactive interface to run the queries), Ganglia (To monitor resource usage), Spark (to run the spark client when that issue is complete). It is expected that by default there would be a data service running on each of the core nodes and one of each of the rest of the services would be running on the master node, but with a easily configurable setting to increase the number of each service running. The process flow of running the AWS deployment example is as follows:

user supplies KeyPair and other inputs when running a bash script
Bash script will trigger the Terraform script, which builds the EMR cluster and deploys the relevant services and applies the required configuration. It will also run a job to generate a 3 million employee dataset split over 3 files (one per core node)
Then another bash script will run the map reduce example recording the output data into a tmp file and also logging some statistics about the time taken to process each file.

gchq / Palisade

Script how to deploy Palisade on AWS EMR cluster #114