
spark-experiments

  1. Install awscli and set up AWS credentials:

sudo pip install awscli
aws configure

  2. Install eksctl:

curl --silent --location "https://github.com/weaveworks/eksctl/releases/download/latest_release/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin

  3. Install Spark locally (PySpark currently requires a not-yet-released Spark version, so it needs to be built from source):

git clone https://github.com/apache/spark
cd spark && ./build/mvn -DskipTests=true -Pkubernetes package
cd python && python setup.py sdist && sudo pip install dist/*.tar.gz
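
As an optional sanity check that the local PySpark build installed correctly:

```bash
python -c "import pyspark; print(pyspark.__version__)"
```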

  4. Create the EKS cluster:

eksctl create cluster --name=computable-spark-test --nodes=4 --kubeconfig=./kubeconfig.spark-test.yaml --node-ami=auto
export KUBECONFIG=$(pwd)/kubeconfig.spark-test.yaml

  5. Verify the Kubernetes cluster is up and connectable:

kubectl get nodes

  6. Update core-site.xml with your AWS access key ID and secret access key (see the example below)
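
The repo's core-site.xml isn't reproduced here, but the S3A credential properties it needs look like the following (placeholder values shown):

```xml
<configuration>
  <!-- Credentials used by the s3a:// filesystem; replace the placeholders. -->
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_AWS_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
  </property>
</configuration>
```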

  7. Run grant-api-role.sh to allow the default service account to launch additional pods for Spark (a minimal equivalent is sketched below)
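
The script itself lives in this repo; a minimal equivalent grant (the binding name spark-default-edit is just an example) would be:

```bash
# Give the default service account in the default namespace edit rights,
# so the Spark driver can create executor pods.
kubectl create clusterrolebinding spark-default-edit \
  --clusterrole=edit \
  --serviceaccount=default:default
```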

  8. Update the Spark job script sql.py to run the desired query (a sketch is shown below)
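
A minimal sketch of what sql.py might contain (the actual query, input path, and data format in the repo's script may differ):

```python
from pyspark.sql import SparkSession

# spark-submit supplies the Kubernetes master and other conf at launch time.
spark = SparkSession.builder.appName("spark-experiments-sql").getOrCreate()

# Load a dataset from S3 (placeholder path) and expose it to SQL.
df = spark.read.csv("s3a://computable-spark/data.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("data")

# Run the desired query; show() writes the result into the driver log.
spark.sql("SELECT COUNT(*) AS row_count FROM data").show()

spark.stop()
```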

  9. Push the sql.py script to S3:

aws s3 cp sql.py --acl public-read s3://computable-spark/sql.py

  10. Get the K8s master URL from kubeconfig.spark-test.yaml (clusters -> server) and set it as the KUBE_MASTER env variable:

export KUBE_MASTER=k8s://https://xxxxxxxx.amazonaws.com
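
With KUBECONFIG exported as above, one way to print the server URL instead of reading the YAML by hand:

```bash
kubectl config view --minify --output jsonpath='{.clusters[0].cluster.server}'
```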

  11. Submit the Spark job to run the query against S3:

HADOOP_CONF_DIR=$(pwd) spark-submit --deploy-mode cluster --master $KUBE_MASTER --conf spark.kubernetes.container.image=tnachen/spark-py:latest2 s3a://computable-spark/sql.py
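
While the job runs, the driver and executor pods can be watched as they start and terminate:

```bash
kubectl get pods --watch
```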

  12. After the Spark job completes, check the Kubernetes driver log for the results.

Find the name of the most recently completed driver pod:

kubectl get pods

Then fetch its logs:

kubectl logs <driver-pod-name>
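
Spark on Kubernetes labels its pods, so driver pods can also be listed directly by label:

```bash
kubectl get pods -l spark-role=driver
```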