SamWheating opened this issue 8 months ago (Open)
Very cool. Thanks @SamWheating for sharing!
My sense is that a good next step on this effort is to collect a few of these results together and do a larger comparison across projects. Having SparkSQL in that comparison seems pretty critical. Thanks for your work here!
Hey there, thanks for putting this together.
As another data point, I've written a SparkSQL implementation of this challenge - see https://github.com/SamWheating/1trc. I've included all of the steps for building and running the job locally as well as submitting it to EMR.
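For anyone skimming, the job boils down to a single group-by aggregation over the input Parquet files. A minimal sketch of that shape (with a placeholder S3 path and assumed `station` / `measure` column names, which won't necessarily match the repo exactly) looks something like:

```python
# Minimal sketch of the general shape of the job -- not the exact code from
# the linked repo. The S3 prefix and the `station` / `measure` column names
# are placeholders; adjust them to match the actual 1trc dataset layout.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("1trc-sparksql").getOrCreate()

# Register the raw measurements as a temp view so the aggregation can be
# expressed in plain SQL.
spark.read.parquet("s3://example-bucket/1trc/").createOrReplaceTempView("measurements")

results = spark.sql("""
    SELECT
        station,
        MIN(measure) AS min_measure,
        AVG(measure) AS mean_measure,
        MAX(measure) AS max_measure
    FROM measurements
    GROUP BY station
""")

results.write.mode("overwrite").parquet("s3://example-bucket/1trc-results/")
```

Locally, a sketch like this can be run with something along the lines of `spark-submit --master 'local[*]' job.py` (filename hypothetical).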
I've included a script for submitting the job to EMR (running Spark 3.4.1), and was able to verify the results.
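The check essentially amounts to reading the aggregated output back and spot-checking it; roughly like this (the output path is a placeholder for wherever the job wrote its results):

```python
# Rough sketch of the kind of spot-check described above; the output path is
# a placeholder, not the repo's actual layout.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("1trc-verify").getOrCreate()

results = spark.read.parquet("s3://example-bucket/1trc-results/")
print(results.count())                 # expect one row per station
results.orderBy("station").show(10, truncate=False)
```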
Overall, the results are pretty comparable to the Dask results in your blog post: running on 32 m6i.xlarge instances, this job completed in 32 minutes (including provisioning) for a total cost of ~$2.27 on spot instances.
With larger or more machines this would probably be proportionally faster, since the job is almost entirely parallelizable.
I haven't really spent much time optimizing / profiling this job, but figured this was an interesting starting point.
With some more time, I think it would be interesting to try:

- re-running with an increased value of `spark.sql.files.maxPartitionBytes` in order to reduce scheduling/task overhead (sketched below).
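To illustrate the idea (untested, and the value is an arbitrary starting point rather than a recommendation):

```python
# Illustrative only: raising spark.sql.files.maxPartitionBytes (128MB by
# default) so each task scans more input and the driver schedules fewer,
# larger tasks. The 512m value is an arbitrary example.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("1trc-sparksql")
    .config("spark.sql.files.maxPartitionBytes", "512m")
    .getOrCreate()
)
```

The same setting can also be passed at submit time via `--conf spark.sql.files.maxPartitionBytes=512m` on `spark-submit`.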
Anyways, let me know what you think, or if you've got other suggestions for improving this.