coiled / 1trc

BSD 3-Clause "New" or "Revised" License
74 stars 2 forks source link

SparkSQL on Amazon Elastic MapReduce (EMR) #5

Open SamWheating opened 4 months ago

SamWheating commented 4 months ago

Hey there, thanks for putting this together.

As another data point, I've written a SparkSQL implementation of this challenge - see https://github.com/SamWheating/1trc. I've included all of the steps for building and running the job locally as well as submitting it to EMR.

I've included a script for submitting the job to EMR (running Spark 3.4.1), and was able to verify the results like so:

> python scripts/emr_submit.py 
(wait until the job completes)
> aws s3 cp s3://samwheating-1trc/output/part-00000-3cbfb625-d602-4663-8cbe-2007d8f5458a-c000.csv - | head -n 10
Abha,83.4,-43.8,18.000157677827257
Abidjan,87.8,-34.4,25.99981470248089
Abéché,92.1,-33.6,29.400132310152458
Accra,86.7,-34.1,26.400022720245452
Addis Ababa,79.6,-58.0,16.000132105603647
Adelaide,76.8,-45.1,17.29980559679163
Aden,93.2,-33.1,29.100014447157122
Ahvaz,86.6,-37.5,25.3998722446478
Albuquerque,77.3,-54.7,13.99982215316457
Alexandra,72.6,-47.6,11.000355765170788

Overall, the results are pretty comparable to the dask results in your blog post - running on 32 m6i.xlarge instances this job completed in 32 minutes (incl. provisioning) for a total cost of ~$2.27 on spot instances:

32 machines * 32 minutes * ($0.085 spot rate + 0.048 EMR premium)/machine-hr * 1hr/60min = $2.2698

With larger/more machines this would probably be proportionally faster as this job is almost entirely parallelizable.

I haven't really spent much time optimizing / profiling this job, but figured this was an interesting starting point.

With some more time, I think it would be interesting to try:

mrocklin commented 4 months ago

Very cool. Thanks @SamWheating for sharing!

My sense is that a good next step on this effort is to collect a few of these results together and do a larger comparison across projects. Having SparkSQL in that comparison seems pretty critical. Thanks for your work here!