SamWheating opened this issue 8 months ago (Open)
Very cool. Thanks @SamWheating for sharing!
My sense is that a good next step on this effort is to collect a few of these results together and do a larger comparison across projects. Having SparkSQL in that comparison seems pretty critical. Thanks for your work here!
Hey there, thanks for putting this together.
As another data point, I've written a SparkSQL implementation of this challenge - see https://github.com/SamWheating/1trc. I've included all of the steps for building and running the job locally as well as submitting it to EMR.
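For anyone skimming, the job boils down to a single group-by aggregation over the input Parquet files. A minimal sketch of that shape (with a placeholder S3 path and assumed `station` / `measure` column names, which won't necessarily match the repo exactly) looks something like:

```python
# Minimal sketch of the general shape of the job -- not the exact code from
# the linked repo. The S3 prefix and the `station` / `measure` column names
# are placeholders; adjust them to match the actual 1trc dataset layout.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("1trc-sparksql").getOrCreate()

# Register the raw measurements as a temp view so the aggregation can be
# expressed in plain SQL.
spark.read.parquet("s3://example-bucket/1trc/").createOrReplaceTempView("measurements")

results = spark.sql("""
    SELECT
        station,
        MIN(measure) AS min_measure,
        AVG(measure) AS mean_measure,
        MAX(measure) AS max_measure
    FROM measurements
    GROUP BY station
""")

results.write.mode("overwrite").parquet("s3://example-bucket/1trc-results/")
```

Locally, a sketch like this can be run with something along the lines of `spark-submit --master 'local[*]' job.py` (filename hypothetical).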
I've included a script for submitting the job to EMR (running Spark 3.4.1), and was able to verify the results.
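The check essentially amounts to reading the aggregated output back and spot-checking it; roughly like this (the output path is a placeholder for wherever the job wrote its results):

```python
# Rough sketch of the kind of spot-check described above; the output path is
# a placeholder, not the repo's actual layout.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("1trc-verify").getOrCreate()

results = spark.read.parquet("s3://example-bucket/1trc-results/")
print(results.count())                 # expect one row per station
results.orderBy("station").show(10, truncate=False)
```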
Overall, the results are pretty comparable to the Dask results in your blog post: running on 32 m6i.xlarge instances, this job completed in 32 minutes (including provisioning) for a total cost of ~$2.27 on spot instances.
With larger or more machines this would probably be proportionally faster, since the job is almost entirely parallelizable.
I haven't really spent much time optimizing / profiling this job, but figured this was an interesting starting point.
With some more time, I think it would be interesting to try:

- re-running with an increased value of `spark.sql.files.maxPartitionBytes` in order to reduce scheduling/task overhead (sketched below).
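To illustrate the idea (untested, and the value is an arbitrary starting point rather than a recommendation):

```python
# Illustrative only: raising spark.sql.files.maxPartitionBytes (128MB by
# default) so each task scans more input and the driver schedules fewer,
# larger tasks. The 512m value is an arbitrary example.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("1trc-sparksql")
    .config("spark.sql.files.maxPartitionBytes", "512m")
    .getOrCreate()
)
```

The same setting can also be passed at submit time via `--conf spark.sql.files.maxPartitionBytes=512m` on `spark-submit`.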
Anyways, let me know what you think, or if you've got other suggestions for improving this.