amogkam / batch-inference-benchmarks


Comparison feedback #3

Open morelen17 opened 1 year ago

morelen17 commented 1 year ago

Hey @amogkam !

Thanks for the blog post! Although I found the Ray-vs-other-services results impressive, I decided to conduct my own experiments on AWS Sagemaker. After reviewing the source code and running my benchmarks, I am ready to share the results and my concerns about the comparison approach.


Concerns:

  1. You compared the performance of Ray on parquet data (batched reading, preprocessing, inference) and Sagemaker Batch Transform on image data (single image per request [x4 instance count]).
  2. For Ray you computed script execution time (source code), while for Sagemaker you measured the whole Batch Transform job time (last two cells in the corresponding notebook), which includes instance provisioning, docker image pull, etc. (see the timing sketch after this list).
  3. No cost comparison has been carried out. In Sagemaker, 4 x ml.g4dn.xlarge are ~40% cheaper than 1 x ml.g4dn.12xlarge (4 x $0.736 = $2.944 vs $4.89 per hour, compute only).
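On point 2, one way to put the two measurements on a comparable footing is to derive both from the Batch Transform job's own timestamps. A minimal sketch using boto3's DescribeTransformJob; the job name and image count are placeholders, and treating the creation-to-end window as the "billed time" in the table below is my assumption:

```python
# Sketch only: compute both timings from a finished Batch Transform job's
# own timestamps. Job name and image count below are placeholders.
import boto3

sm = boto3.client("sagemaker")
job = sm.describe_transform_job(TransformJobName="resnet-batch-transform")  # placeholder

# Whole-job wall clock: includes instance provisioning, docker image pull, etc.
whole_job_s = (job["TransformEndTime"] - job["CreationTime"]).total_seconds()
# Transform-only window: closer to Ray's "script execution time".
transform_s = (job["TransformEndTime"] - job["TransformStartTime"]).total_seconds()

n_images = 16_232  # placeholder: number of images in the 10GB dataset
print(f"whole job: {n_images / whole_job_s:.2f} img/s")
print(f"transform only: {n_images / transform_s:.2f} img/s")
```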
Benchmark results (10GB dataset):

| Service | Job type | Data | Settings | Throughput (billed time), img/s | Throughput (script time), img/s | Price, $ |
| --- | --- | --- | --- | --- | --- | --- |
| Ray | Sagemaker Processing job | 16 x 190 MB parquet files | Same as in source code | 50.26 | 101.11 | 0.44 |
| Sagemaker | Sagemaker Batch Transform job | 120 x ~25.3 MB parquet files | max_concurrent_transforms=2, max_payload=50, the rest as in the notebook | 58.69 | - | 0.23 |
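
For reference, here is roughly how the Settings column of the second row maps onto the Sagemaker Python SDK. A minimal sketch, not the exact benchmark code: the model name, S3 paths, and content type are assumptions:

```python
# Sketch of a Batch Transform job matching the settings in the second row.
from sagemaker.transformer import Transformer

transformer = Transformer(
    model_name="resnet50-model",                     # placeholder registered model
    instance_count=4,
    instance_type="ml.g4dn.xlarge",
    max_concurrent_transforms=2,
    max_payload=50,                                  # max request payload, in MB
    output_path="s3://my-bucket/transform-output/",  # placeholder
)
transformer.transform(
    data="s3://my-bucket/parquet-shards/",  # 120 x ~25.3 MB parquet files
    content_type="application/x-parquet",   # whatever the serving container accepts
    split_type=None,                        # send each parquet file whole
)
transformer.wait()
```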

Would love to discuss my results! Please feel free to point out if I missed something or if I'm wrong about anything, and ping me if you need any clarifications from my side. Thank you!

rbavery commented 2 weeks ago

Bump! Curious if these concerns have been looked at. The results were recently presented at Ray Summit 2024, but these concerns were not discussed in the presentation, only the original performance numbers.

amogkam commented 1 week ago

Thanks for bumping this, and thank you for the thoughtful initial post.

Addressing the concerns:

  1. I haven't had a chance to try out the patch, but indeed, if that change fixes the issue with reading parquet files in Sagemaker, then it should be used for the benchmark.
  2. That is a good callout. Cluster startup time on Anyscale for the Ray benchmark and on Databricks for the Spark benchmark should be included. Note that the first row in the linked results table is Ray on Sagemaker, which is not one of the configurations reported in the original post; what the blog post reports is Ray on Anyscale.
  3. That’s right, there was no cost comparison in the blog post. But it can be calculated from the instance pricing in the region you want and the total job completion time (a back-of-the-envelope sketch follows this list).
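
For example, a back-of-the-envelope version of that calculation; the hourly rates are the ones quoted in the first comment, and the durations are placeholders, not measurements:

```python
# On-demand cost = hourly rate x instance count x hours. Rates are the ones
# quoted in the first comment; the 20-minute durations are placeholders.
def job_cost(hourly_rate_usd: float, instance_count: int, duration_s: float) -> float:
    return hourly_rate_usd * instance_count * duration_s / 3600.0

print(f"${job_cost(0.736, 4, 1200):.2f}")  # 4 x ml.g4dn.xlarge for 20 min -> $0.98
print(f"${job_cost(4.89, 1, 1200):.2f}")   # 1 x ml.g4dn.12xlarge for 20 min -> $1.63
```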

With these changes, the exact benchmark numbers will be different, but I don’t expect any major changes in the overall trends/takeaways. For fully updated numbers, it would probably be best to re-run all the benchmarks with the more recent updates from the past year.

rbavery commented 1 week ago

Thanks @amogkam !

I do think the 17x Sagemaker performance multiple might fall quite steeply once cluster startup and docker pull are included. Curious whether plans come up for Anyscale to rerun these; I'd love to see the results.