amogkam / batch-inference-benchmarks

15 stars 5 forks source link

Consider use the new predict_batch_udf in Spark 3.4 #4

Open chenliu0831 opened 1 year ago

chenliu0831 commented 1 year ago

See https://spark.apache.org/docs/3.4.0/api/python/reference/api/pyspark.ml.functions.predict_batch_udf.html and https://developer.nvidia.com/blog/distributed-deep-learning-made-easy-with-spark-3-4/. The current implementation seems to be subject to issues below.

The predict_batch_udf introduces standardized code for:

  • Translating Spark DataFrames into NumPy arrays, so the end-user DL inferencing code does not need to convert from a Pandas DataFrame.
  • Batching the incoming NumPy arrays for the DL frameworks.
  • Model loading on the executors, which avoids any model serialization issues, while leveraging the Spark spark.python.worker.reuse configuration to cache models in the Spark executors.