Closed parano closed 3 years ago
@parano Do you want to consider https://github.com/combust/mleap for deploying spark models to production
@Sharathmk99 mleap is probably not a good default here. It would be possible to add an mleap option that lets BentoML use mleap for batch inferencing when deploying the model to a Spark cluster, although deploying a BentoService to a Spark cluster for batch inferencing is not yet supported.
Hi @parano, just commenting to indicate my interest in this as a larger project for the summer, once I am done with the integration test issues. :)
I'm just doing some starting research to understand the scope of this feature a bit more. I'm a bit new to Spark/PySpark, so I'm learning as I go... I've really wanted to learn how to use it, though, so this is quite exciting. Apologies in advance for any newbie moments. :smile:
Anyways, here are some design points that have multiple options, and could be discussed further:
- `spark.mllib` support (RDD-based) vs. `spark.ml` (Spark DataFrame-based)
  - `spark.mllib` is in maintenance mode, so support for it may be similar to TF1 support. Old plans were to deprecate the RDD API, but these plans seem to have been removed in the latest docs, so the future of the RDD API is a little unclear. `spark.mllib` and `spark.ml`... hm.
- `model.save()`/`model.load()` (as shown in the issue description) vs. PMML export (which seems to be `spark.mllib`-specific and not as well supported)
- A `PySparkDataframeAdapter` might be outside of scope... still, it would complement a `PySparkModelArtifact` quite nicely :)
- `.travis.yml` configuration that attempts to handle this.

A lot of these decisions seem like they could be simplified by prioritising the currently-recommended `spark.ml` (DataFrame) API to start. (This seems analogous to prioritising TF2 support, I think.) Then backwards compatibility could be explored later using tests (like what was done with the TF1 tests).

But supporting RDDs/`spark.mllib` might be more crucial than I realize... thoughts?
Wow, this issue has a lucky number. 😄 Any progress on this?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
**Is your feature request related to a problem? Please describe.**
Add support for Spark MLlib models in BentoML.
**Describe the solution you'd like**
Add a new model artifact class, `PySparkModelArtifact`; here is the example usage:

PySpark models can't be directly pickled, so they do not work with `PickleArtifact`. `PySparkModelArtifact` uses `SparkSession` and the PySpark model's `save` and `load` under the hood, e.g.:

Save:

Load:
Sample code based on https://github.com/ucbrise/clipper/blob/develop/containers/python/pyspark_container.py#L27
**Describe alternatives you've considered**
n/a

**Additional context**
n/a