drei34 opened 1 year ago
I am using the above and did `pip install sparkxgb`. Now it seems like I'm having some success, but deserializing is not working. I ran the code below and serialization seems OK, but I wonder why serialization would work while deserialization does not.
Error:

```
Py4JJavaError: An error occurred while calling o467.deserializeFromBundle.
: java.lang.NoClassDefFoundError: resource/package$
	at ml.dmlc.xgboost4j.scala.spark.mleap.XGBoostClassificationModelOp$$anon$1.load(XGBoostClassificationModelOp.scala:48)
	at ml.dmlc.xgboost4j.scala.spark.mleap.XGBoostClassificationModelOp$$anon$1.load(XGBoostClassificationModelOp.scala:20)
	at ml.combust.bundle.serializer.ModelSerializer.$anonfun$readWithModel$2(ModelSerializer.scala:106)
	at scala.util.Success.$anonfun$map$1(Try.scala:255)
	at scala.util.Success.map(Try.scala:213)
```
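As a diagnostic aside: `NoClassDefFoundError: resource/package$` means the JVM could not find the compiled class file for the Scala object `resource.package` (which compiles to an entry named `resource/package$.class`). Since a jar is just a zip archive, you can check which jar on `spark.jars` actually ships that entry with a small helper like the sketch below (the helper name is my own, not part of any library):

```python
import zipfile

def jar_contains(jar_path: str, class_entry: str) -> bool:
    # A jar is a zip archive; a Scala object `resource.package` appears
    # inside it as the entry "resource/package$.class".
    with zipfile.ZipFile(jar_path) as jar:
        return class_entry in jar.namelist()

# Example usage against the jars from the snippet below (paths assumed):
# jar_contains("/usr/lib/spark/jars/mleap-xgboost-runtime_2.12-0.22.0.jar",
#              "resource/package$.class")
```

If none of the jars on the classpath contains the entry, the error is a packaging/version mismatch rather than a code bug.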
```python
import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer
import pyspark
import pyspark.ml
from pyspark.sql import SparkSession
from pyspark.context import SparkContext
from pyspark.sql import HiveContext
import sparkxgb
from sparkxgb import XGBoostClassifier

sc = SparkContext()
sqlContext = HiveContext(sc)
spark = SparkSession.builder.config(
    "spark.jars",
    "/usr/lib/spark/jars/xgboost4j-1.7.3.jar,/usr/lib/spark/jars/mleap-xgboost-runtime_2.12-0.22.0.jar,/usr/lib/spark/jars/mleap-xgboost-spark_2.12-0.22.0.jar,/usr/lib/spark/jars/bundle-ml_2.12-0.22.0.jar"
).enableHiveSupport().getOrCreate()
spark.sparkContext.addPyFile("/usr/lib/spark/jars/pyspark-xgboost.zip")

features = ['dayOfWeek', 'hour', 'channel', 'platform', 'deviceType', 'adUnit', 'pageType', 'zip', 'advertiserId']
df = spark.createDataFrame(
    [(7, 20, 3, 6, 1, 10, 3, 53948, 245351, 1),
     (7, 20, 3, 6, 1, 10, 3, 53948, 245351, 1)],
    features + ['label'])

xgboost = sparkxgb.XGBoostClassifier(
    featuresCol="features",
    labelCol="label")

pipeline = pyspark.ml.Pipeline(
    stages=[
        pyspark.ml.feature.VectorAssembler(inputCols=features, outputCol="features"),
        xgboost
    ]
)

model = pipeline.fit(df)
predictions = model.transform(df)

local_path = "jar:file:/tmp/pyspark.example.zip"
model.serializeToBundle(local_path, predictions)
deserialized_model = pyspark.ml.PipelineModel.deserializeFromBundle(local_path)
deserialized_model.transform(df).show()

sc.stop()
```
Hmm, I'm not too familiar with the sparkxgb package. Is there a reason you aren't using the built-in PySpark bindings for XGBoost instead? They were just added in v1.7, so a lot of online guides probably still refer to other approaches.

I don't know if that would impact the error being generated, though. It seems to be complaining about an mleap class not being present in the JVM.
Oh I see. Can this model be used inside of a bundle? (I think I got an error, but I could be wrong.) I.e., should I be able to use it as above and serialize/deserialize?

I mean, if I have a pipeline with such a model in it, I get the error below:
```
AttributeError: 'SparkXGBClassifierModel' object has no attribute '_to_java'
```
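A plausible reading of this error: the mleap serializer walks the pipeline stages and calls each stage's `_to_java()` to obtain a JVM object to serialize, and a pure-Python stage (like the `xgboost.spark` estimators, which have no JVM counterpart) simply lacks that method. A minimal sketch with hypothetical stand-in classes, not the actual pyspark/mleap source:

```python
# Hypothetical stand-ins: JavaBackedStage mimics a JVM-backed pyspark stage,
# PurePythonStage mimics SparkXGBClassifierModel (pure Python, no JVM handle).
class JavaBackedStage:
    def _to_java(self):
        return "jvm-handle"

class PurePythonStage:
    pass

def serialize_stages(stages):
    # mleap-style walk: ask every stage for its JVM object
    return [stage._to_java() for stage in stages]

serialize_stages([JavaBackedStage()])  # fine
try:
    serialize_stages([JavaBackedStage(), PurePythonStage()])
except AttributeError as err:
    print(err)  # 'PurePythonStage' object has no attribute '_to_java'
```

This is why the failure moves from the JVM (the earlier `NoClassDefFoundError`) to Python: serialization never reaches the JVM at all for a stage without `_to_java`.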
Hmm, @WeichenXu123 any thoughts on this, since I know you added the PySpark bindings in XGBoost?

To be clear, this is the code I ran:
```python
import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer
import pyspark
import pyspark.ml
from pyspark.sql import SparkSession
from pyspark.context import SparkContext
from pyspark.sql import HiveContext
from xgboost.spark import SparkXGBClassifier

sc = SparkContext()
sqlContext = HiveContext(sc)
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

features = ['dayOfWeek', 'hour', 'channel', 'platform', 'deviceType', 'adUnit', 'pageType', 'zip', 'advertiserId']
df = spark.createDataFrame(
    [(7, 20, 3, 6, 1, 10, 3, 53948, 245351, 1),
     (7, 20, 3, 6, 1, 10, 3, 53948, 245351, 1)],
    features + ['label'])

xgboost = SparkXGBClassifier(
    features_col="features",
    label_col="label",
    num_workers=2
)

pipeline = pyspark.ml.Pipeline(
    stages=[
        pyspark.ml.feature.VectorAssembler(inputCols=features, outputCol="features"),
        xgboost
    ]
)

model = pipeline.fit(df)
predictions = model.transform(df)

local_path = "jar:file:/tmp/pyspark.example.zip"
model.serializeToBundle(local_path, predictions)
deserialized_model = pyspark.ml.PipelineModel.deserializeFromBundle(local_path)

deserialized_model.stages[-1].set(deserialized_model.stages[-1].missing, 0.0)
deserialized_model.transform(df).show()

sc.stop()
```
Could you try mleap version 0.20? I remember a similar issue happening with versions > 0.20.
Thanks so much for this. Give me a few days to try this and circle back. We actually got it working on 0.22 after changing some random files (wrapper.py in pyspark code). But I'd be interested to see if this works ...
Thanks for your investigation! Would you mind filing a PR containing the changes that made it work? That would help us fix it.
I don't have a PR, but I can tell you some specifics. Basically, we needed to wrap a bunch of jars into one and run the commands below. This involves installing Java 11 to make 0.22 work, but we also needed to change some things inside Spark-related files. Specifically, see the change to `wrapper.py` needed to make xgboost 1.7.3 work, and the change to `spark-env.sh` to make the right Java be used. The jars I am using are a wrap-up of the mleap jars found on Maven. The sparkxgb zip is one I found online; it wraps xgboost so that it can be used in PySpark.
```shell
sudo apt-get -y install openjdk-11-jdk
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
echo "export JAVA_HOME=\"/usr/lib/jvm/java-11-openjdk-amd64\"" >> $SPARK_HOME/conf/spark-env.sh

export PYSPARK_PYTHON=/opt/conda/miniconda3/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/conda/miniconda3/bin/python

pip install mleap==0.22.0

sudo sed -i.bak 's/spark.dynamicAllocation.enabled=true/spark.dynamicAllocation.enabled=false/' $SPARK_HOME/conf/spark-defaults.conf
sudo sed -i.bak '/.replace(\"org.apache.spark\", \"pyspark\")/s/$/.replace(\"ml.dmlc.xgboost4j.scala.spark\", \"sparkxgb.xgboost\")/' $SPARK_HOME/python/pyspark/ml/wrapper.py

sudo gsutil cp gs://filesForDataproc2.0/jarsAndZip/* $SPARK_HOME/jars/
```
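For readers puzzling over that second `sed` command: it appends an extra `.replace(...)` to the JVM-to-Python class-name mapping in pyspark's `wrapper.py`. The effect can be sketched in plain Python (this illustrates the string mapping only, not the actual pyspark source):

```python
def java_class_to_python(java_class: str) -> str:
    # pyspark derives the Python wrapper class from the JVM class name by
    # string replacement; the sed edit chains on a second mapping so the
    # xgboost4j Scala classes resolve to the sparkxgb Python wrappers.
    return (java_class
            .replace("org.apache.spark", "pyspark")
            .replace("ml.dmlc.xgboost4j.scala.spark", "sparkxgb.xgboost"))

print(java_class_to_python("ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel"))
# sparkxgb.xgboost.XGBoostClassificationModel
```

Without the added mapping, deserialization would try to import a Python class under `ml.dmlc.xgboost4j.scala.spark...`, which does not exist on the Python side.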
Hi, what is the best way to use XGBoost with mleap==0.22.0? Specifically, I'd like to make a pipeline and then serialize it into a bundle. Is there a demo somewhere on how to do this? @jsleight