autodeployai / pmml4s-spark

PMML scoring library for Spark as SparkML Transformer
Apache License 2.0

transform dataframe got exception in scala-spark #2

Closed: yuehanlyu closed this issue 3 years ago

yuehanlyu commented 3 years ago

Hi! I'm trying to score a DataFrame with a PMML model on AWS EMR, but I get an error that is hard to trace. Any help would be appreciated.

Read the dataframe: val df = spark.read.parquet("myDataframe.parquet")

df.show() works fine.

Read the pmml model: val model = ScoreModel.fromFile("myxgboostModel.pmml")

Then run the following code: model.transform(df).show()

This fails with the following error: java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD

Environment: Spark 2.4.4, Zeppelin 0.8.2.

However, I did not hit this error when running the same code from IntelliJ on my MacBook.

scorebot commented 3 years ago

@yuehanlyu I cannot reproduce your issue, so it could be an environment problem. I searched for the error message and found this: https://stackoverflow.com/questions/39953245/how-to-fix-java-lang-classcastexception-cannot-assign-instance-of-scala-collect

It seems the error could be caused by jars missing on the Spark worker nodes. Please try adding the following related jars:

pmml4s-spark_2.11-0.9.7.jar
pmml4s_2.11-0.9.7.jar
commons-text-1.6.jar
spray-json_2.11-1.3.5.jar

You can get those jars from here: https://github.com/autodeployai/pypmml-spark/tree/master/pypmml_spark/jars
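For reference, here is a minimal sketch of one way to ship those jars to the worker nodes, by setting Spark's `spark.jars` property when the session is created. The directory `/home/hadoop/jars` and the app name are assumptions for illustration, not something from this thread. Note that `spark.jars` only takes effect when the session is first created; in Zeppelin or spark-shell, where a session already exists, the same property would have to be set in the interpreter settings or passed via `--jars` instead.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical directory holding the downloaded jars; adjust to your cluster.
val jarDir = "/home/hadoop/jars"

// The jars listed above.
val jarNames = Seq(
  "pmml4s-spark_2.11-0.9.7.jar",
  "pmml4s_2.11-0.9.7.jar",
  "commons-text-1.6.jar",
  "spray-json_2.11-1.3.5.jar"
)

// spark.jars expects a comma-separated list of jar paths.
val sparkJars = jarNames.map(name => s"$jarDir/$name").mkString(",")

// The jars are distributed to the driver and executors at session start-up.
val spark = SparkSession.builder()
  .appName("pmml-scoring") // hypothetical app name
  .config("spark.jars", sparkJars)
  .getOrCreate()

// Print the jars registered with the SparkContext to confirm they were picked up.
spark.sparkContext.listJars().foreach(println)
```

If the four jars do not show up in the printed list, the session was most likely created before the property was applied.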

yuehanlyu commented 3 years ago

@scorebot, thanks for your help!

I changed the environment to Spark 2.4.0 and uploaded three jars:

pmml4s-spark_2.11-0.9.7.jar
pmml4s_2.11-0.9.7.jar
spray-json_2.11-1.3.5.jar

Then the transform function works! P.S. Spark 2.4.4 with those jars did not work.
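The thread does not establish why 2.4.0 worked and 2.4.4 did not. One thing that may be worth ruling out in a case like this is a Scala binary version mismatch between the cluster and the jars, since the `_2.11` suffix in the jar names indicates the Scala version they were built for. The sketch below is only a diagnostic under that assumption; the helper `scalaBinaryOf` is a hypothetical name, and it can be pasted into a Zeppelin or spark-shell paragraph.

```scala
// Jar names as listed in this thread.
val jars = Seq(
  "pmml4s-spark_2.11-0.9.7.jar",
  "pmml4s_2.11-0.9.7.jar",
  "commons-text-1.6.jar",
  "spray-json_2.11-1.3.5.jar"
)

// Matches names like "artifact_2.11-0.9.7.jar" and captures the Scala binary version.
val ScalaSuffix = """.*_(\d+\.\d+)-[\d.]+\.jar""".r

// Hypothetical helper: returns the Scala binary version encoded in a jar name, if any.
def scalaBinaryOf(jar: String): Option[String] = jar match {
  case ScalaSuffix(version) => Some(version)
  case _                    => None // plain Java jars carry no Scala suffix
}

// Scala binary version of the runtime executing this code, e.g. "2.11" or "2.12".
val runtimeBinary =
  scala.util.Properties.versionNumberString.split('.').take(2).mkString(".")

// Report each jar whose suffix disagrees with the running Scala version.
jars.foreach { jar =>
  scalaBinaryOf(jar) match {
    case Some(v) if v != runtimeBinary =>
      println(s"MISMATCH: $jar targets Scala $v, runtime is Scala $runtimeBinary")
    case Some(v) =>
      println(s"ok: $jar targets Scala $v")
    case None =>
      println(s"skipped: $jar has no Scala suffix")
  }
}
```

A clean result here does not prove the classpath is fine; it only excludes one possible cause.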