dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0
26.23k stars 8.72k forks

Cannot Load model using PySpark xgboost4j #6631

Closed. WillSmisi closed this issue 2 years ago

WillSmisi commented 3 years ago

After saving the model and loading it again, I get the following error:

IllegalArgumentException: u'requirement failed: Error loading metadata: Expected class name org.apache.spark.ml.Pipeline but found class name org.apache.spark.ml.PipelineModel'

Can you please help with this? Thanks.

import pyspark
from pyspark.sql.session import SparkSession
from pyspark.sql.types import *
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()  # reuse the active session, or create one
spark.sparkContext.addPyFile("sparkxgb.zip")
from sparkxgb import XGBoostEstimator
schema = StructType(
  [StructField("PassengerId", DoubleType()),
    StructField("Survival", DoubleType()),
    StructField("Pclass", DoubleType()),
    StructField("Name", StringType()),
    StructField("Sex", StringType()),
    StructField("Age", DoubleType()),
    StructField("SibSp", DoubleType()),
    StructField("Parch", DoubleType()),
    StructField("Ticket", StringType()),
    StructField("Fare", DoubleType()),
    StructField("Cabin", StringType()),
    StructField("Embarked", StringType())
  ])

df_raw = spark\
  .read\
  .option("header", "true")\
  .schema(schema)\
  .csv("train.csv")

df = df_raw.na.fill(0)

sexIndexer = StringIndexer()\
  .setInputCol("Sex")\
  .setOutputCol("SexIndex")\
  .setHandleInvalid("keep")

cabinIndexer = StringIndexer()\
  .setInputCol("Cabin")\
  .setOutputCol("CabinIndex")\
  .setHandleInvalid("keep")

embarkedIndexer = StringIndexer()\
  .setInputCol("Embarked")\
  .setOutputCol("EmbarkedIndex")\
  .setHandleInvalid("keep")

vectorAssembler = VectorAssembler()\
  .setInputCols(["Pclass", "SexIndex", "Age", "SibSp", "Parch", "Fare", "CabinIndex", "EmbarkedIndex"])\
  .setOutputCol("features")
xgboost = XGBoostEstimator(
    featuresCol="features", 
    labelCol="Survival", 
    predictionCol="prediction"
)

pipeline = Pipeline().setStages([sexIndexer, cabinIndexer, embarkedIndexer, vectorAssembler, xgboost])
model = pipeline.fit(df)
model.transform(df).select(col("PassengerId"), col("prediction")).show()

model.save("model_xgboost")
loadedModel = Pipeline.load("model_xgboost")

IllegalArgumentException: u'requirement failed: Error loading metadata: Expected class name org.apache.spark.ml.Pipeline but found class name org.apache.spark.ml.PipelineModel'

#predict2 = loadedModel.transform(df)

I tried the following option:

from pyspark.ml import PipelineModel
#model.save("model_xgboost")
loadedModel = PipelineModel.load("model_xgboost")

I get the following error:

No module named ml.dmlc.xgboost4j.scala.spark

Originally posted by @anaveenan in https://github.com/dmlc/xgboost/issues/1698#issuecomment-420472146
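
The first error above appears to come from a class-name check: the directory was written by a fitted `PipelineModel`, but `Pipeline.load` expects an unfitted `Pipeline`. A minimal sketch follows, assuming the model was saved to a local directory and that Spark ML stores one JSON line with a `class` field under `metadata/part-*`. The helper names `saved_class_name` and `python_loader_name` are hypothetical, not part of Spark. It reads that field so the matching loader can be chosen:

```python
import glob
import json
import os


def saved_class_name(model_dir):
    """Return the JVM class name recorded in a Spark ML save directory.

    Assumes a local path laid out as <model_dir>/metadata/part-*,
    where the first line of the part file is a JSON object.
    """
    part_files = sorted(glob.glob(os.path.join(model_dir, "metadata", "part-*")))
    if not part_files:
        raise FileNotFoundError("no metadata part file under " + model_dir)
    with open(part_files[0]) as handle:
        metadata = json.loads(handle.readline())
    return metadata["class"]


def python_loader_name(jvm_class_name):
    """Keep only the final component, e.g. '...ml.PipelineModel' -> 'PipelineModel'."""
    return jvm_class_name.rsplit(".", 1)[-1]
```

For the `model_xgboost` directory in the post, the error message suggests this would report `org.apache.spark.ml.PipelineModel`, which points to `PipelineModel.load` rather than `Pipeline.load`.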

WillSmisi commented 3 years ago

I have the same problem.

I have searched online for a long time without success. Please help, or suggest how to get this working.

Thanks in advance.

WillSmisi commented 3 years ago

Background

I have a small PySpark program that uses xgboost4j and xgboost4j-spark to train a model on a dataset held in a Spark DataFrame.

Training and saving work, but I cannot load the saved model.

Current library versions:

The main process is as follows:

trainingData, testData = data.randomSplit([0.7,0.3])
vectorAssembler = (
      VectorAssembler()
          .setInputCols(numeric_features_new)
          .setOutputCol(FEATURES)
  )
scaler = MinMaxScaler(inputCol = FEATURES,
                      outputCol = FEATURES + '_scaler')
assemblerInputCols = FEATURES + '_scaler'

xgb_params = dict(
        eta=0.1,
        maxDepth=2,
        missing=0.0,
        objective="binary:logistic",
        numRound=5,
        numWorkers=1
    )

xgb = (
      XGBoostClassifier(**xgb_params)
          .setFeaturesCol(assemblerInputCols)
          .setLabelCol(LABEL)
  )

pipeline = Pipeline(stages=[
             vectorAssembler,
             scaler,
             xgb
           ])
print("training model")
pipline_model = pipeline.fit(trainingData)
print("saving model to S3")
pipline_model.write().overwrite().save(modelOssDir)
print("saved model to S3")
print("Loading model...")
pipline_model = PipelineModel.load(modelOssDir)

The error I get:

Traceback (most recent call last):
  File "xgboost.py", line 95, in <module>
    pipline_model = PipelineModel.load(modelOssDir)
  File "/home/admin/1610603211241401722_0/pyspark.zip/pyspark/ml/util.py", line 362, in load
  File "/home/admin/1610603211241401722_0/pyspark.zip/pyspark/ml/pipeline.py", line 242, in load
  File "/home/admin/1610603211241401722_0/pyspark.zip/pyspark/ml/util.py", line 304, in load
  File "/home/admin/1610603211241401722_0/pyspark.zip/pyspark/ml/pipeline.py", line 299, in _from_java
  File "/home/admin/1610603211241401722_0/pyspark.zip/pyspark/ml/wrapper.py", line 227, in _from_java
  File "/home/admin/1610603211241401722_0/pyspark.zip/pyspark/ml/wrapper.py", line 221, in __get_class
ImportError: No module named ml.dmlc.xgboost4j.scala.spark
at com.aliyun.odps.cupid.CupidUtil.errMsg2SparkException(CupidUtil.java:50)
    at com.aliyun.odps.cupid.CupidUtil.getResult(CupidUtil.java:131)
    at com.aliyun.odps.cupid.requestcupid.YarnClientImplUtil.pollAMStatus(YarnClientImplUtil.java:108)
    at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.applicationReportTransform(YarnClientImpl.java:377)
    ... 12 more
21/01/22 11:39:21 ERROR Client: Application diagnostics message: Failed to contact YARN for application application_1611286494541_745555769.
Exception in thread "main" org.apache.spark.SparkException: Application application_1611286494541_745555769 finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1166)
    at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1543)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I have searched online for a long time without success. Please help, or suggest how to get this working.

Thanks in advance.
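
The traceback above ends in `_from_java` and `__get_class` in `pyspark/ml/wrapper.py`, which suggests PySpark derives a Python import path from each stage's JVM class name. Below is a simplified sketch of that mapping, assuming it only swaps the `org.apache.spark` prefix for `pyspark` and otherwise keeps the name unchanged. `python_path_for` is a hypothetical helper, not PySpark's API, and the XGBoost class name is an assumption inferred from the module named in the error message:

```python
def python_path_for(jvm_class_name):
    """Split a JVM class name into the (module, class) pair Python would import."""
    py_name = jvm_class_name.replace("org.apache.spark", "pyspark")
    module, _, cls = py_name.rpartition(".")
    return module, cls


# Built-in stages land inside the pyspark package.
builtin = python_path_for("org.apache.spark.ml.feature.VectorAssembler")

# A third-party stage keeps its JVM package as the Python module path, so a
# module with that exact name must be importable on the Python side.
xgb = python_path_for("ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel")
```

Under that assumption, loading succeeds only if a Python package named `ml.dmlc.xgboost4j.scala.spark` is importable on the driver, which would be consistent with the `ImportError` reported here.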

trivialfis commented 2 years ago

Closing as PySpark is not supported at the moment.

Luka3K commented 2 years ago

Closing as PySpark is not supported at the moment.

#1698 (comment)

Have you resolved this problem? I have run into the same one and, after a lot of searching, still cannot find a fix.

zzwtop1 commented 1 year ago

Same problem: "pyspark.sql.utils.IllegalArgumentException: requirement failed: Error loading metadata: Expected class name org.apache.spark.ml.classification.RandomForestClassificationModel but found class name org.apache.spark.ml.classification.RandomForestClassifier"

trivialfis commented 1 year ago

That is a random forest classification model. Are you sure this is related to XGBoost?