Closed by irene3030 4 years ago
@irene3030 There was an issue with the way MLeap handles sparse rows when predicting with XGBoost. The fix (PR-205) was merged a couple of days ago. Can you build master and check whether this is still an issue?
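To illustrate why sparse-row handling matters for tree models: by default XGBoost treats entries that are absent from a sparse row as missing and sends them down a learned default branch, whereas an explicit 0.0 in a dense row is compared against the split threshold. The sketch below is a toy stand-in in plain Python, not MLeap or XGBoost code; the single-split tree, its threshold, its leaf values, and the helper names are all made up for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical single-split tree on feature 0 (all values are assumptions):
# values below THRESHOLD go to the left leaf, others to the right leaf,
# and missing values follow the default direction.
THRESHOLD = 0.5
LEFT_LEAF = 10.0
RIGHT_LEAF = 20.0
MISSING_GOES_LEFT = False


def predict(value: Optional[float]) -> float:
    """Score one feature value; None stands for 'missing'."""
    if value is None:
        return LEFT_LEAF if MISSING_GOES_LEFT else RIGHT_LEAF
    return LEFT_LEAF if value < THRESHOLD else RIGHT_LEAF


def from_dense(row: List[float], index: int) -> Optional[float]:
    """Dense row: every entry, including 0.0, is an observed value."""
    return row[index]


def from_sparse(row: Dict[int, float], index: int) -> Optional[float]:
    """Sparse row: an index that is not stored is reported as missing."""
    return row.get(index)


dense_row = [0.0, 3.0]
sparse_row = {1: 3.0}  # the same vector, with the zero entry not stored

dense_prediction = predict(from_dense(dense_row, 0))
sparse_prediction = predict(from_sparse(sparse_row, 0))
print(dense_prediction, sparse_prediction)
```

If one runtime feeds the model dense rows and the other feeds it sparse rows, the same input can land in different leaves. This is one plausible mechanism for diverging predictions, not necessarily the exact defect fixed in the pull request.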
Hello again,
First of all: thank you very much for your input and for fixing this issue.
I did exactly what you suggested: I downloaded the master branch (0.16.0-SNAPSHOT) and built the whole project. It worked like a charm! I no longer have the problem, and the predictions are the same as the ones obtained using Spark.
I did have one issue, FYI (just in case anyone bumps into this as well): I had to manually package some of the modules, namely mleap-xgboost-spark (sbt mleap-xgboost-spark/package) and mleap-xgboost-runtime (sbt mleap-xgboost-runtime/package), since packaging from the root directory did not include those modules.
My colleagues and I are very grateful :)
Great, thanks @talalryz for all your help!
Is it OK to close this issue? The changes will be included in the next release.
This is great, thanks @talalryz !
We, at Yelp, had been struggling with this bug ourselves so we're glad we could help others out along the way :)
Hello,
I am currently working on a project where a machine learning model has been created using Apache Spark and XGBoost4J. In order to deploy this model in a production environment, I've used MLeap and its extension for XGBoost to serialize my pipeline, which includes the following stages: StringIndexer, OneHotEncoderEstimator, VectorAssembler and an XGBoost regression model.
When reading the MLeap Bundle object, I find that the predictions obtained using the serialized XGBoost model included in this object are very different from the ones obtained using the XGBoost model directly with Spark and XGBoost4J-Spark.
Here is how I create my pipeline, train the model and wrap it in a MLeap object:
(Just in case it is not clear, PortatilesModelConstants contains constants such as the names of the columns I am working with.)
And here is how I read the MLeap object and test the pipeline using the testSet. First I obtain my test set transformed through the serialized pipeline. Then I transform it back to a Spark DataFrame and compute the MAE metric:
Both the metrics and the predicted values obtained with testSetTransformed and testSetTransformed2 are different:
Here is a small sample of the test data, showing that the predictions differ:
Attached to this message, you may find
I would very much appreciate any help you could give me.
Thanks a lot,
Irene
mleap_issue.zip
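The comparison described in this report (scoring the same test set through Spark and through the MLeap bundle, then comparing the MAE and the per-row predictions) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the numbers are invented and are not the data from this issue, and the function and variable names are hypothetical, not part of Spark or MLeap.

```python
from typing import Sequence


def mean_absolute_error(labels: Sequence[float], predictions: Sequence[float]) -> float:
    """Average of |label - prediction| over all rows."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("at least one row is required")
    return sum(abs(y - p) for y, p in zip(labels, predictions)) / len(labels)


def max_abs_difference(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest per-row gap between two sets of predictions."""
    if len(a) != len(b):
        raise ValueError("prediction sets must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


# Made-up values standing in for the label column and the two prediction columns.
labels = [100.0, 200.0, 300.0]
spark_predictions = [110.0, 190.0, 300.0]
mleap_predictions = [150.0, 120.0, 330.0]

mae_spark = mean_absolute_error(labels, spark_predictions)
mae_mleap = mean_absolute_error(labels, mleap_predictions)
largest_gap = max_abs_difference(spark_predictions, mleap_predictions)

print(f"MAE (Spark): {mae_spark:.2f}")
print(f"MAE (MLeap): {mae_mleap:.2f}")
print(f"Largest per-row difference: {largest_gap:.2f}")
```

Checking the largest per-row difference alongside the MAE helps separate a genuine serialization problem from small floating-point noise: two runtimes loading the same model should agree to within a tight tolerance on every row.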