iamyihwa closed this issue 4 months ago
Hey. I'm not sure if there's a guarantee that the local and distributed models will be the same. Once you have both you can compare them with the trees_to_dataframe method, e.g.
dstr_df = fcst_dstr.models_['SparkXGBForecast'].get_booster().trees_to_dataframe()
local_df = fcst.models_['XGBRegressor'].get_booster().trees_to_dataframe()
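To make that comparison concrete, here is a minimal sketch that lines up the two tree dataframes node by node and keeps only the nodes that differ. It assumes both frames come from trees_to_dataframe() and therefore have Tree, Node, Feature and Split columns; compare_trees is a hypothetical helper name, not part of mlforecast or XGBoost.

```python
import pandas as pd


def compare_trees(local_df: pd.DataFrame, dstr_df: pd.DataFrame) -> pd.DataFrame:
    """Return the nodes whose split feature or threshold differs.

    Hypothetical helper: both inputs are assumed to come from
    Booster.trees_to_dataframe(), with one row per tree node.
    """
    keys = ["Tree", "Node"]
    cols = keys + ["Feature", "Split"]
    merged = local_df[cols].merge(
        dstr_df[cols],
        on=keys,
        how="outer",
        suffixes=("_local", "_dstr"),
        indicator=True,  # adds a _merge column: both / left_only / right_only
    )
    in_both = merged["_merge"] == "both"
    same_feature = merged["Feature_local"].eq(merged["Feature_dstr"])
    # Leaf rows have NaN in Split, so treat NaN on both sides as equal.
    both_nan = merged["Split_local"].isna() & merged["Split_dstr"].isna()
    same_split = merged["Split_local"].eq(merged["Split_dstr"]) | both_nan
    # Keep nodes that exist in only one model or that split differently.
    return merged[~(in_both & same_feature & same_split)]


# Usage with the frames built above:
# diff = compare_trees(local_df, dstr_df)
# print(len(diff), "differing nodes")
```

An empty result means the two models have the same structure and thresholds. Since the thresholds are compared exactly, even tiny floating point differences will show up as differing nodes.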
About the predict method, how are you running it?
Sure! Thanks I will try checking the trees!
Another thing I noticed is that SparkXGBForecast gives different results on the same dataset, whereas the non-distributed version gives consistent results. I didn't set a random seed for either model.
from xgboost.spark import SparkXGBRegressor
spark_xgb = SparkXGBRegressor(num_workers=8, label_col='target', features_col='features')
xgb_regressor_model = spark_xgb.fit(train_sf_tf)
transformed_test_spark_dataframe = spark_xgb.predict(test_spark_dataframe)
With the last line, I am getting this error: AttributeError: 'SparkXGBRegressor' object has no attribute 'predict'
I think you should use the trained model, i.e. xgb_regressor_model.predict instead of spark_xgb.predict
I'm still getting the same error when I do:
transformed_test_spark_dataframe = xgb_regressor_model.predict(test_spark_dataframe)
AttributeError: 'SparkXGBRegressorModel' object has no attribute 'predict'
Hmm, if it's an MLlib estimator it may be called transform instead of predict. Can you try that? If it works I think you should open an issue in XGBoost so that they update the documentation.
Yes, the Spark interface uses the name transform instead of predict to align with Spark ML. Did you find the document being inconsistent in XGBoost?
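For anyone landing here with the same AttributeError, a minimal sketch of the corrected flow: fit the estimator, then call transform on the fitted model. fit_and_score is a hypothetical helper written only for illustration; the commented usage reuses the variable names from the snippet earlier in this thread and assumes a running Spark session.

```python
def fit_and_score(estimator, train_df, test_df):
    """Fit a Spark ML style estimator and score test_df with the fitted model."""
    # fit() returns a separate fitted model object (for SparkXGBRegressor,
    # a SparkXGBRegressorModel); the estimator itself cannot score data.
    model = estimator.fit(train_df)
    # Spark ML models score with transform(), not predict(). transform()
    # returns the input DataFrame with the prediction column appended.
    return model.transform(test_df)


# Usage with the objects from this thread (needs a running Spark session):
# from xgboost.spark import SparkXGBRegressor
# spark_xgb = SparkXGBRegressor(num_workers=8, label_col='target', features_col='features')
# predictions = fit_and_score(spark_xgb, train_sf_tf, test_spark_dataframe)
```

The helper only relies on the fit/transform convention, so it applies to any Spark ML estimator and is not specific to XGBoost.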
Hey. I'm not sure if there's a guarantee that the local and distributed models will be the same
No, the training results are expected to differ. However, given a trained model, predictions are expected to be the same.
Did you find the document being inconsistent in XGBoost?
@iamyihwa referred to this document which seems to be outdated, since the first object is called spark_reg_estimator but then fit and predict are called on xgb_regressor. @trivialfis if you agree that these are indeed wrong I can work on a fix.
No, the training results are expected to differ
This was my impression as well. So I believe we can close this issue since the differences are expected.
Thank you for pointing it out! A PR is welcome!
What happened + What you expected to happen
Hi, I am testing the XGBoost model in the mlforecast library in both distributed (PySpark) and non-distributed modes, and the two give quite different results. See the two columns.
I understand that the differences come from the XGBoost library, not from mlforecast, so I created an issue there as well. While creating an example, I noticed that I get an error when I call predict() on the Spark version ('SparkXGBRegressor' object has no attribute 'predict'), although predict() appears in their example.
Any tips??
Versions / Dependencies
0.11.6
Reproduction script