[Model] Distributed and non-distributed versions of the model give different results

iamyihwa commented 5 months ago

What happened + What you expected to happen

Hi, I am testing the XGBoost model in the mlforecast library in both distributed (PySpark) and non-distributed mode, and the two give quite different results. See the two columns.

I understand that the difference comes from the xgboost library rather than from mlforecast, so I created an issue there as well. While putting together an example, I noticed that I get an error when I try to use predict() on the Spark version ('SparkXGBRegressor' object has no attribute 'predict'), although it appears in their example.

Any tips?

Versions / Dependencies

0.11.6

Reproduction script

from pyspark.sql import SparkSession
from mlforecast.distributed import DistributedMLForecast
from mlforecast.utils import generate_daily_series

# Reuse the active Spark session or start one (notebooks usually predefine `spark`)
spark = SparkSession.builder.getOrCreate()
numPartitions = 4
series = generate_daily_series(100, n_static_features=2, equal_ends=True, static_as_categorical=False)
# Workaround: pandas 2.0 removed DataFrame.iteritems, which older PySpark versions still call
series.iteritems = series.items
spark_series = spark.createDataFrame(series).repartitionByRange(numPartitions, 'unique_id')

from mlforecast.distributed.models.spark.xgb import SparkXGBForecast
from window_ops.expanding import expanding_mean
models = [ SparkXGBForecast()]

fcst_dstr = DistributedMLForecast(
    models,
    freq='D',
    lags=[1],
    lag_transforms={
        1: [expanding_mean]
    },
    date_features=['dayofweek'],
)
fcst_dstr.fit(
    spark_series,
    static_features=['static_0', 'static_1'],
)

preds_dstr = fcst_dstr.predict(14).toPandas()

from mlforecast import MLForecast
from xgboost import XGBRegressor

models = [XGBRegressor()]

fcst = MLForecast(
    models,
    freq='D',
    lags=[1],
    lag_transforms={
        1: [expanding_mean]
    },
    date_features=['dayofweek'],
)
fcst.fit(
    series,
    static_features=['static_0', 'static_1'],
)

preds = fcst.predict(14)

import pandas as pd 
joined = pd.merge(preds, preds_dstr, on = ['unique_id', 'ds'])

joined.tail()



### Issue Severity

None

jmoralez commented 5 months ago

Hey. I'm not sure if there's a guarantee that the local and distributed models will be the same. Once you have both you can compare them with the trees_to_dataframe method, e.g.

dstr_df = fcst_dstr.models_['SparkXGBForecast'].get_booster().trees_to_dataframe()
local_df = fcst.models_['XGBRegressor'].get_booster().trees_to_dataframe()
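
As a rough sketch of what such a comparison could look like, the helper below condenses a trees_to_dataframe() result into a few numbers. The name summarize_trees and the toy frames are hypothetical and only stand in for the real outputs; the columns used (Tree, Feature, Gain) are ones trees_to_dataframe returns, where leaf rows have Feature set to "Leaf" and store the leaf value in Gain.

```python
import pandas as pd


def summarize_trees(trees):
    """Condense a Booster.trees_to_dataframe() result (hypothetical helper)."""
    # Leaf rows carry the leaf value in "Gain", so exclude them from gain totals.
    splits = trees[trees["Feature"] != "Leaf"]
    return {
        "n_trees": int(trees["Tree"].nunique()),
        "n_nodes": int(len(trees)),
        "total_gain_by_feature": splits.groupby("Feature")["Gain"].sum().to_dict(),
    }


# Toy stand-ins for local_df and dstr_df (made-up values, not real model output).
toy_local = pd.DataFrame({
    "Tree": [0, 0, 0],
    "Node": [0, 1, 2],
    "Feature": ["lag1", "Leaf", "Leaf"],
    "Gain": [10.0, 0.1, -0.1],
})
toy_dstr = pd.DataFrame({
    "Tree": [0, 0, 0, 0, 0],
    "Node": [0, 1, 2, 3, 4],
    "Feature": ["lag1", "dayofweek", "Leaf", "Leaf", "Leaf"],
    "Gain": [8.0, 2.0, 0.1, 0.2, -0.1],
})

local_summary = summarize_trees(toy_local)
dstr_summary = summarize_trees(toy_dstr)
print(local_summary)
print(dstr_summary)
```

With the fitted objects, the same helper would be called on dstr_df and local_df instead of the toy frames. Differing node counts or gain totals would indicate that the two training runs grew different trees.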

About the predict method, how are you running it?

iamyihwa commented 5 months ago

Sure, thanks! I will try checking the trees.

Another thing I noticed is that SparkXGBForecast gives different results across runs on the same dataset, whereas the non-distributed version gives consistent results. I didn't set a random seed for either model.

from xgboost.spark import SparkXGBRegressor
spark_xgb = SparkXGBRegressor(num_workers=8, label_col='target', features_col='features')
xgb_regressor_model = spark_xgb.fit(train_sf_tf)
transformed_test_spark_dataframe = spark_xgb.predict(test_spark_dataframe)

With the last line, I get this error: AttributeError: 'SparkXGBRegressor' object has no attribute 'predict'

jmoralez commented 5 months ago

I think you should use the trained model, i.e. xgb_regressor_model.predict instead of spark_xgb.predict

iamyihwa commented 5 months ago

I still get the same kind of error when I run:

transformed_test_spark_dataframe = xgb_regressor_model.predict(test_spark_dataframe)

AttributeError: 'SparkXGBRegressorModel' object has no attribute 'predict'

jmoralez commented 5 months ago

Hmm, if it's an MLlib estimator it may be called transform instead of predict. Can you try that? If it works I think you should open an issue in XGBoost so that they update the documentation.

trivialfis commented 4 months ago

> Hmm, if it's an MLlib estimator it may be called transform instead of predict. Can you try that? If it works I think you should open an issue in XGBoost so that they update the documentation.

Yes, the spark interface uses the name transform instead of predict to align with sparkml. Did you find the document being inconsistent in XGBoost?
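
Because the Spark interface exposes transform while the local scikit-learn-style regressor exposes predict, code that has to handle both can dispatch on whichever method exists. The sketch below is only an illustration: the score helper and the two stand-in classes are hypothetical, and they replace the real fitted models so that the example runs without Spark or XGBoost installed.

```python
def score(model, data):
    """Run inference with whichever method name the fitted model exposes."""
    if hasattr(model, "transform"):  # MLlib-style fitted model
        return model.transform(data)
    if hasattr(model, "predict"):  # scikit-learn-style fitted model
        return model.predict(data)
    raise AttributeError(
        f"{type(model).__name__} has neither transform nor predict"
    )


# Hypothetical stand-ins for a fitted Spark model and a fitted local model.
class FakeSparkModel:
    def transform(self, data):
        return [("row", x) for x in data]


class FakeLocalModel:
    def predict(self, data):
        return [x * 2 for x in data]


spark_out = score(FakeSparkModel(), [1, 2])
local_out = score(FakeLocalModel(), [1, 2])
print(spark_out)
print(local_out)
```

For the objects in this thread, score(xgb_regressor_model, test_spark_dataframe) would resolve to xgb_regressor_model.transform(test_spark_dataframe). The check order assumes the objects passed in are fitted predictors; a scikit-learn transformer also has a transform method and would need separate handling.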

> Hey. I'm not sure if there's a guarantee that the local and distributed models will be the same

No, the training results are expected to differ. However, given a trained model, predictions are expected to be the same.
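
Since the two training runs are expected to differ, it can help to quantify by how much the forecasts diverge. The following is a minimal sketch, assuming the merged frame has one prediction column per model named XGBRegressor and SparkXGBForecast; the compare_predictions helper and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def compare_predictions(joined, local_col, dist_col, rtol=1e-5, atol=1e-8):
    """Summarize the gap between two prediction columns (hypothetical helper)."""
    diff = (joined[local_col] - joined[dist_col]).abs()
    return {
        "max_abs_diff": float(diff.max()),
        "mean_abs_diff": float(diff.mean()),
        "all_close": bool(
            np.allclose(joined[local_col], joined[dist_col], rtol=rtol, atol=atol)
        ),
    }


# Toy stand-in for the merged frame (made-up values, not real forecasts).
toy_joined = pd.DataFrame({
    "unique_id": ["id_0", "id_0", "id_1"],
    "ds": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-01"]),
    "XGBRegressor": [1.0, 2.0, 3.0],
    "SparkXGBForecast": [1.0, 2.5, 3.0],
})

summary = compare_predictions(toy_joined, "XGBRegressor", "SparkXGBForecast")
print(summary)
```

The same helper applied to predictions from a single trained model, scored once locally and once through Spark, would be expected to report all_close as True.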

jmoralez commented 4 months ago

> Did you find the document being inconsistent in XGBoost?

@iamyihwa referred to this document, which seems to be outdated: the first object is called spark_reg_estimator, but fit and predict are then called on xgb_regressor. @trivialfis, if you agree that these are indeed wrong, I can work on a fix.

> No, the training results are expected to differ

This was my impression as well. So I believe we can close this issue since the differences are expected.

trivialfis commented 4 months ago

Thank you for pointing it out! A PR is welcome!
