dask / dask-xgboost

BSD 3-Clause "New" or "Revised" License

Verify the benchmark of XgboostClassifier with initial xgboost #44

Open xiaozhongtian opened 5 years ago

xiaozhongtian commented 5 years ago

Hello, I think I may have found a bug in the XGBClassifier from dask-xgboost (imported via dask_ml.xgboost).

from sklearn.datasets import load_iris
import dask.dataframe as dd
import pandas as pd
dataset = load_iris()
train = dataset.data
target = dataset.target

pdf = pd.DataFrame(data = train,columns=["1","2","3","4"])
pdf_y = pd.Series(target)

# Collapse the multi-class problem into a binary one to show the bug more easily.
pdf_y.replace(2,1,inplace =True) 

from xgboost import XGBClassifier
est = XGBClassifier(n_estimators=30,max_depth=7,verbosity=0,learning_rate= 0.1)

est.fit(pdf, pdf_y)
est.score(pdf, pdf_y)

With plain xgboost, we easily get 100% accuracy.

from dask_ml.xgboost import XGBClassifier
from distributed import Client

client = Client()
est = XGBClassifier(n_estimators=30,max_depth=7,verbosity=1,learning_rate= 0.1)
df = dd.from_pandas(pdf,chunksize=640000)
df_y = dd.from_pandas(pdf_y,chunksize=640000).astype(int)
est.fit(df, df_y )
est.score(df, df_y )

With the same parameters and the same data, we only get 66% accuracy, because predict() returns 1 every time. The 66% figure is meaningless: it is simply the share of class 1 in the data (100 of the 150 samples after merging class 2 into class 1).
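To make the 66% figure concrete, here is a minimal sketch of the accuracy obtained by a model that always predicts the most frequent label. It assumes the iris labels are 50 samples per class (as returned by sklearn.datasets.load_iris) and rebuilds them by hand so that it needs nothing beyond the standard library; `constant_baseline_accuracy` is a hypothetical helper name, not part of xgboost or dask.

```python
from collections import Counter

# Assumption: the iris dataset has 50 samples in each of its 3 classes,
# ordered 0, 1, 2, as returned by sklearn.datasets.load_iris().
labels = [0] * 50 + [1] * 50 + [2] * 50

# Same relabeling as in the report: class 2 is merged into class 1.
binary_labels = [1 if y == 2 else y for y in labels]


def constant_baseline_accuracy(y_true):
    """Return the majority label and the accuracy of always predicting it."""
    majority_label, majority_count = Counter(y_true).most_common(1)[0]
    return majority_label, majority_count / len(y_true)


majority_label, baseline = constant_baseline_accuracy(binary_labels)
print(majority_label, round(baseline, 4))  # prints: 1 0.6667
```

A score equal to this baseline is a strong hint that the estimator is returning a single class for every row rather than learning anything from the features.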

This is a simple example to show the bug. I have also tested it in my project with the Titanic dataset, and it has the same problem.

est.predict(df).compute() returns 1 for every row of df.

TomAugspurger commented 5 years ago

Does the same issue affect distributed XGBoost without dask (e.g. https://xgboost.readthedocs.io/en/release_0.72/tutorials/aws_yarn.html)?

xiaozhongtian commented 5 years ago

I haven't tried it yet; I may try next Monday. However, I noticed that https://xgboost.readthedocs.io/en/release_0.72/tutorials/aws_yarn.html no longer exists in the latest version of the docs: https://xgboost.readthedocs.io/en/latest/tutorials/aws_yarn.html. That's really interesting.