Closed kylejn27 closed 4 years ago
Can you give an example of what evals_result does? I don't understand their docs. Is it a parameter that's passed into train and mutated in place?
It's a history of the evaluations at each iteration. From what I understand, for every iteration of the train step, the resulting evaluation metric is appended to this evals_result dictionary.
evals_result is added to a record_evaluation callback here: https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/training.py#L207
Here's the callback code in dmlc/xgboost: https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/callback.py#L60
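To make the "mutated in place" behavior concrete, here is a simplified sketch of the record_evaluation callback pattern (a hypothetical reimplementation for illustration, not dmlc/xgboost's actual code; see the callback.py link above for the real thing). The caller supplies a dict, and the callback appends each round's metrics to it:

```python
# Simplified sketch of a record_evaluation-style callback (hypothetical;
# the real implementation is in xgboost's callback.py linked above).

def record_evaluation(eval_result):
    """Return a callback that appends each round's metrics to eval_result."""
    if not isinstance(eval_result, dict):
        raise TypeError("eval_result must be a dictionary")

    def callback(evaluation_result_list):
        # evaluation_result_list looks like [("validation_0-logloss", 0.636), ...]
        for key, metric_value in evaluation_result_list:
            data_name, metric_name = key.split("-", 1)
            eval_result.setdefault(data_name, {}).setdefault(metric_name, [])
            eval_result[data_name][metric_name].append(metric_value)

    return callback

# Usage: the training loop invokes the callback once per boosting round,
# so the caller's dict fills up in iteration order.
history = {}
cb = record_evaluation(history)
cb([("validation_0-logloss", 0.636035)])
cb([("validation_0-logloss", 0.588901)])
# history == {"validation_0": {"logloss": [0.636035, 0.588901]}}
```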
Here's an example of what should work with dask-xgboost but isn't currently implemented
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
import dask.dataframe as dd
from dask.distributed import Client
import dask_xgboost as dxgb

# Make client
client = Client()

# Data setup
data = make_classification(
    n_samples=1000,
    n_features=20,
)
X = pd.DataFrame(data[0])
X.columns = [f'var{i}' for i in range(20)]
y = pd.DataFrame(data[1])
y.columns = ['target']
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Convert train set to dask dataframes
X_train = dd.from_pandas(X_train, npartitions=1)
y_train = dd.from_pandas(y_train, npartitions=1)

# Model train
model = dxgb.XGBClassifier()
eval_set = [(X_test, y_test)]
eval_metric = "logloss"
model.fit(
    X_train,
    y_train,
    classes=[0, 1],
    early_stopping_rounds=4,
    eval_set=eval_set,
    eval_metric=eval_metric,
)
>>> print(model.evals_result())
{'validation_0': {'logloss': [0.636035, 0.588901, 0.550328, 0.520794, 0.490704, 0.466473, 0.444285, 0.424012, 0.407814, 0.392962, 0.382754, 0.374211, 0.36438, 0.35752, 0.353132, 0.349372, 0.343062, 0.338768, 0.336939, 0.334325, 0.330798, 0.330391, 0.329221, 0.329147, 0.326469, 0.325981, 0.325691, 0.32589, 0.326658, 0.326615]}}
Thanks for the clear example.
My main questions now are: where are the evals evaluated, and does the order matter? If they're evaluated on the workers as part of distributed training, then I don't think we can make any guarantee about the order of these results as they come in (I could be misunderstanding what happens though).
I'm still learning how xgboost (and distributed xgboost) works, so I could be incorrect, but I'll try to explain this to the best of my ability.
where are the evals evaluated
The evaluations are triggered on each worker by a call to the bst.eval_set method immediately after each bst.update call, inside the boosting-round loop.
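The update-then-evaluate loop can be sketched roughly like this (a simplified stand-in with fake classes, assuming the structure described above; the real loop lives in xgboost's training.py linked earlier). Because eval_set runs once per round, right after update, results land in the history in iteration order:

```python
# Simplified sketch of xgboost's boosting-round loop (hypothetical stand-in
# booster; real code is in training.py / callback.py linked above).

class FakeBooster:
    """Stand-in for xgboost.Booster whose logloss shrinks each round."""
    def __init__(self):
        self.rounds = 0

    def update(self, dtrain, iteration):
        self.rounds += 1  # pretend to train one more boosting round

    def eval_set(self, evals, iteration):
        # The real eval_set returns a string like "[0]\tvalidation_0-logloss:0.63"
        return f"[{iteration}]\tvalidation_0-logloss:{0.6 / self.rounds:.4f}"

def train(num_boost_round, evals, evals_result):
    bst = FakeBooster()
    for i in range(num_boost_round):
        bst.update(dtrain=None, iteration=i)
        # Evaluation happens right after the update, inside the round loop,
        # so metrics are appended in iteration order.
        msg = bst.eval_set(evals, i)
        for item in msg.split("\t")[1:]:
            key, val = item.split(":")
            data_name, metric_name = key.split("-", 1)
            evals_result.setdefault(data_name, {}).setdefault(
                metric_name, []).append(float(val))
    return bst

history = {}
train(3, evals=[("validation_0", None)], evals_result=history)
# history["validation_0"]["logloss"] == [0.6, 0.3, 0.2]
```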
does the order matter? If they're evaluated on the workers as part of distributed training, then I don't think we can make any guarantee about the order of these results as they come in
I believe order matters: each value in the eval list represents one iteration of the train portion of the algorithm. You'd want to see how the model progressed over time across iterations; jumbling that up would make the result unusable.
I don't fully understand the underlying distributed xgboost algorithm, but if the model is updated on each worker after each train iteration, so that between rounds the model is identical, then the eval results should be identical across all of the workers. I can't point to a spot in the code that proves this, but in my testing the results have been deterministic and in the right order.
Hello,
Currently the dask-xgboost package's train result does not return evals_result. I'm thinking it can be implemented in a similar way to https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/dask.py#L348
I'd be happy to open a PR with this change myself, but I'd like to get feedback on your thoughts about this implementation, because I imagine having the existing train method return a dictionary rather than the booster object will cause breaking changes for those who are currently using this library. If this package will be moving to dmlc/xgboost anyway then maybe this is acceptable; otherwise there's probably a cleaner way to return evals_result to the user.
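One shape this could take, sketched under the assumption that we mirror dmlc's dask.py and return a dict holding both the booster and the history (all names here are illustrative, not dask-xgboost's actual API):

```python
# Hypothetical sketch of a dict-returning train wrapper; names are
# illustrative and not part of dask-xgboost's real API.

def train_with_history(train_fn, *args, **kwargs):
    """Wrap a train function so it returns {'booster': ..., 'history': ...}."""
    evals_result = {}
    booster = train_fn(*args, evals_result=evals_result, **kwargs)
    return {"booster": booster, "history": evals_result}

# Usage with a stand-in train function that fills the dict in place:
def fake_train(evals_result=None):
    evals_result["validation_0"] = {"logloss": [0.63, 0.58]}
    return "booster-object"

result = train_with_history(fake_train)
# result["booster"] == "booster-object"
# result["history"]["validation_0"]["logloss"] == [0.63, 0.58]
```

Returning a dict rather than a bare booster is the breaking change mentioned above; a tuple return or an output parameter would be alternative, less disruptive options.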