dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

What happens to test_cv_fold when using "weight" to form the train dmatrix. #9111

Open abhishek0093 opened 1 year ago

abhishek0093 commented 1 year ago

Hello Community, I have a doubt that I'd be very happy to get help with. When constructing the DMatrix for my training dataset, I used the "weight" parameter to set the weight of each sample. Now I intend to use xgboost.cv() to cross-validate my model, but I'm not confident how the model treats the test_fold in this case, given the "weight" parameter.

Here is my thinking: the "weight" parameter only plays a role while constructing the trees, and once the model is ready, it predicts new incoming data based on the constructed trees. So if I use 5-fold cross_validation, "weight" will only play a role in the 4 train_folds, while the 5th test_fold would be judged purely on the basis of the true labels and predicted labels, so "weight" wouldn't have any role to play there.

Please share your views on this. Thank you everybody in advance. 😄
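For concreteness, a minimal sketch of the setup being described, using hypothetical data; the DMatrix "weight" argument and xgboost.cv() are the standard Python API:

```python
import numpy as np
import xgboost as xgb

# Hypothetical data: 100 samples, 5 features, one weight per sample.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)
w = rng.uniform(0.1, 2.0, size=100)

# Per-sample weights are attached to the training DMatrix.
dtrain = xgb.DMatrix(X, label=y, weight=w)

# 5-fold cross-validation; each row of the result holds the mean/std of the
# train and test RMSE across the folds for one boosting round.
results = xgb.cv(
    params={"objective": "reg:squarederror", "max_depth": 3},
    dtrain=dtrain,
    num_boost_round=50,
    nfold=5,
    metrics="rmse",
    seed=42,
)
print(results[["train-rmse-mean", "test-rmse-mean"]].tail())
```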

trivialfis commented 1 year ago

The weight is also used for evaluation (i.e., calculating the metric).
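A minimal sketch of a weight-aware RMSE of this kind, assuming the metric normalizes the weighted squared error by the sum of the weights (so it reduces to the ordinary RMSE when all weights are equal to 1):

```python
import numpy as np

def weighted_rmse(y_true, y_pred, weight):
    """Weight-aware RMSE: sqrt(sum(w * (y_pred - y_true)^2) / sum(w))."""
    err2 = (np.asarray(y_pred) - np.asarray(y_true)) ** 2
    w = np.asarray(weight)
    return np.sqrt(np.sum(w * err2) / np.sum(w))

# With equal weights this matches the plain RMSE; with unequal weights the
# heavily weighted rows dominate the evaluation score.
print(weighted_rmse([1.0, 2.0], [1.5, 1.0], [1.0, 1.0]))  # equal weights
print(weighted_rmse([1.0, 2.0], [1.5, 1.0], [4.0, 1.0]))  # unequal weights
```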

abhishek0093 commented 1 year ago

Hi @trivialfis, are you saying that the weights assigned to the test_fold of cross_validation will also play a role while evaluating the performance on the k-fold cross_validation sets? I'm not sure how that is done. Can you help me understand that?

Let's say we have data points {x1 ... x10} with corresponding weights {w1, w2, ... w10}. We split the data into train and test such that {(x1, w1) ... (x8, w8)} ∈ train and {(x9, w9), (x10, w10)} ∈ test. Now we do 5-fold CV on the train data, so let's say at a particular instant we get {(x1, w1) ... (x6, w6)} in the train_fold and {(x7, w7), (x8, w8)} in the test_fold. The model gets trained on the train_fold and is ready. We validate it on the test_fold data points; let the predictions on the test_fold be x7' and x8', so the cv_rmse would be sqrt(((x7' - x7)^2 + (x8' - x8)^2)/2). I don't get how w7 and w8 fit into this equation.

I believe they might play a role in another train_fold (when the data is split differently) in training the model, but if a data point is in the test_fold of a CV split or in the test data, its weight doesn't play a role in model building/evaluation.

Please correct me if I'm wrong here.
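For reference, if the test_fold metric is weight-aware in the sense described above (normalizing the weighted squared error by the sum of the weights, which is an assumption about the built-in rmse rather than something spelled out in this thread), then w7 and w8 would enter the fold's score as sqrt((w7*(x7' - x7)^2 + w8*(x8' - x8)^2)/(w7 + w8)), which reduces to the unweighted expression above when w7 = w8.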