VincentThiemann opened this issue 1 year ago
Is the Bayes HPO framework using multiple threads? If so, the access to the DMatrix object can be problematic.
Yes, it is using multiple threads. I can limit it to one thread to see if that stops the segmentation faults. How would multiple threads accessing the DMatrix object cause a segmentation fault? I have seen a variety of guides and documentation online using multi-threaded Bayes search with XGBoost.
It's fine to use multiple threads only if you create the DMatrix within each thread. Normally people do this with sklearn estimators like XGBClassifier along with sklearn utilities like cross_val_score.
I will submit a patch to guard against concurrent access, probably by simply prohibiting it.
I have replaced xgb.cv with XGBClassifier and cross_val_score, and I have also tried placing the DMatrix creation within the xgb_cv black-box function; however, I am still getting the same segfault.
I have also attempted switching my Bayes opt library to scikit-optimize and limiting it to a single thread, to no avail (same segfault).
Also, sorry for the delay in responses, I work more than one job and don't always have access to the computer I'm doing this on.
Got it, looking into this. Thank you for sharing!
Hi, can I access the dataset (maybe in private)? So far, using:

```python
from sklearn.datasets import make_classification

n_classes = 3
X, y = make_classification(
    n_samples=int(1e6), n_features=118, n_classes=n_classes, n_informative=96
)
```

The script is running fine (still running). Alternatively, could you please help verify whether the above snippet for generating the dataset also segfaults on your system?
By the way, I fixed the random_state to 1994 for both data generation and Bayes optimization.
I'm stopping the run now. Haven't been able to reproduce it.
I will run the above snippet and see whether it works soon (likely in a few hours). I have found that a simple GridSearchCV also returns the same segfault, so the issue is not limited to Bayes optimization.
Interestingly, it seems that xgb.cv and sklearn cross-validation work for me when not used within some sort of hyperparameter-tuning function (GridSearchCV, RandomizedSearchCV, Bayes, etc.).
I'll edit this comment with what occurs when I run the snippet you sent @trivialfis. I'll also add the broken GridSearchCV code I used and the code in which cross validation does work.
Thank you for testing it. Please keep us posted, I will dig into it once I can reproduce the issue.
I am trying to hyperparameter-tune an XGBoost model using the bayesian-optimization library, and I continually get a segmentation fault during XGBoost cross-validation, regardless of how large or small my training data is.
I have a dataset with 118 features and about 1.7 million data points, which takes up about 5.6 GB of space. Whenever I run the following code, I always get a segmentation fault:
I've used faulthandler to trace back the last calls before the segmentation fault, and it outputs one of two traces:
or
For the first segmentation fault, line 1918 in core.py (the last traced call) calls XGBoosterUpdateOneIter in xgboost's C library.
For the second segmentation fault, line 1989 in core.py calls XGBoosterEvalOneIter in xgboost's C library.
The computer I am running on has 64 GB of memory. Each time I run this code, I monitor how much memory it consumes, and the highest usage I've seen is 22.3 GB. I have tried reducing the amount of data I use down to 50%, 33%, 25%, 10%, 5% and 1%, to no avail.
Sometimes the code manages a few (2-4) iterations of Bayes optimization, but it always ends in the segmentation fault.
I have used mamba to set up my environment; these are the versions of each relevant package I am using:
- python3 = 3.11.4
- xgboost = 1.7.4
- scikit-learn = 1.3.0
- numpy = 1.25.0
- pandas = 2.0.3

The local machine I am using to run this code has Ubuntu 22.04 installed, 64 GB of memory, and an Intel 13th-gen processor.