dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Segmentation fault during xgb.cv regardless of data size #9369

Open VincentThiemann opened 1 year ago

VincentThiemann commented 1 year ago

I am trying to tune the hyperparameters of an XGBoost model using the bayesian-optimization library, and I consistently get a segmentation fault during XGBoost cross validation, regardless of how large or small my training data is.

I have a dataset with 118 features and about 1.7 million data points, which takes up about 5.6 GB. Whenever I run the following code, I always get a segmentation fault:

# Import and transform the data for Bayes-opt tuning

import os
import sys
# Append the current directory to the module search path
current_dir = os.getcwd()
sys.path.append(current_dir)
import faulthandler
import sklearn
from sklearn.preprocessing import MinMaxScaler
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'  # allow duplicate OpenMP runtimes to load
import xgboost as xgb
from bayes_opt import BayesianOptimization
import numpy as np
import pandas as pd

faulthandler.enable()

print("Importing data")
transformed_data = pd.read_csv(os.getcwd() + "//saved_dataframes/1000_v3")

# Shuffle the rows and reset the index.
transformed_data = transformed_data.sample(frac=1).reset_index(drop=True)

scaler = MinMaxScaler().set_output(transform="pandas")

# Split off the target column and scale the features; to_numpy() on a
# single-column frame already yields shape (n_samples, 1).
full_y = transformed_data[['target']].to_numpy()
transformed_data = transformed_data.drop('target', axis=1)
full_x = scaler.fit_transform(transformed_data)
del transformed_data

# Build a single DMatrix that every cross-validation call will share.
dtrain = xgb.DMatrix(full_x, label=full_y)
del full_x, full_y

# Define the objective function for Bayesian optimization
def xgb_cv(max_depth, learning_rate, subsample, colsample_bytree, colsample_bylevel, min_child_weight, max_delta_step, reg_lambda, reg_alpha, gamma, n_estimators):
    params = {'objective': 'multi:softprob',
              'num_class': 3,
              'tree_method': 'approx',
              'max_depth': int(max_depth),
              'learning_rate': learning_rate,
              'subsample': subsample,
              'colsample_bytree': colsample_bytree,
              'colsample_bylevel': colsample_bylevel,
              'min_child_weight': min_child_weight,
              'max_delta_step': int(max_delta_step),
              'reg_lambda': reg_lambda,
              'reg_alpha': reg_alpha,
              'gamma': gamma
        }
    cv_result = xgb.cv(params, dtrain, num_boost_round=int(n_estimators),
                       early_stopping_rounds=10, nfold=5, metrics='auc')
    # BayesianOptimization maximizes the objective, so return the AUC directly.
    return cv_result['test-auc-mean'].iloc[-1]

pbounds = { 'learning_rate': (0.001, 1.0),
            'min_child_weight': (0, 10),
            'max_depth': (3, 20),
            'max_delta_step': (0, 20),
            'subsample': (0.25, 1.0),
            'colsample_bytree': (0.1, 1.0),
            'colsample_bylevel': (0.1, 1.0),
            'reg_lambda': (0, 1000.0),
            'reg_alpha': (0, 1000.0),
            'gamma': (0, 20),
            'n_estimators': (50, 400) 
        }

# Create a BayesianOptimization object and run the optimization
print('Performing hyperparameter tuning using Bayesian optimization...')
optimizer = BayesianOptimization(f=xgb_cv, pbounds=pbounds, verbose=10)
optimizer.maximize(init_points=5, n_iter=300)

print(optimizer.max)

I've used faulthandler to trace back the last calls before the segmentation fault, and it outputs one of two traces:

Fatal Python error: Segmentation fault

Thread 0x00007f9cfecf6740 (most recent call first):
  File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/xgboost/core.py", line 1918 in update
  File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/xgboost/training.py", line 215 in update
  File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/xgboost/training.py", line 229 in update
  File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/xgboost/training.py", line 538 in cv
  File "/home/vincent/stock_algorithm.py", line 63 in xgb_cv
  File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/bayes_opt/target_space.py", line 236 in probe
  File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/bayes_opt/bayesian_optimization.py", line 208 in probe
  File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/bayes_opt/bayesian_optimization.py", line 310 in maximize
  File "/home/vincent/stock_algorithm.py", line 82 in <module>

or

Fatal Python error: Aborted

Thread 0x00007fb44c9e2740 (most recent call first):
  File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/xgboost/core.py", line 1989 in eval_set
  File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/xgboost/training.py", line 219 in eval
  File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/xgboost/training.py", line 233 in <listcomp>
  File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/xgboost/training.py", line 233 in eval
  File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/xgboost/callback.py", line 232 in after_iteration
  File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/xgboost/training.py", line 540 in cv
  File "/home/vincent/stock_algorithm.py", line 62 in xgb_cv
  File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/bayes_opt/target_space.py", line 236 in probe
  File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/bayes_opt/bayesian_optimization.py", line 208 in probe
  File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/bayes_opt/bayesian_optimization.py", line 310 in maximize
  File "/home/vincent/stock_algorithm.py", line 81 in <module>

For the first segmentation fault, line 1918 in core.py (the last traced call) calls XGBoosterUpdateOneIter in xgboost's C library.

For the second segmentation fault, line 1989 in core.py calls XGBoosterEvalOneIter in xgboost's C library.
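
For reference, both of those frames are thin ctypes bridges into the native library. A paraphrase of the update path from xgboost 1.7's core.py (abridged, not an exact copy; booster and dtrain stand for the Booster and DMatrix involved):

import ctypes
from xgboost.core import _LIB, _check_call  # xgboost internals, shown for illustration

# The Python layer only forwards raw handles to the C function named in
# the trace, so the fault itself occurs inside native code.
_check_call(
    _LIB.XGBoosterUpdateOneIter(
        booster.handle,          # BoosterHandle
        ctypes.c_int(0),         # boosting round index
        dtrain.handle,           # DMatrixHandle for the training data
    )
)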

The computer I am running on has 64 GB of memory. Each time I run this code, I monitor how much memory it consumes; the highest usage I've seen is 22.3 GB. I have tried reducing the amount of data I use down to 50%, 33%, 25%, 10%, 5%, and 1%, to no avail.

Sometimes the code manages a few (2-4) iterations of Bayesian optimization, but it always ends in a segmentation fault.

I have used mamba to set up my environment; these are the versions of the relevant packages:

- python = 3.11.4
- xgboost = 1.7.4
- scikit-learn = 1.3.0
- numpy = 1.25.0
- pandas = 2.0.3

The local machine I am using to run this code has Ubuntu 22.04 installed, 64 GB of memory, and a 13th-gen Intel processor.

trivialfis commented 1 year ago

Is the Bayes HPO framework using multiple threads? If so, concurrent access to the shared DMatrix object can be problematic.

VincentThiemann commented 1 year ago

Yes, it is using multiple threads. I can limit it to one thread to see if that stops the segmentation faults. How would multiple threads accessing the DMatrix object cause a segmentation fault? I have seen a variety of guides and documentation online using multi-threaded Bayesian search on XGBoost.

trivialfis commented 1 year ago

It's fine to use multiple threads, but only if you create the DMatrix within each thread. Normally people do this with scikit-learn estimators like XGBClassifier along with scikit-learn utilities like cross_val_score, roughly like the sketch below.
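
An untested sketch of that pattern (assuming full_x and full_y are the arrays from your script; most of your hyperparameters dropped for brevity):

from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def xgb_cv(max_depth, learning_rate, n_estimators, **kwargs):
    # A fresh estimator per call: each invocation builds its own internal
    # DMatrix, so nothing is shared between worker threads.
    clf = XGBClassifier(
        objective='multi:softprob',
        tree_method='approx',
        max_depth=int(max_depth),
        learning_rate=learning_rate,
        n_estimators=int(n_estimators),
    )
    # Multiclass AUC via one-vs-rest; cross_val_score manages the folds.
    scores = cross_val_score(
        clf, full_x, full_y.ravel(), cv=5, scoring='roc_auc_ovr'
    )
    return scores.mean()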

I will submit a patch to handle concurrent access, probably by simply prohibiting it.

VincentThiemann commented 1 year ago

I have replaced xgb.cv with XGBClassifier and cross_val_score, and I have also tried placing the DMatrix creation within the xgb_cv black-box function; however, I am still getting the same seg fault.

I have also attempted switching my Bayes opt library to scikit-optimize and limiting it to a single thread to no avail (same seg fault).

Also, sorry for the delay in responses, I work more than one job and don't always have access to the computer I'm doing this on.

trivialfis commented 1 year ago

Got it, looking into this. Thank you for sharing!

trivialfis commented 1 year ago

Hi, can I access the dataset (perhaps shared privately)? So far, using:

from sklearn.datasets import make_classification

# Synthetic stand-in for the reported dataset: same number of features,
# comparable number of rows.
n_classes = 3
X, y = make_classification(
    n_samples=int(1e6), n_features=118, n_classes=n_classes, n_informative=96
)

the script is running fine (still running). Alternatively, could you please help verify whether generating the dataset with the above snippet also segfaults on your system?

By the way, I fixed the random_state to 1994 for both data generation and Bayes optimization.
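
In other words, I'm assuming the wiring on your end would look something like this, replacing the CSV load (hypothetical, names matching your script):

import xgboost as xgb
from sklearn.datasets import make_classification

# Hypothetical drop-in for the CSV-loading section of the original script:
# generate a synthetic dataset of the same width and build the DMatrix.
X, y = make_classification(
    n_samples=int(1e6),
    n_features=118,
    n_classes=3,
    n_informative=96,
    random_state=1994,  # fixed seed, as mentioned above
)
dtrain = xgb.DMatrix(X, label=y)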

trivialfis commented 1 year ago

I'm stopping the run now. Haven't been able to reproduce it.

VincentThiemann commented 1 year ago

I will run the above snippet soon (likely in a few hours) and see whether it works. I have found that a simple GridSearchCV also returns the same seg fault, so the problem is not limited to Bayesian optimization.

Interestingly, it seems that xgb.cv and scikit-learn cross validation work for me when they are not used within some sort of hyperparameter tuning function (GridSearch, RandomSearch, Bayes, etc.), roughly like the standalone sketch below.
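
A minimal standalone version of the same call (sketched from my script above, assuming xgb and dtrain as defined there; not my exact code):

# Minimal standalone version of the cross-validation call, invoked once
# outside any hyperparameter-search loop; params trimmed to the basics.
params = {
    'objective': 'multi:softprob',
    'num_class': 3,
    'tree_method': 'approx',
}
cv_result = xgb.cv(params, dtrain, num_boost_round=100,
                   early_stopping_rounds=10, nfold=5, metrics='auc')
print(cv_result['test-auc-mean'].iloc[-1])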

I'll edit this comment with what happens when I run the snippet you sent, @trivialfis. I'll also add the broken GridSearchCV code I used and the code where cross validation does work.

trivialfis commented 1 year ago

Thank you for testing it. Please keep us posted; I will dig into it once I can reproduce the issue.