pangjac opened 10 months ago
Hi, are you using `DaskQuantileDMatrix`?
Hi, no. I use a regular `DaskDMatrix`. The code that constructs `dtrain` and `dvalid` is below:
```python
import dask.array as da
import dask.dataframe as dd

# tra and itv are regular pandas DataFrames
# `feature_list` is a list of feature-name strings
# `dependent` is the name of the target variable
tra = tra.fillna(-999999999)
itv = itv.fillna(-999999999)

X_train = tra[feature_list]
y_train = tra[dependent]
X_test = itv[feature_list]
y_test = itv[dependent]

# construct dask dataframes from the regular pandas DataFrames
X_train_dd = dd.from_pandas(X_train, npartitions=8)  # npartitions=8 to match the 8 GPUs
y_train_dd = dd.from_pandas(y_train, npartitions=8)
X_test_dd = dd.from_pandas(X_test, npartitions=8)
y_test_dd = dd.from_pandas(y_test, npartitions=8)

# construct the DaskDMatrix objects
import xgboost as xgb

dtrain = xgb.dask.DaskDMatrix(client, data=X_train_dd, label=y_train_dd, missing=-999999999)
dvalid = xgb.dask.DaskDMatrix(client, data=X_test_dd, label=y_test_dd, missing=-999999999)
```
Memory usage issues can happen when there are too many experiments running concurrently. Let's see if we can mitigate it somehow.
> `tra = tra.fillna(-999999999)`
I'm not sure why this is necessary. XGBoost can handle NA natively; the `fillna` likely just created an intermediate copy of the data.
> `X_train_dd = dd.from_pandas(X_train, npartitions=8) #set npartitions=8 because GPU 8`
Feel free to use more partitions for scheduling granularity. Dask can't perform at its best when the number of partitions is exactly the number of GPUs: every operation is applied to an entire partition at once, which can be memory-hungry.
> `X_train_dd = dd.from_pandas(X_train, npartitions=8) #set npartitions=8 because GPU 8`
Also, it's usually considered best practice to avoid `from_pandas`; prefer using dask from the beginning, which avoids concentrating the data in one process and the large transfer that follows.
> `dtrain = xgb.dask.DaskDMatrix(client, data=X_train_dd, label=y_train_dd, missing=-999999999)`
Consider using `DaskQuantileDMatrix`, which saves a significant amount of memory when you are using the `hist` tree method.
I am using XGBoost Dask to train a regression model, and Optuna to find the best parameters. Once the function `objective` is defined, below is a typical Optuna tuning structure. I noticed that the max possible value of `n_trials` is 240: once Optuna reaches the 240th trial, I get a `cudaErrorMemoryAllocation` error.

[Training Env] I am using an AWS EC2 `g5.48xlarge` instance, which is a multi-GPU machine (8 GPUs, 192 GB GPU memory, 192 vCPUs, 768 GB host memory). When setting up the dask client, I have

A full error log thrown from the jupyter console is below.