NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.

[BUG] fitting XGBoost killed by OOM error #1750

Closed bilzard closed 1 year ago

bilzard commented 1 year ago

Describe the bug

Training an XGBoost model on a dataset larger than system memory is killed by an OOM error.

Error message

MemoryError: std::bad_alloc: out_of_memory: RMM failure at:/project/include/rmm/mr/device/pool_memory_resource.hpp:193: Maximum pool size exceeded
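
This MemoryError is raised by RMM's pooled allocator: LocalCUDACluster's rmm_pool_size pre-allocates a fixed pool on each worker, and any request that would grow the pool past its cap fails even if host memory is still available. A minimal standalone sketch of that behavior (the pool sizes here are illustrative assumptions, not the values from this issue):

import rmm

rmm.reinitialize(
    pool_allocator=True,
    initial_pool_size=2 * 1024**3,  # pool starts at 2 GiB
    maximum_pool_size=4 * 1024**3,  # pool may grow to at most 4 GiB
)

# A single 5 GiB request exceeds the 4 GiB cap and raises
# "MemoryError: std::bad_alloc: ... Maximum pool size exceeded":
buf = rmm.DeviceBuffer(size=5 * 1024**3)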

Steps/Code to reproduce bug

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4289429 entries, 0 to 4289428
Columns: 112 entries, type to n_i_h23_s
dtypes: float32(107), int32(3), int8(1), uint8(1)
memory usage: 1.8 GB
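
That figure checks out against the dtype counts above (a back-of-the-envelope estimate):

rows = 4_289_429
bytes_per_row = 110 * 4 + 2  # 107 float32 + 3 int32 at 4 bytes each, plus one int8 and one uint8
print(rows * bytes_per_row / 1024**3)  # ~1.77 GiB, matching the reported 1.8 GB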

Code sample:

import warnings

import numba
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# The imports below are inferred for this snippet (assuming merlin-core /
# NVTabular 22.12-era module paths):
from merlin.core.utils import device_mem_size, pynvml_mem_size
from merlin.io import Dataset
from merlin.models.xgb import XGBoost
from nvtabular import Workflow

if numba.cuda.is_available():
    NUM_GPUS = list(range(len(numba.cuda.gpus)))
else:
    NUM_GPUS = []
visible_devices = ",".join(
    [str(n) for n in NUM_GPUS]
)  # Select devices on which to place workers
device_limit_frac = 0.5  # Spill GPU-Worker memory to host at this limit.
device_pool_frac = 0.6
part_mem_frac = 0.01

# Use total device size to calculate args.device_limit_frac
device_size = device_mem_size(kind="total")
device_limit = int(device_limit_frac * device_size)
device_pool_size = int(device_pool_frac * device_size)
part_size = int(part_mem_frac * device_size)

# Check if any device memory is already occupied
for dev in visible_devices.split(","):
    fmem = pynvml_mem_size(kind="free", index=int(dev))
    used = (device_size - fmem) / 1e9
    if used > 1.0:
        warnings.warn(f"BEWARE - {used} GB is already occupied on device {int(dev)}!")

train_ds = Dataset(
    [f"train_data/{i}.parquet" for i in range(2)], engine="parquet", part_size="128MB"
)
valid_ds = Dataset("valid_data/0.parquet", engine="parquet", part_size="128MB")

wf = Workflow(feature_cols + target + qid_column)
train_processed = wf.fit_transform(train_ds)
valid_processed = wf.transform(valid_ds)

ranker = XGBoost(
    train_processed.schema,
    objective="rank:pairwise",
    eval_metric="ndcg@20",
    booster="dart",
)
xgb_train_params = {
    "num_boost_round": 1,
}

with LocalCUDACluster(
    n_workers=1,  # Number of GPU workers
    device_memory_limit=device_limit,  # GPU->CPU spill threshold (~50% of GPU memory, per device_limit_frac)
    rmm_pool_size=(device_pool_size // 256) * 256,  # Memory pool size on each worker
    local_directory="/nvme/scratch/",  # Fast directory for disk spilling
) as cluster:
    client = Client(cluster)
    ranker.fit(
        train_processed,
        evals=[
            (valid_processed, "validation_set"),
        ],
        **xgb_train_params,
    )
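
One mitigation worth trying here (a sketch, not a confirmed fix): dask-cuda can back the RMM allocator with CUDA managed memory, which lets allocations oversubscribe the device and page between GPU and host instead of failing at the pool cap:

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

with LocalCUDACluster(
    n_workers=1,
    rmm_managed_memory=True,  # RMM allocates via cudaMallocManaged
    device_memory_limit=device_limit,
    local_directory="/nvme/scratch/",
) as cluster:
    client = Client(cluster)
    # ... ranker.fit(...) as above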

Expected behavior

Training completes without the Dask worker being killed by an OOM error; data that does not fit in GPU memory is spilled to host memory or disk.

Environment details (please complete the following information):

Additional context: I followed an official tutorial^1 and tried both of the changes below, neither of which solved the issue.

# 1) Rewrite the parquet files with a smaller row group size:
    train.to_parquet(
        train_root / f"{i}.parquet",
        index=False,
        engine="pyarrow",
        row_group_size=100_000,
    )
# 2) Train on a single parquet file instead of two:
train_ds = Dataset(
    [f"train_data/{i}.parquet" for i in range(1)], engine="parquet", part_size="128MB"
)
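
Another angle that might reduce device memory (a sketch under the xgboost 1.6/1.7-era Dask API, not a verified fix): build the training matrix as a DaskDeviceQuantileDMatrix, whose compressed quantile representation typically needs far less GPU memory than a full DMatrix.

import xgboost as xgb

# X_train / y_train are hypothetical dask_cudf collections, stand-ins for the
# processed dataset above:
dtrain = xgb.dask.DaskDeviceQuantileDMatrix(client, X_train, label=y_train)
output = xgb.dask.train(
    client,
    {"objective": "rank:pairwise", "tree_method": "gpu_hist"},
    dtrain,
    num_boost_round=1,
)
booster = output["booster"]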
rnyak commented 1 year ago

@bilzard which docker image are you using? You posted cudf == 22.12.0, but our latest docker images (e.g. merlin-tensorflow:22.12) should have cudf version 22.08.

Please check out the support matrix for the release versions: https://nvidia-merlin.github.io/Merlin/main/support_matrix/support_matrix_merlin_tensorflow.html
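
To double-check what is actually installed inside the container, a quick snippet like this prints the relevant versions:

import cudf, dask, dask_cuda, xgboost

print("cudf:", cudf.__version__)
print("dask:", dask.__version__)
print("dask_cuda:", dask_cuda.__version__)
print("xgboost:", xgboost.__version__)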

bilzard commented 1 year ago

@rnyak Thank you for sharing the link. I re-ran my script on both of these docker images, and the problem reproduced.

Error log (merlin-pytorch:22.12)

2023-01-20 09:41:37,916 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-01-20 09:41:37,916 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
/usr/local/lib/python3.8/dist-packages/merlin/schema/tags.py:148: UserWarning: Compound tags like Tags.USER_ID have been deprecated and will be removed in a future version. Please use the atomic versions of these tags, like [<Tags.USER: 'user'>, <Tags.ID: 'id'>].
  warnings.warn(
/usr/local/lib/python3.8/dist-packages/xgboost/dask.py:884: RuntimeWarning: coroutine 'Client._wait_for_workers' was never awaited
  client.wait_for_workers(n_workers)
/usr/local/lib/python3.8/dist-packages/distributed/worker_state_machine.py:3468: FutureWarning: The `Worker.nthreads` attribute has been moved to `Worker.state.nthreads`
  warnings.warn(
[09:42:08] task [xgboost.dask-0]:tcp://127.0.0.1:45449 got new rank 0
2023-01-20 09:42:14,881 - distributed.worker - WARNING - Compute Failed
Key:       dispatched_train-3fcc8297-d94c-4735-bde7-d827a023737c
Function:  dispatched_train
args:      ({'eval_metric': 'ndcg@20', 'objective': 'rank:pairwise'}, [b'DMLC_NUM_WORKER=1', b'DMLC_TRACKER_URI=172.18.0.2', b'DMLC_TRACKER_PORT=48471', b'DMLC_TASK_ID=[xgboost.dask-0]:tcp://127.0.0.1:45449'], 140281164174192, ['validation_set'], [140281164173904], {'feature_names': None, 'feature_types': None, 'feature_weights': None, 'missing': None, 'enable_categorical': False, 'parts': [{'data':        type  priority       wgt  ...     n_i_h22_s     n_i_h23_s  session
8         1        11  0.037885  ...  1.000000e-09  1.000000e-09       78
17        1         6  0.054992  ...  1.000000e-09  1.000000e-09       78
5         1        16  0.032280  ...  1.000000e-09  1.000000e-09       78
19        1        14  0.035469  ...  1.000000e-09  1.000000e-09       78
4         1        13  0.036726  ...  1.000000e-09  1.000000e-09       78
...     ...       ...       ...  ...           ...           ...      ...
74342     2       116  0.016393  ...  7.692308e-11  7.692308e-11    85679
74306     2     
kwargs:    {}
Exception: "XGBoostError('[09:42:09] ../src/c_api/../data/../common/device_helpers.cuh:428: Memory allocation error on worker 0: std::bad_alloc: cudaErrorMemoryAllocation: out of memory\\n- Free memory: 1238040576\\n- Requested memory: 4589462176\\n\\nStack trace:\\n  [bt] (0) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3a5799) [0x7f2b94c4f799]\\n  [bt] (1) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3a9bab) [0x7f2b94c53bab]\\n  [bt] (2) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x8dc49) [0x7f2b94937c49]\\n  [bt] (3) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3edbeb) [0x7f2b94c97beb]\\n  [bt] (4) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3ee759) [0x7f2b94c98759]\\n  [bt] (5) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3ee993) [0x7f2b94c98993]\\n  [bt] (6) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x45b7f9) [0x7f2b94d057f9]\\n  [bt] (7) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x45bec0) [0x7f2b94d05ec0]\\n  [bt] (8) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x421b40) [0x7f2b94ccbb40]\\n\\n')"

---------------------------------------------------------------------------
XGBoostError                              Traceback (most recent call last)
Cell In[7], line 8
      1 with LocalCUDACluster(
      2     n_workers=1,  # Number of GPU workers
      3     device_memory_limit=device_limit,  # GPU->CPU spill threshold (~75% of GPU memory)
      4     rmm_pool_size=(device_pool_size // 256) * 256,  # Memory pool size on each worker
      5     local_directory="/nvme/scratch/",  # Fast directory for disk spilling
      6 ) as cluster:
      7     client = Client(cluster)
----> 8     ranker.fit(
      9         train_processed,
     10         evals=[
     11             (valid_processed, "validation_set"),
     12         ],
     13         **xgb_train_params,
     14     )

File /usr/local/lib/python3.8/dist-packages/merlin/models/xgb/__init__.py:178, in XGBoost.fit(self, train, evals, use_quantile, **train_kwargs)
    175     d_eval = dmatrix_cls(self.dask_client, X, label=y, qid=qid)
    176     watchlist.append((d_eval, name))
--> 178 train_res = xgb.dask.train(
    179     self.dask_client, self.params, dtrain, evals=watchlist, **train_kwargs
    180 )
    181 self.booster: xgb.Booster = train_res["booster"]
    182 self.evals_result = train_res["history"]

File /usr/local/lib/python3.8/dist-packages/xgboost/core.py:575, in _deprecate_positional_args.<locals>.inner_f(*args, **kwargs)
    573 for k, arg in zip(sig.parameters, args):
    574     kwargs[k] = arg
--> 575 return f(**kwargs)

File /usr/local/lib/python3.8/dist-packages/xgboost/dask.py:1072, in train(client, params, dtrain, num_boost_round, evals, obj, feval, early_stopping_rounds, xgb_model, verbose_eval, callbacks, custom_metric)
   1070 client = _xgb_get_client(client)
   1071 args = locals()
-> 1072 return client.sync(
   1073     _train_async,
   1074     global_config=config.get_config(),
   1075     dconfig=_get_dask_config(),
   1076     **args,
   1077 )

File /usr/local/lib/python3.8/dist-packages/distributed/utils.py:338, in SyncMethodMixin.sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    336     return future
    337 else:
--> 338     return sync(
    339         self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    340     )

File /usr/local/lib/python3.8/dist-packages/distributed/utils.py:405, in sync(loop, func, callback_timeout, *args, **kwargs)
    403 if error:
    404     typ, exc, tb = error
--> 405     raise exc.with_traceback(tb)
    406 else:
    407     return result

File /usr/local/lib/python3.8/dist-packages/distributed/utils.py:378, in sync.<locals>.f()
    376         future = asyncio.wait_for(future, callback_timeout)
    377     future = asyncio.ensure_future(future)
--> 378     result = yield future
    379 except Exception:
    380     error = sys.exc_info()

File /usr/local/lib/python3.8/dist-packages/tornado/gen.py:762, in Runner.run(self)
    759 exc_info = None
    761 try:
--> 762     value = future.result()
    763 except Exception:
    764     exc_info = sys.exc_info()

File /usr/local/lib/python3.8/dist-packages/xgboost/dask.py:1010, in _train_async(client, global_config, dconfig, params, dtrain, num_boost_round, evals, obj, feval, early_stopping_rounds, verbose_eval, xgb_model, callbacks, custom_metric)
   1007     evals_name = []
   1008     evals_id = []
-> 1010 results = await map_worker_partitions(
   1011     client,
   1012     dispatched_train,
   1013     params,
   1014     _rabit_args,
   1015     id(dtrain),
   1016     evals_name,
   1017     evals_id,
   1018     *([dtrain] + evals_data),
   1019     workers=workers,
   1020 )
   1021 return list(filter(lambda ret: ret is not None, results))[0]

File /usr/local/lib/python3.8/dist-packages/xgboost/dask.py:570, in map_worker_partitions(client, func, workers, *refs)
    566     fut = client.submit(
    567         func, *args, pure=False, workers=[addr], allow_other_workers=False
    568     )
    569     futures.append(fut)
--> 570 results = await client.gather(futures)
    571 return results

File /usr/local/lib/python3.8/dist-packages/distributed/client.py:2038, in Client._gather(self, futures, errors, direct, local_worker)
   2036         exc = CancelledError(key)
   2037     else:
-> 2038         raise exception.with_traceback(traceback)
   2039     raise exc
   2040 if errors == "skip":

File /usr/local/lib/python3.8/dist-packages/xgboost/dask.py:974, in dispatched_train()
    972         evals.append((Xy, evals_name[i]))
    973         continue
--> 974     eval_Xy = _dmatrix_from_list_of_parts(**ref, nthread=n_threads)
    975     evals.append((eval_Xy, evals_name[i]))
    977 booster = worker_train(
    978     params=local_param,
    979     dtrain=Xy,
   (...)
    989     callbacks=callbacks,
    990 )

File /usr/local/lib/python3.8/dist-packages/xgboost/dask.py:845, in _dmatrix_from_list_of_parts()
    843 if is_quantile:
    844     return _create_device_quantile_dmatrix(**kwargs)
--> 845 return _create_dmatrix(**kwargs)

File /usr/local/lib/python3.8/dist-packages/xgboost/dask.py:828, in _create_dmatrix()
    825     v = concat_or_none(value)
    826     concated_dict[key] = v
--> 828 dmatrix = DMatrix(
    829     **concated_dict,
    830     missing=missing,
    831     feature_names=feature_names,
    832     feature_types=feature_types,
    833     nthread=nthread,
    834     enable_categorical=enable_categorical,
    835     feature_weights=feature_weights,
    836 )
    837 return dmatrix

File /usr/local/lib/python3.8/dist-packages/xgboost/core.py:575, in inner_f()
    573 for k, arg in zip(sig.parameters, args):
    574     kwargs[k] = arg
--> 575 return f(**kwargs)

File /usr/local/lib/python3.8/dist-packages/xgboost/core.py:686, in __init__()
    683     assert self.handle is not None
    684     return
--> 686 handle, feature_names, feature_types = dispatch_data_backend(
    687     data,
    688     missing=self.missing,
    689     threads=self.nthread,
    690     feature_names=feature_names,
    691     feature_types=feature_types,
    692     enable_categorical=enable_categorical,
    693 )
    694 assert handle is not None
    695 self.handle = handle

File /usr/local/lib/python3.8/dist-packages/xgboost/data.py:896, in dispatch_data_backend()
    892     return _from_pandas_series(
    893         data, missing, threads, enable_categorical, feature_names, feature_types
    894     )
    895 if _is_cudf_df(data) or _is_cudf_ser(data):
--> 896     return _from_cudf_df(
    897         data, missing, threads, feature_names, feature_types, enable_categorical
    898     )
    899 if _is_cupy_array(data):
    900     return _from_cupy_array(data, missing, threads, feature_names,
    901                             feature_types)

File /usr/local/lib/python3.8/dist-packages/xgboost/data.py:691, in _from_cudf_df()
    689 handle = ctypes.c_void_p()
    690 config = bytes(json.dumps({"missing": missing, "nthread": nthread}), "utf-8")
--> 691 _check_call(
    692     _LIB.XGDMatrixCreateFromCudaColumnar(
    693         interfaces_str,
    694         config,
    695         ctypes.byref(handle),
    696     )
    697 )
    698 return handle, feature_names, feature_types

File /usr/local/lib/python3.8/dist-packages/xgboost/core.py:246, in _check_call()
    235 """Check the return value of C API call
    236 
    237 This function will raise exception when error occurs.
   (...)
    243     return value from API calls
    244 """
    245 if ret != 0:
--> 246     raise XGBoostError(py_str(_LIB.XGBGetLastError()))

XGBoostError: [09:42:09] ../src/c_api/../data/../common/device_helpers.cuh:428: Memory allocation error on worker 0: std::bad_alloc: cudaErrorMemoryAllocation: out of memory
- Free memory: 1238040576
- Requested memory: 4589462176

Stack trace:
  [bt] (0) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3a5799) [0x7f2b94c4f799]
  [bt] (1) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3a9bab) [0x7f2b94c53bab]
  [bt] (2) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x8dc49) [0x7f2b94937c49]
  [bt] (3) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3edbeb) [0x7f2b94c97beb]
  [bt] (4) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3ee759) [0x7f2b94c98759]
  [bt] (5) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x3ee993) [0x7f2b94c98993]
  [bt] (6) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x45b7f9) [0x7f2b94d057f9]
  [bt] (7) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x45bec0) [0x7f2b94d05ec0]
  [bt] (8) /usr/local/lib/python3.8/dist-packages/xgboost/lib/libxgboost.so(+0x421b40) [0x7f2b94ccbb40]

2023-01-20 09:42:45,541 - distributed.client - ERROR - Failed to reconnect to scheduler after 30.00 seconds, closing client
bilzard commented 1 year ago

Here is a visualization from nvitop. In the beginning, GPU memory consumption is quite healthy (~65%); however, the process suddenly requested GPU memory close to 100% and died.

[Screenshot: nvitop GPU memory timeline, 2023-01-20 18:42]
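
For reference, the same utilization numbers can be polled programmatically with pynvml (a minimal sketch, independent of nvitop):

import time

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
for _ in range(60):  # sample once per second for a minute
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory used: {mem.used / mem.total:.0%}")
    time.sleep(1)
pynvml.nvmlShutdown()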

bilzard commented 1 year ago

I believe this issue should be filed in the NVIDIA-Merlin/models repo instead, so I am closing this issue: https://github.com/NVIDIA-Merlin/models