NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data, designed to quickly and easily manipulate terabyte-scale datasets used to train deep-learning-based recommender systems.
Apache License 2.0

[QST] Additional GPU mem reservation when creating a `Dataset` causes OOM when allocating all GPU mem to the LocalCUDACluster #1863

Open piojanu opened 1 year ago

piojanu commented 1 year ago

Hi!

I've run into the following problem:

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster(
    n_workers=1,                 # Number of GPU workers
    device_memory_limit="12GB",  # GPU->CPU spill threshold (~75% of GPU memory)
    rmm_pool_size="16GB",        # Memory pool size on each worker
)
client = Client(cluster)

# NOTE: Importing Merlin before cluster creation will ALSO create this additional reservation on GPU
from merlin.core.utils import set_dask_client

set_dask_client(client)

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": np.random.randint(0, 100, 1000),
    "b": np.random.randint(0, 100, 1000),
    "c": np.random.randint(0, 100, 1000),
})

import nvtabular as nvt

# No matter if I specify the `client` or not, an additional reservation is created on the GPU,
# which causes "cudaErrorMemoryAllocation out of memory" here.
ds = nvt.Dataset(df, client=client)
```

I run this code in JupyterLab on a GCP VM with an NVIDIA V100 16GB GPU. I've also tried `nvtabular.utils.set_dask_client`, but it didn't solve the problem.

Questions:

rnyak commented 1 year ago

@piojanu What helps with OOM issues in NVT is the `part_size` and the row-group memory size of your parquet file(s). You can also repartition your dataset and save it back to disk, which might help with the OOM. For the `LocalCUDACluster` args, you can read here: https://docs.rapids.ai/api/dask-cuda/nightly/api/

If you have a single GPU, you can try setting the row-group size of your files; that should help even without `LocalCUDACluster`.

There is a `LocalCUDACluster` example here: https://github.com/NVIDIA-Merlin/Merlin/blob/main/examples/quick_start/scripts/preproc/preprocessing.py

piojanu commented 1 year ago

Hi!

I have follow-up questions:

Thanks for help :)

piojanu commented 1 year ago

By accident, I've found out that `merlin.io.dataset.Dataset.shuffle_by_keys` is the root cause of this OOM.