NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data, designed to quickly and easily manipulate terabyte-scale datasets used to train deep-learning-based recommender systems.
Apache License 2.0

[QST] Additional GPU mem reservation when creating a `Dataset` causes OOM when allocating all GPU mem to the LocalCUDACluster #1863

Open piojanu opened 1 year ago

piojanu commented 1 year ago

Hi!

I've run into the following problem:

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster(
    n_workers=1,                 # Number of GPU workers
    device_memory_limit="12GB",  # GPU->CPU spill threshold (~75% of GPU memory)
    rmm_pool_size="16GB",        # Memory pool size on each worker
)
client = Client(cluster)

# NOTE: Importing Merlin before cluster creation will ALSO create this additional reservation on GPU
from merlin.core.utils import set_dask_client

set_dask_client(client)

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": np.random.randint(0, 100, 1000),
    "b": np.random.randint(0, 100, 1000),
    "c": np.random.randint(0, 100, 1000),
})

import nvtabular as nvt

# No matter if I specify the `client` or not, an additional reservation is created on the GPU,
# which causes "cudaErrorMemoryAllocation out of memory" here.
ds = nvt.Dataset(df, client=client)
```

I run this code in JupyterLab on a GCP VM with an NVIDIA V100 16GB GPU. I've also tried `nvtabular.utils.set_dask_client`, but it didn't solve the problem.

Questions:

rnyak commented 1 year ago

@piojanu What helps with OOM issues in NVT is the `part_size` and the row-group memory size of your parquet file(s). You can also repartition your dataset and save it back to disk, which might help with the OOM. For the `LocalCUDACluster` args, you can read here: https://docs.rapids.ai/api/dask-cuda/nightly/api/

If you have a single GPU, you can try setting the row-group size of your files; that should help even without `LocalCUDACluster`.

There is a `LocalCUDACluster` example here: https://github.com/NVIDIA-Merlin/Merlin/blob/main/examples/quick_start/scripts/preproc/preprocessing.py

piojanu commented 1 year ago

Hi!

I have follow-up questions:

Thanks for help :)

piojanu commented 1 year ago

By accident, I've found out that `merlin.io.dataset.Dataset.shuffle_by_keys` is the root cause of this OOM.