NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
Apache License 2.0

[BUG] NVTabular runs into OOM or dies when scaling to large dataset #1683

Open bschifferer opened 2 years ago

bschifferer commented 2 years ago

Describe the bug
I tried multiple workflows and ran into different issues when running NVTabular workflows on large datasets in a multi-GPU setup.

Error 1: Workers die one after another.

Characteristics:

Error 2: Runs into OOM.

Workflow:

import nvtabular as nvt

# Encode the combined (col1, col2) pair as a single categorical feature
features1 = (
    [['col1', 'col2']] >>
    nvt.ops.Categorify()
)

# Hash col3 into 10M buckets instead of building a full vocabulary
features2 = (
    ['col3'] >>
    nvt.ops.Categorify(
        num_buckets=10_000_000
    )
)

targets = ['target1', 'target2']
features = features1 + features2 + targets
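For context, `Categorify(num_buckets=...)` hashes values into a fixed number of buckets rather than computing a full uniques table. A minimal pure-Python sketch of the idea (the hash function here is a stand-in for illustration, not NVTabular's actual implementation):

```python
import hashlib

def hash_bucket(value, num_buckets):
    """Map a value into one of `num_buckets` buckets via a stable hash,
    so no vocabulary of unique values ever has to be materialized."""
    digest = hashlib.md5(str(value).encode()).hexdigest()
    return int(digest, 16) % num_buckets

# The same raw value always lands in the same bucket
codes = [hash_bucket(v, 10) for v in ["user_1", "user_2", "user_1"]]
```

With hashing, the memory cost of encoding col3 is independent of its cardinality; the trade-off is possible bucket collisions between distinct values.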

Characteristics:

MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /usr/include/rmm/mr/device/cuda_memory_resource.hpp:70: cudaErrorMemoryAllocation out of memory
viswa-nvidia commented 1 year ago

@benfred , please check with @bschifferer on this

EvenOldridge commented 1 year ago

@rjzamora Any idea what could be happening here? I know you've been putting in some work on Categorify. I think this is happening during the compute of all uniques, which we may want to allow as an input into the op since it's a relatively straightforward piece of information to pull from a data lake.
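Feeding precomputed uniques into the op could look something like this chunk-wise scan over a data lake (a minimal pandas sketch with placeholder column names and toy chunks; not an NVTabular API):

```python
import pandas as pd

def uniques_from_chunks(chunks, column):
    """Accumulate the unique values of `column` across an iterable of
    DataFrame chunks, keeping only the running set of uniques in memory."""
    seen = None
    for chunk in chunks:
        part = chunk[column].drop_duplicates()
        seen = part if seen is None else pd.concat([seen, part]).drop_duplicates()
    if seen is None:
        return pd.Series(dtype="object")
    return seen.reset_index(drop=True)

# Toy stand-in for per-file reads from a data lake
chunks = [
    pd.DataFrame({"col1": ["a", "b", "a"]}),
    pd.DataFrame({"col1": ["b", "c", "c"]}),
]
vocab = uniques_from_chunks(chunks, "col1")
```

A vocabulary produced this way (or pulled from data-lake metadata) could then be handed to the op so the fit never has to scan the full dataset itself.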

rjzamora commented 1 year ago

Any idea what could be happening here?

I suppose there are many possibilities, depending on whether the failure happens in the fit or the transform. For example, https://github.com/NVIDIA-Merlin/NVTabular/pull/1692 explains two reasons why the fit could be a problem with the current implementation: the lack of a "proper" tree reduction, and the requirement to write all uniques for a given column to disk at once.
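To illustrate the tree-reduction point: a flat reduction concatenates every partition's uniques in one step, so peak memory scales with the number of partitions, while a tree reduction merges results pairwise, level by level. A minimal sketch with plain Python sets standing in for per-partition uniques:

```python
def tree_reduce_uniques(parts):
    """Merge per-partition unique sets pairwise, level by level, so no
    single merge ever touches more than two intermediate results."""
    parts = [set(p) for p in parts]
    while len(parts) > 1:
        merged = []
        for i in range(0, len(parts), 2):
            pair = parts[i:i + 2]          # at most two sets per merge
            merged.append(set().union(*pair))
        parts = merged
    return parts[0] if parts else set()

result = tree_reduce_uniques([["a", "b"], ["b", "c"], ["c", "d"]])
```

The final result is identical to a flat reduction; only the peak size of any single merge changes, which is what matters for GPU memory.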

rjzamora commented 1 year ago

@bschifferer - I'd like to explore if https://github.com/NVIDIA-Merlin/NVTabular/pull/1692 (or some variation of it) can help with this. Can you share details about the system you are running on and a representative/toy dataset where you are seeing issues? (feel free to contact me offline about the dataset)

viswa-nvidia commented 1 year ago

@bschifferer, please update the status of this ticket. Are we working on this dataset now?