NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
Apache License 2.0

[BUG] NVTabular runs into OOM or dies when scaling to large dataset #1683

Open bschifferer opened 2 years ago

bschifferer commented 2 years ago

Describe the bug
I tried multiple workflows and ran into different issues when running NVTabular workflows on large datasets in a multi-GPU setup.

Error 1: Workers die one after another.

Characteristics:

Error 2: Runs into OOM.

Workflow:

import nvtabular as nvt

# Encode the combined (col1, col2) pair as a single categorical feature
features1 = (
    [['col1', 'col2']] >>
    nvt.ops.Categorify()
)

# Hash col3 into 10M buckets instead of building a full vocabulary
features2 = (
    ['col3'] >>
    nvt.ops.Categorify(
        num_buckets=10_000_000
    )
)

targets = ['target1', 'target2']
features = features1 + features2 + targets
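For context, `Categorify(num_buckets=...)` hashes values into a fixed number of buckets rather than computing a full uniques table. A minimal pure-Python sketch of the idea (the hash function here is a stand-in for illustration, not NVTabular's actual implementation):

```python
import hashlib

def hash_bucket(value, num_buckets):
    """Map a value into one of `num_buckets` buckets via a stable hash,
    so no vocabulary of unique values ever has to be materialized."""
    digest = hashlib.md5(str(value).encode()).hexdigest()
    return int(digest, 16) % num_buckets

# The same raw value always lands in the same bucket
codes = [hash_bucket(v, 10) for v in ["user_1", "user_2", "user_1"]]
```

With hashing, the memory cost of encoding col3 is independent of its cardinality; the trade-off is possible bucket collisions between distinct values.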

Characteristics:

MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /usr/include/rmm/mr/device/cuda_memory_resource.hpp:70: cudaErrorMemoryAllocation out of memory
viswa-nvidia commented 1 year ago

@benfred , please check with @bschifferer on this

EvenOldridge commented 1 year ago

@rjzamora Any idea what could be happening here? I know you've been putting in some work on Categorify. I think this is happening during the compute of all uniques, which we may want to allow as an input into the op since it's a relatively straightforward piece of information to pull from a data lake.
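Feeding precomputed uniques into the op could look something like this chunk-wise scan over a data lake (a minimal pandas sketch with placeholder column names and toy chunks; not an NVTabular API):

```python
import pandas as pd

def uniques_from_chunks(chunks, column):
    """Accumulate the unique values of `column` across an iterable of
    DataFrame chunks, keeping only the running set of uniques in memory."""
    seen = None
    for chunk in chunks:
        part = chunk[column].drop_duplicates()
        seen = part if seen is None else pd.concat([seen, part]).drop_duplicates()
    if seen is None:
        return pd.Series(dtype="object")
    return seen.reset_index(drop=True)

# Toy stand-in for per-file reads from a data lake
chunks = [
    pd.DataFrame({"col1": ["a", "b", "a"]}),
    pd.DataFrame({"col1": ["b", "c", "c"]}),
]
vocab = uniques_from_chunks(chunks, "col1")
```

A vocabulary produced this way (or pulled from data-lake metadata) could then be handed to the op so the fit never has to scan the full dataset itself.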

rjzamora commented 1 year ago

Any idea what could be happening here?

I suppose there are many possibilities, depending on whether the failure happens in the fit or the transform. For example, https://github.com/NVIDIA-Merlin/NVTabular/pull/1692 explains two reasons why the fit could be a problem with the current implementation: the lack of a "proper" tree reduction, and the requirement to write all uniques for a given column to disk at once.
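To illustrate the tree-reduction point: a flat reduction concatenates every partition's uniques in one step, so peak memory scales with the number of partitions, while a tree reduction merges results pairwise, level by level. A minimal sketch with plain Python sets standing in for per-partition uniques:

```python
def tree_reduce_uniques(parts):
    """Merge per-partition unique sets pairwise, level by level, so no
    single merge ever touches more than two intermediate results."""
    parts = [set(p) for p in parts]
    while len(parts) > 1:
        merged = []
        for i in range(0, len(parts), 2):
            pair = parts[i:i + 2]          # at most two sets per merge
            merged.append(set().union(*pair))
        parts = merged
    return parts[0] if parts else set()

result = tree_reduce_uniques([["a", "b"], ["b", "c"], ["c", "d"]])
```

The final result is identical to a flat reduction; only the peak size of any single merge changes, which is what matters for GPU memory.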

rjzamora commented 1 year ago

@bschifferer - I'd like to explore if https://github.com/NVIDIA-Merlin/NVTabular/pull/1692 (or some variation of it) can help with this. Can you share details about the system you are running on and a representative/toy dataset where you are seeing issues? (feel free to contact me offline about the dataset)

viswa-nvidia commented 1 year ago

@bschifferer, please update the status of this ticket. Are we working on this dataset now?