Open bschifferer opened 2 years ago
@benfred, please check with @bschifferer on this.
@rjzamora Any idea what could be happening here? I know you've been putting in some work on Categorify. I think this is happening while computing all the unique values. We may want to allow precomputed uniques as an input to the op, since that is relatively straightforward information to pull from a data lake.
Any idea what could be happening here?
I suppose there are many possibilities, depending on whether the failure happens in the fit or the transform. For example, https://github.com/NVIDIA-Merlin/NVTabular/pull/1692 explains two reasons why the fit could be a problem with the current implementation (lack of a "proper" tree reduction, and the requirement to write all uniques for a given column to disk at once).
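To illustrate what a "proper" tree reduction means here: per-partition unique sets are merged in small groups over several rounds, so no single task has to combine every partition at once. The snippet below is only a pure-Python sketch of the idea, not the code in NVTabular or in the linked PR; `tree_reduce`, `merge_uniques`, and `split_every` are hypothetical names chosen for this example.

```python
from functools import reduce


def merge_uniques(a, b):
    """Combine two partial sets of unique values."""
    return a | b


def tree_reduce(parts, combine, split_every=2):
    """Reduce `parts` in rounds, combining at most `split_every` items per step."""
    if split_every < 2:
        raise ValueError("split_every must be at least 2")
    level = list(parts)
    if not level:
        raise ValueError("nothing to reduce")
    while len(level) > 1:
        # Each round shrinks the number of partial results by ~split_every.
        level = [
            reduce(combine, level[i:i + split_every])
            for i in range(0, len(level), split_every)
        ]
    return level[0]


# Toy "partitions" of one categorical column (illustrative data only).
partitions = [[1, 2, 2], [2, 3], [5, 1], [7, 7, 3], [8]]

# Step 1: uniques within each partition (done independently per worker).
partial = [set(p) for p in partitions]

# Step 2: merge the partial results as a tree instead of all at once.
uniques = sorted(tree_reduce(partial, merge_uniques))
print(uniques)
```

In a distributed setting each `combine` call would run as its own task, so peak memory per worker depends on the size of a few partial results rather than on all partitions together.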
@bschifferer - I'd like to explore if https://github.com/NVIDIA-Merlin/NVTabular/pull/1692 (or some variation of it) can help with this. Can you share details about the system you are running on and a representative/toy dataset where you are seeing issues? (feel free to contact me offline about the dataset)
@bschifferer, please update the status of this ticket. Are we working on this dataset now?
Describe the bug: I tried multiple workflows and ran into different issues when running NVTabular workflows on large datasets in a multi-GPU setup.
Error 1: Workers die one after another. Characteristics:
Error 2: Run into OOM. Workflow:
Characteristics: