NVIDIA-Merlin / Merlin

NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.

[QST] An odd, sudden OOM using NVT Dataloaders with TorchRec #842

Open shoyasaxa opened 1 year ago

shoyasaxa commented 1 year ago

❓ Questions & Help

Hello, I was just playing around with using NVT dataloaders with TorchRec, and it was working fine for the most part. However, when I tried batch inference on a large dataset, I ran into a peculiar bug: the script would run perfectly fine for about an hour with stable GPU memory usage (around 94% on the first GPU), then at some random point the memory on that first GPU (out of the four V100s I used) would start to creep up towards 100% and quickly OOM. Weirdly, I am no longer able to reproduce this issue, but I was wondering if anyone had ideas on why that could happen.
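For anyone chasing something similar, here is a rough sketch of how per-GPU memory could be logged during the inference loop to catch where the creep starts (uses pynvml; the loop structure and logging interval below are placeholders, not part of my actual script):

```python
# Rough sketch: periodically log used/total memory on every visible GPU
# while batch inference runs. The inference loop itself is a placeholder.
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(pynvml.nvmlDeviceGetCount())]

def log_gpu_memory(step):
    for gpu_id, handle in enumerate(handles):
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"step {step} gpu {gpu_id}: {info.used / 1e9:.1f} / {info.total / 1e9:.1f} GB used")

# Hypothetical inference loop:
# for step, batch in enumerate(dataloader):
#     model(batch)
#     if step % 100 == 0:
#         log_gpu_memory(step)
```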

One possible explanation that @rnyak suggested is that the data partitions are not evenly split, and one of the files happens to have bigger partitions than the others, so when that file is loaded the GPU memory usage shoots up.
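A quick way to test that hypothesis is to look at the per-partition row counts of the preprocessed output, roughly like this (a sketch; the glob path is a placeholder for the actual output directory):

```python
# Sketch: check whether some partitions of the preprocessed parquet output
# are much larger than others. The path below is a placeholder.
import nvtabular as nvt

ds = nvt.Dataset("/path/to/preprocessed/*.parquet")
ddf = ds.to_ddf()

rows_per_partition = ddf.map_partitions(len).compute()
print("partitions:", ddf.npartitions)
print("largest partition:", rows_per_partition.max(), "rows")
print("smallest partition:", rows_per_partition.min(), "rows")
```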

Also, I am using NVTabular to preprocess the data. One feature request I have is for NVTabular to output an optimal number of files during preprocessing (currently, if I use 4 GPUs to preprocess a humongous dataset without setting the out_files_per_proc parameter, it spits out 4 humongous files).
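In the meantime, the number of output files can be controlled explicitly when writing the transformed dataset; a minimal sketch (the paths, column name, and out_files_per_proc value are placeholders, not my real pipeline):

```python
# Sketch: control how many files the preprocessing step writes out.
# Paths, the column name, and out_files_per_proc are placeholder values.
import nvtabular as nvt

dataset = nvt.Dataset("/path/to/raw/*.parquet")
features = ["col_a"] >> nvt.ops.Categorify()  # placeholder op/column
workflow = nvt.Workflow(features)

workflow.fit(dataset)
workflow.transform(dataset).to_parquet(
    "/path/to/preprocessed",
    out_files_per_proc=8,  # e.g. 4 workers x 8 = 32 output files instead of 4
)
```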

rnyak commented 1 year ago

@shoyasaxa thanks for creating the ticket.

Just to clarify: I thought you were doing batch inference on multiple GPUs, not on a single GPU? Can you please confirm/clarify that?

My suggestion was specifically for the multi-GPU training case, meaning that if you train your model with multiple GPUs, we expect the number of partitions per parquet file to be divisible by the number of GPUs. That means if you are using 4 GPUs at the same time for model training (or inference) via torch.nn.parallel or torch.distributed, your parquet files should have 4, 8, 12, 16, ... partitions so that they can be evenly distributed over the GPUs.
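A quick way to sanity-check that constraint is to count the partitions of each parquet file and verify they divide evenly by the GPU count, for example (a sketch; the path and NUM_GPUS are placeholders):

```python
# Sketch: verify each parquet file's partition count is divisible by the
# number of GPUs used for training/inference. Path and NUM_GPUS are placeholders.
import glob
import nvtabular as nvt

NUM_GPUS = 4

for path in sorted(glob.glob("/path/to/preprocessed/*.parquet")):
    n_parts = nvt.Dataset(path).to_ddf().npartitions
    status = "OK" if n_parts % NUM_GPUS == 0 else "NOT divisible"
    print(f"{path}: {n_parts} partitions -> {status}")
```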

shoyasaxa commented 1 year ago

Yes - this is doing batch inference on multiple GPUs (one instance with 4 V100 GPUs).

And yes - I also do the preprocessing using 4 GPUs, so the number of files outputted is a multiple of 4 as well.

rnyak commented 1 year ago

> (currently, if I use 4 GPUs to preprocess a humongous dataset without setting the out_files_per_proc parameter, it spits out 4 humongous files)

We have a WIP PR that will hopefully address this request.