ashleve / lightning-hydra-template

PyTorch Lightning + Hydra. A very user-friendly template for ML experimentation. ⚡🔥⚡

How to skip the batch when OOM happens? #652

Open Xinheng-He opened 1 week ago

Xinheng-He commented 1 week ago

Hi developers:

lightning-hydra is a really cool tool and I like it! However, my batches contain graphs of very different sizes, and this sometimes causes OOM (out of GPU memory) errors. Previously I would manually skip such a batch, but with lightning-hydra this seems hard to do. Is it possible to add this in a future version, or how can I skip a batch when it triggers an OOM on the GPU?

Xinheng

Xinheng-He commented 1 week ago

[screenshot of the modified training_step] I made it work by adding a check in `training_step` like this. However, when I run the code on multiple GPUs, training stops as soon as `return None` is hit (whether or not I clear the cache first), so I think this trick only works for single-GPU training. I hope it helps others.
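For reference, here is a minimal sketch of this kind of OOM guard inside `training_step`. The class name and the `_compute_loss` helper are placeholders, not part of the template; only the try/except pattern matters:

```python
import torch
from lightning import LightningModule


class GraphLitModule(LightningModule):
    """Hypothetical module; only training_step is shown."""

    def training_step(self, batch, batch_idx):
        try:
            loss = self._compute_loss(batch)  # placeholder for the usual forward + loss
            return loss
        except RuntimeError as e:
            # Re-raise anything that is not a CUDA OOM error.
            if "out of memory" not in str(e):
                raise
            # Release the cached blocks left behind by the failed forward pass,
            # then skip this batch: returning None tells Lightning to skip the
            # backward pass and optimizer step for it.
            torch.cuda.empty_cache()
            self.print(f"Skipping batch {batch_idx} after CUDA OOM")
            return None
```

As noted above, this only behaves well on a single device: under DDP every rank has to take part in gradient synchronization, so if one rank returns None while the others run backward, the processes can end up waiting on each other, which matches the hang described in the comment.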