How can we optimize GPU usage or transition from data parallel to model parallel?

fatbrowncown commented 5 months ago

Hi there,

I just wanted to express my appreciation for your incredible work. As I dive deeper into this field, I've run into the limits of my hardware: two T4 GPUs, each with 16 GB of VRAM. Currently, I'm trying to train a demo interpolation model using the first 100 videos of the WebVid-10M-motion dataset. To reduce memory usage during training, I made several adjustments:

Set accumulate_grad_batches to 8
Configured validation to occur every 10 epochs (check_val_every_n_epoch)
Implemented the DeepSpeed Stage 2 strategy
Changed the optimizers from Adam and AdamW to Adam8bit and AdamW8bit
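
For context, here is a rough back-of-the-envelope sketch of what these settings can and cannot save. It follows the usual ZeRO accounting for mixed-precision training (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, plus optimizer states) and assumes Stage 2 partitions gradients and optimizer states across GPUs while replicating the weights. The parameter count and per-GPU batch size are hypothetical placeholders, not values from this repository, and the 8-bit figure assumes fp32 master weights are still kept. Activations are not counted at all, and they are often what triggers OOM on video models.

```python
# Rough per-GPU estimate of "model state" memory (weights, gradients,
# optimizer states) under ZeRO Stage 2. Activation memory is NOT included.

def model_state_bytes_per_gpu(num_params, num_gpus, optimizer_state_bytes):
    """Approximate bytes per GPU for mixed-precision training with ZeRO Stage 2.

    fp16 weights (2 bytes/param) are replicated on every GPU, while fp16
    gradients (2 bytes/param) and optimizer states are split across GPUs.
    """
    replicated = 2 * num_params
    partitioned = (2 + optimizer_state_bytes) * num_params / num_gpus
    return replicated + partitioned


def effective_batch_size(per_gpu_batch, num_gpus, accumulate_grad_batches):
    """Samples contributing to one optimizer step under data parallelism."""
    return per_gpu_batch * num_gpus * accumulate_grad_batches


NUM_PARAMS = 1_500_000_000  # hypothetical parameter count, for illustration only
NUM_GPUS = 2                # two T4 GPUs, as in the post above

# 32-bit Adam: fp32 master weights + momentum + variance = 4 + 4 + 4 bytes/param
ADAM_32BIT_BYTES = 12
# 8-bit Adam (assumption): fp32 master weights + two 1-byte states = 4 + 1 + 1
ADAM_8BIT_BYTES = 6

GIB = 1024 ** 3
for name, state_bytes in [("Adam 32-bit", ADAM_32BIT_BYTES),
                          ("Adam 8-bit", ADAM_8BIT_BYTES)]:
    total = model_state_bytes_per_gpu(NUM_PARAMS, NUM_GPUS, state_bytes)
    print(f"{name}: ~{total / GIB:.1f} GiB of model states per GPU")

# Gradient accumulation raises the effective batch size without raising peak
# memory; it only helps if the per-GPU micro-batch (hypothetical value of 1
# here) is made smaller in exchange.
print("effective batch size:", effective_batch_size(1, NUM_GPUS, 8))
```

If the estimate for model states alone is already close to 16 GB, the remaining headroom for activations is small, so reducing resolution, frame count, or per-GPU batch size, or enabling gradient checkpointing where the code supports it, may matter more than the optimizer choice.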

However, these adjustments don't seem to be sufficient, as I continue to encounter OOM errors. I'm seeking advice on further reducing GPU usage. Specifically:

Are there any additional parameters or strategies I can tweak?
If not, how can I transition from a data-parallel strategy to a model-parallel one to better manage memory consumption?
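
To make the second question concrete: in data parallelism every GPU holds a full copy of the model, whereas naive model parallelism places different layers on different GPUs so that each one holds only part of the weights. The sketch below covers only the planning step, assigning contiguous layers to devices so the load is roughly balanced. The function name and the layer sizes are hypothetical and are not taken from this repository; in PyTorch the resulting placement would then be applied with calls such as `module.to("cuda:0")`, with activations moved between devices in the forward pass.

```python
def split_layers(layer_sizes_gb, num_devices):
    """Greedily assign contiguous layers to devices.

    Returns a list with one device index per layer. A new device is started
    once adding the next layer would push the current device past an equal
    share of the total size.
    """
    target = sum(layer_sizes_gb) / num_devices
    placement = []
    device, used = 0, 0.0
    for size in layer_sizes_gb:
        # Move on to the next device if this layer would overshoot the target,
        # unless the current device is empty or is already the last one.
        if used + size > target and used > 0 and device < num_devices - 1:
            device += 1
            used = 0.0
        placement.append(device)
        used += size
    return placement


# Hypothetical per-layer weight sizes in GB, for illustration only.
layers = [2.0, 2.0, 2.0, 2.0]
print(split_layers(layers, num_devices=2))
```

Note that this kind of split reduces the weight memory per GPU but keeps only one GPU busy at a time unless the batch is also pipelined, so it trades throughput for capacity.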

Thank you for your guidance. Have a nice day.

Doubiiu commented 5 months ago

Hi. Thanks for your interest. The adjustments you've listed are all the ways I could think of... I am sorry that I cannot provide other useful suggestions. Hope to get good news from you.

fatbrowncown commented 5 months ago

Nah, never mind, it was just a pop-up question ^^ I've upgraded my hardware to an A100 40GB, and training seems to be working fine, with the GPU using around 39.5 GB of VRAM. Is this normal?

Doubiiu / DynamiCrafter

How can we optimize GPU usage or transition from data parallel to model parallel? #95