Nota-NetsPresso / BK-SDM

A Compressed Stable Diffusion for Efficient Text-to-Image Generation [ECCV'24]

About gpu memory #46

Closed · monkeyCv closed this 9 months ago

monkeyCv commented 10 months ago

Thanks for your great work. May I ask a question about GPU memory? You write:

A toy script can be used to verify the code executability and find the batch size that matches your GPU. With a batch size of 8 (=4×2), training BK-SDM-Base for 20 iterations takes about 5 minutes and 22GB GPU memory.

With a batch size of 256 (=4×64), training BK-SDM-Base for 50K iterations takes about 300 hours and 53GB GPU memory. With a batch size of 64 (=4×16), it takes 60 hours and 28GB GPU memory.

That is, the batch size increases about 32x (from 2 to 64), but GPU memory increases by less than 3x (from 22GB to 53GB). Why does the GPU memory grow so little? Is diffusers more GPU-efficient than pytorch-lightning (which SD v1.5 used)? Thanks very much.

bokyeong1015 commented 9 months ago

Hi, if interested, check out the excellent post in docs/transformers. The components that occupy GPU memory are as follows: 1. Model weights, 2. Optimizer states, 3. Gradients, 4. Forward activations saved for gradient computation, 5. Temporary buffers, 6. Functionality-specific memory.
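
To see why memory grows sub-linearly with batch size: items 1, 2, 3, and 5 above are roughly independent of batch size, while item 4 (saved forward activations) scales with the number of samples resident on the GPU at once. A rough back-of-envelope sketch, assuming the ×4 factor in the batch sizes is gradient accumulation (so only micro-batches of 2, 16, and 64 occupy activation memory at a time) and using the approximate figures quoted above:

```python
# Rough linear model of training GPU memory versus micro-batch size:
#   total ≈ fixed (weights + optimizer states + gradients + buffers)
#           + per_sample (saved activations) * micro_batch
# The GB figures below are the approximate numbers quoted in this thread.
measurements = {2: 22.0, 16: 28.0, 64: 53.0}  # micro-batch size -> GB

per_sample = (measurements[64] - measurements[2]) / (64 - 2)  # ~0.5 GB per sample
fixed = measurements[2] - per_sample * 2                      # ~21 GB batch-independent

for batch, observed in measurements.items():
    predicted = fixed + per_sample * batch
    print(f"micro-batch {batch:>2}: observed ~{observed:.0f} GB, linear model ~{predicted:.0f} GB")

# A 32x larger micro-batch (2 -> 64) only adds ~31 GB of activations on top of
# the ~21 GB fixed cost, which is why total memory grows by less than 3x.
```

Gradient checkpointing mainly shrinks that per-sample activation term, which is why its effect is largest at big batch sizes.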


In our experiments, the torch version and gradient checkpointing have a large effect on how much GPU memory is saved when using large batch sizes; see the table and the measurement sketch below.



  • GPU memory for training BK-SDM-Base on a single A100

    | torch version | gradient checkpointing | batch 8 (4×2) | batch 64 (4×16) | batch 256 (4×64) |
    |---------------|------------------------|---------------|-----------------|------------------|
    | torch 2.0.1   | O                      | 20515MiB      | 27315MiB        | 53477MiB         |
    | torch 2.0.1   | X                      | 20961MiB      | 37975MiB        | OOM              |
    | torch 1.13.1  | O                      | 21917MiB      | 64117MiB        | OOM              |
    | torch 1.13.1  | X                      | 26531MiB      | OOM             | OOM              |

  • Using our script with a quick test setting (max_train_steps=10; checkpointing_steps=4; valid_steps=1)
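
For a quick check outside the training script, a minimal sketch along these lines: load the BK-SDM-Base U-Net, enable gradient checkpointing, run a few dummy training steps, and report the peak allocation. The `nota-ai/bk-sdm-base` checkpoint name, the SD-v1.x dummy tensor shapes, and the AdamW settings here are illustrative assumptions, not the exact training configuration.

```python
import torch
from diffusers import UNet2DConditionModel

device = "cuda"

# Assumed checkpoint name (adjust to your setup); BK-SDM ships full SD pipelines,
# so the U-Net is loaded from the "unet" subfolder.
unet = UNet2DConditionModel.from_pretrained(
    "nota-ai/bk-sdm-base", subfolder="unet", torch_dtype=torch.float32
).to(device)
unet.train()
unet.enable_gradient_checkpointing()  # comment out to compare with the "X" rows above

optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)
micro_batch = 2  # per-step batch actually resident on the GPU

torch.cuda.reset_peak_memory_stats(device)
for _ in range(3):  # a few steps so optimizer states get allocated too
    latents = torch.randn(micro_batch, 4, 64, 64, device=device)       # noisy latents (512x512 images)
    timesteps = torch.randint(0, 1000, (micro_batch,), device=device)  # diffusion timesteps
    text_emb = torch.randn(micro_batch, 77, 768, device=device)        # dummy CLIP hidden states
    noise = torch.randn_like(latents)

    pred = unet(latents, timesteps, encoder_hidden_states=text_emb).sample
    loss = torch.nn.functional.mse_loss(pred, noise)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

peak_gib = torch.cuda.max_memory_allocated(device) / 1024**3
print(f"peak allocated: {peak_gib:.1f} GiB (micro-batch {micro_batch})")
```

Note that `torch.cuda.max_memory_allocated` counts only PyTorch's own allocations, so it will not exactly match numbers read from nvidia-smi, which also include the CUDA context and the allocator's cached blocks.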

Is diffusers more GPU-efficient than pytorch-lightning (which SD v1.5 used)?

  • It's a bit difficult for us to offer an opinion on this comparison; we hope you understand.

monkeyCv commented 9 months ago

Thank you very much.