BAAI-DCAI / Bunny

A family of lightweight multimodal models.
Apache License 2.0

Question about uneven distribution of GPU memory #33

Closed Xiaolong-RRL closed 4 months ago

Xiaolong-RRL commented 5 months ago

Dear author:

Thanks for your interesting work.

During full or LoRA finetuning, the memory usage across different GPUs is uneven (see the attached screenshot).

I wonder whether this is expected, and how to handle it?

Thanks!!

Jancsi9981 commented 4 months ago

I also encountered the same problem. Do you know how to solve it?

RussRobin commented 4 months ago

Hi @Xiaolong-RRL & @Jancsi9981 ,

Thank you for your interest in our work.

Yes, an uneven distribution of GPU memory is expected; don't worry about it. The reason is quite straightforward: say the per-device batch size is 8 and each image is encoded into 729 tokens. However, the text length of each sample differs and is encoded into a different number of tokens, so memory consumption varies across GPUs.

A side verification: when I train Bunny on a dataset where the word count in each QA pair is almost identical, memory consumption on each GPU is almost the same.

You may also wonder why GPU memory consumption doesn't stay roughly constant (while remaining imbalanced) throughout training. This is due to PyTorch's memory reservation (caching) policy: once memory is reserved, it isn't released even if it is not used later, until the process exits. Say in most cases 40 GB is needed, but some samples on GPU:0 in an early step are very long, so torch reserves 70 GB on GPU:0 and won't free the unused 30 GB in the following steps. That's why nvidia-smi keeps showing unbalanced GPU memory consumption.
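
If it helps to see the reservation behaviour directly, here is a minimal standalone sketch (not from the Bunny codebase) contrasting `torch.cuda.memory_allocated()` with `torch.cuda.memory_reserved()`; the reserved figure is roughly what nvidia-smi reports, and it stays at the peak even after the large tensor is freed:

```python
import torch

assert torch.cuda.is_available(), "needs a CUDA device to observe the effect"

def report(tag):
    # memory_allocated: bytes currently used by live tensors
    # memory_reserved: bytes held by the caching allocator (roughly what nvidia-smi shows)
    alloc = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"{tag}: allocated={alloc:.2f} GiB, reserved={reserved:.2f} GiB")

report("start")

# Simulate one unusually long batch (e.g. a few very long text samples).
big = torch.empty(2 * 2**30 // 4, dtype=torch.float32, device="cuda")  # ~2 GiB
report("after the long batch")

del big
torch.cuda.synchronize()
report("after freeing it")  # allocated drops back, reserved stays at the peak
```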

Please feel free to comment further on this issue if anything is still unclear.

Regards,
Russell, BAAI

GewelsJI commented 4 months ago

Hi @RussRobin, thanks for your detailed explanation here.

Actually, you've set the hyperparameter model_max_length=2048, so the total number of input tokens would be the same for every sample, right?

Another question: does the length of the supervised sentence affect GPU memory allocation?

Best, Daniel

RussRobin commented 4 months ago

TL;DR: No. Yes.

model_max_length=2048 sets the upper bound on the number of tokens, and therefore the upper bound on GPU memory.

Here is how tokenization works for Phi-2 and SigLIP with our default parameters: an image is encoded into 729 tokens, and all the text in the QA pairs is encoded as well (in most cases a word maps to 1-2 tokens). So the token length of a sample depends on the length of its QA text. For most samples in Bunny_695k, the token length is no more than 2048.

GPU memory allocation is related to the actual token length, which should be no more than the maximum token length.

You can try it yourself: in Bunny_695k, set the max token length to 4096, and GPU memory won't change much compared with 2048.
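
To make the arithmetic concrete, here is a rough sketch of the per-sample token count (the constants and sample texts are illustrative; this is not Bunny's actual preprocessing code), assuming the Phi-2 tokenizer from Hugging Face:

```python
from transformers import AutoTokenizer

IMAGE_TOKENS = 729        # SigLIP at 384px: 27 x 27 patches -> 729 visual tokens
MODEL_MAX_LENGTH = 2048   # upper bound only, not the actual per-sample length

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

# Two illustrative QA texts of very different lengths.
samples = [
    "What is in the image? A cat sleeping on a sofa.",
    "Describe the scene in detail. " + "The photo shows a busy street market. " * 40,
]

for text in samples:
    text_tokens = len(tokenizer(text).input_ids)
    total = min(IMAGE_TOKENS + text_tokens, MODEL_MAX_LENGTH)
    print(f"text tokens = {text_tokens:4d}, image + text (capped) = {total}")

# Activation memory tracks the actual (padded) sequence length of each batch,
# so a GPU that happens to receive longer QA pairs allocates more memory.
```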

Feel free to reach out if you have further questions!

Regards,
Russell, BAAI

Isaachhh commented 4 months ago

Here is the distribution of GPU memory when training Bunny on a dataset where all samples are identical (see the attached screenshot).

GewelsJI commented 4 months ago

> Here is the distribution of GPU memory when training Bunny on a dataset where all samples are identical (see the attached screenshot).

What batch size did you set in this test?

GewelsJI commented 4 months ago

> TL;DR: No. Yes. model_max_length=2048 sets the upper bound on the number of tokens, and therefore the upper bound on GPU memory. [...]

I appreciate your detailed response here. Thanks.

Isaachhh commented 4 months ago

> What batch size did you set in this test?

8 per GPU, as the default in the training script.

GewelsJI commented 4 months ago

Awesome, thank you guys.

Cheers, Daniel.

Xiaolong-RRL commented 4 months ago

> Yes, an uneven distribution of GPU memory is expected; don't worry about it. [...]

Thanks for your detailed and kind reply!

RussRobin commented 4 months ago

I'll close this issue since it seems that we have reached a consensus on it. Thank you again for your interest in our work.