Out of memory using the default training configuration

JacobYuan7 commented 2 months ago

Hi, many thanks for your great work.

I am trying to use the default script for training. I find that even if I use batch_size=1, training runs out of memory. I am wondering what might cause the problem. I'd appreciate any suggestions.


[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/workspace/xxxx/Lumina-mGPT-main/lumina_mgpt/finetune_solver.py", line 114, in <module>
[rank0]:     solver.run()
[rank0]:   File "/mnt/workspace/xxxx/Lumina-mGPT-main/xllmx/solvers/finetune/finetune.py", line 518, in run
[rank0]:     train_stats = self.train_one_epoch(
[rank0]:   File "/mnt/workspace/xxxx/Lumina-mGPT-main/xllmx/solvers/finetune/finetune.py", line 620, in train_one_epoch
[rank0]:     self.optimizer.step()
[rank0]:   File "/mnt/workspace/xxxx/conda-envs/lumina-mgpt-5/lib/python3.10/site-packages/torch/optim/optimizer.py", line 391, in wrapper
[rank0]:     out = func(*args, **kwargs)
[rank0]:   File "/mnt/workspace/xxxx/conda-envs/lumina-mgpt-5/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
[rank0]:     ret = func(self, *args, **kwargs)
[rank0]:   File "/mnt/workspace/xxxx/conda-envs/lumina-mgpt-5/lib/python3.10/site-packages/torch/optim/adamw.py", line 177, in step
[rank0]:     has_complex = self._init_group(
[rank0]:   File "/mnt/workspace/xxxx/conda-envs/lumina-mgpt-5/lib/python3.10/site-packages/torch/optim/adamw.py", line 128, in _init_group
[rank0]:     state["exp_avg_sq"] = torch.zeros_like(
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 
exp name: 7B-8

xiexing0916 commented 1 month ago

I ran into the same problem as well. It seems that 48 GB of GPU memory is not enough for the default full training. Do you have a solution for this?

ChrisLiu6 commented 1 month ago

Hi, thank you for your interest in our work! Could you tell me the type and number of your GPUs? Since we use FSDP during training, more GPUs will still lower the GPU memory requirement even when batch-size is set to 1.
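As a rough back-of-the-envelope illustration of why more GPUs help: the traceback above fails while AdamW allocates its `exp_avg_sq` state, and with FSDP that state, along with the parameters and gradients, is sharded across ranks. The sketch below assumes a 7B-parameter model (suggested by the experiment name `7B-8`), fp32 parameters and gradients, two fp32 Adam moments, and full sharding. It ignores activations and temporary buffers. The helper name and byte counts are illustrative assumptions, not values taken from the repository.

```python
def fsdp_state_gib_per_gpu(num_params, num_gpus,
                           param_bytes=4, grad_bytes=4, adam_state_bytes=8):
    """Estimate per-GPU memory (GiB) for parameters, gradients, and AdamW state.

    Assumes everything is evenly sharded across `num_gpus` (full sharding) and
    that AdamW keeps two fp32 moments per parameter (exp_avg and exp_avg_sq),
    hence 8 bytes of optimizer state per parameter.
    """
    total_bytes = num_params * (param_bytes + grad_bytes + adam_state_bytes)
    return total_bytes / num_gpus / 2**30


if __name__ == "__main__":
    # Hypothetical 7B-parameter model on 1, 2, 4, and 8 GPUs.
    for n in (1, 2, 4, 8):
        print(f"{n} GPU(s): {fsdp_state_gib_per_gpu(7e9, n):.1f} GiB per GPU")
```

Under these assumptions the model state alone is roughly 104 GiB on one GPU and roughly 13 GiB per GPU on eight, which is consistent with a single 48 GB card running out of memory even at batch size 1.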

JacobYuan7 commented 1 month ago

> I ran into the same problem as well. It seems that 48 GB of GPU memory is not enough for the default full training. Do you have a solution for this?

Just run it on more GPUs.

JacobYuan7 commented 1 month ago

@ChrisLiu6 Many thanks for your feedback. I can run it by simply using more GPUs.

I have a follow-up question on the CFG technique. In the paper, you mentioned that "To make CFG work, during training, the context before is randomly dropped by a probability of 10%." May I ask where this is implemented in the codebase? Many thanks!

ChrisLiu6 commented 1 month ago

> @ChrisLiu6 Many thanks for your feedback. I can run it by simply using more GPUs.
>
> I have a follow-up question on the CFG technique. In the paper, you mentioned that "To make CFG work, during training, the context before is randomly dropped by a probability of 10%." May I ask where this is implemented in the codebase? Many thanks!

The implementation of context drop is absent from the released code because our implementation is highly data-format-dependent and needs modification whenever the data format changes.

As a reference implementation, you may add the following code after https://github.com/Alpha-VLLM/Lumina-mGPT/blob/104abe453ec1acca5863698629c4db2111b0b3fc/lumina_mgpt/finetune_solver.py#L24

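        # Requires `import random` at the top of the file.
        if tokens[-2] == labels[-2] == 8196 and tokens.count(8196) == 1:  # image generation data
            if random.random() < 0.1:  # drop the context with 10% probability
                # Keep only positions whose label is not -100 (the ignored positions).
                tokens = labels = [t for t in labels[:-1] if t != -100]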
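
For readers who want to try the idea outside the solver, the snippet above can be wrapped as a standalone function. This is only a sketch under assumptions: the function name, the `drop_prob` and `rng` parameters, and the constant names are hypothetical, the thread does not state what token id 8196 represents, and `labels[:-1]` is kept exactly as posted.

```python
import random

IGNORE_INDEX = -100  # label value at positions excluded from the loss
MARKER_ID = 8196     # token id checked in the snippet above; meaning not stated in the thread


def maybe_drop_context(tokens, labels, drop_prob=0.1, rng=random):
    """Randomly replace an image-generation sample with its context-free version.

    A sample counts as image-generation data when the second-to-last token and
    label both equal MARKER_ID and MARKER_ID occurs exactly once in `tokens`.
    With probability `drop_prob`, only the positions whose label is not
    IGNORE_INDEX are kept (the last position is dropped, as in the snippet).
    """
    is_image_generation = (
        len(tokens) >= 2
        and len(labels) >= 2
        and tokens[-2] == labels[-2] == MARKER_ID
        and tokens.count(MARKER_ID) == 1
    )
    if is_image_generation and rng.random() < drop_prob:
        kept = [t for t in labels[:-1] if t != IGNORE_INDEX]
        # Return two separate lists so later in-place edits do not alias.
        return kept, list(kept)
    return tokens, labels


if __name__ == "__main__":
    # Toy sample: three context tokens (labels masked) followed by supervised tokens.
    toks = [1, 2, 3, 50, 51, MARKER_ID, 9]
    labs = [IGNORE_INDEX] * 3 + [50, 51, MARKER_ID, 9]
    print(maybe_drop_context(toks, labs, drop_prob=1.0))
```

Passing `drop_prob=1.0` forces the drop, which makes the behavior easy to inspect on a single sample. Unlike the snippet, this version returns two separate lists rather than one shared list.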

JacobYuan7 commented 1 month ago

@ChrisLiu6
Many thanks for your prompt feedback.

As I understand it, for image generation, this is quite important for the CFG to work. For Omnipotent SFT, we should not do this random drop. Is my understanding of "our implementation is highly data-format-dependent" correct?

Btw, labels[:-1] drops the last token. Is this intentional or a mistake?

Alpha-VLLM / Lumina-mGPT

Out of memory using the default training configuration #28