Open JacobYuan7 opened 2 months ago
I ran into the same problem. It seems that 48 GB of memory is not enough for the default full training. Do you have a solution for this?
Hi, thank you for your interest in our work! Could you tell me the type and number of your GPUs? Because we train with FSDP, which shards the model state across GPUs, using more GPUs lowers the per-GPU memory requirement even when the batch size is set to 1.
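To make the FSDP point concrete, here is a rough back-of-the-envelope sketch of how the sharded model state shrinks per GPU as GPUs are added. The numbers are assumptions, not measurements from this repository: a hypothetical 7B-parameter model, full sharding, and 16 bytes per parameter (a common mixed-precision Adam accounting: 2 for half-precision weights, 2 for gradients, 12 for fp32 master weights and the two Adam moments). Activations and temporary buffers are ignored, so real usage will be higher.

```python
def per_gpu_state_gib(num_params: float, num_gpus: int, bytes_per_param: int = 16) -> float:
    """Rough per-GPU memory (GiB) for fully sharded weights, gradients and optimizer state.

    Assumes the state is split evenly across GPUs and ignores activations,
    communication buffers and fragmentation.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return num_params * bytes_per_param / num_gpus / 2**30


if __name__ == "__main__":
    assumed_params = 7e9  # hypothetical model size, not taken from the thread
    for gpus in (1, 2, 4, 8):
        print(f"{gpus} GPU(s): ~{per_gpu_state_gib(assumed_params, gpus):.1f} GiB of model state each")
```

Under these assumptions the sharded state alone exceeds a 48 GB card on one or two GPUs, which is consistent with the advice to add GPUs rather than lower the batch size further.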
Just run on more graphics cards.
@ChrisLiu6 Many thanks for your feedback. I can run it by simply using more GPUs.
I have a follow-up question on the CFG technique. The paper says: "To make CFG work, during training, the context before is randomly dropped by a probability of 10%." Where is this implemented in the codebase? Many thanks!
The context drop is not included in the released code because our implementation depends heavily on the data format and has to be modified whenever the format changes.
As a reference implementation, you can add the following code after https://github.com/Alpha-VLLM/Lumina-mGPT/blob/104abe453ec1acca5863698629c4db2111b0b3fc/lumina_mgpt/finetune_solver.py#L24
if tokens[-2] == labels[-2] == 8196 and tokens.count(8196) == 1:  # image generation data
    if random.random() < 0.1:
        tokens = labels = [_ for _ in labels[:-1] if _ != -100]
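For anyone who wants to try the snippet outside the training script, below is a minimal standalone sketch of the same logic. The function name `maybe_drop_context`, the constant names, the `drop_prob` and `rng` parameters, and the toy token ids are all made up for illustration; only the condition, the 8196 id, the -100 filter and the `labels[:-1]` slice come from the snippet above. It assumes, as the filter suggests, that context positions are marked with -100 in `labels`. Unlike the original one-liner, it returns two separate lists instead of one shared list.

```python
import random

IGNORE_INDEX = -100  # label value filtered out by the reference snippet
MARKER_ID = 8196     # token id checked by the reference snippet; its meaning is not stated in the thread


def maybe_drop_context(tokens, labels, drop_prob=0.1, rng=random):
    """Return (tokens, labels), dropping the context with probability drop_prob.

    Only samples that match the reference condition (treated there as
    image generation data) are eligible for the drop.
    """
    is_image_generation = (
        len(tokens) >= 2  # extra guard, not in the reference snippet
        and tokens[-2] == labels[-2] == MARKER_ID
        and tokens.count(MARKER_ID) == 1
    )
    if is_image_generation and rng.random() < drop_prob:
        # Same slice as the reference snippet: labels[:-1] also removes the
        # final token, and everything labelled -100 is filtered out.
        kept = [t for t in labels[:-1] if t != IGNORE_INDEX]
        return kept, list(kept)
    return tokens, labels


if __name__ == "__main__":
    # Toy sample: three context tokens (masked in labels), then the target tokens.
    tokens = [11, 12, 13, 21, 22, 8196, 2]
    labels = [-100, -100, -100, 21, 22, 8196, 2]
    print(maybe_drop_context(tokens, labels, drop_prob=1.0))  # force the drop
    print(maybe_drop_context(tokens, labels, drop_prob=0.0))  # never drop
```

Passing `drop_prob=1.0` or `0.0` makes the behaviour deterministic, which is handy for checking the function against your own data format before enabling the 10% drop.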
@ChrisLiu6
Many thanks for your prompt feedback.
As I understand it, for image generation, this is quite important for the CFG to work. For Omnipotent SFT, we should not do this random drop. Is my understanding of "our implementation is highly data-format-dependent" correct?
Btw, labels[:-1] drops the last token. Is this intentional or a mistake?
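As background on why the random drop matters for CFG: at sampling time, classifier-free guidance needs two predictions from the same model, one with the context and one without it, and the second is only meaningful if the model saw context-free samples during training. Below is a minimal sketch of the commonly used combination rule on next-token logits. The function name `cfg_logits` and the example values are hypothetical, and the exact rule used in this codebase is not stated in the thread.

```python
def cfg_logits(cond, uncond, guidance_scale):
    """Combine conditional and unconditional logits with classifier-free guidance.

    guidance_scale = 1.0 returns the conditional logits unchanged;
    larger values push the result further away from the unconditional ones.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    cond = [2.0, 0.0]    # logits predicted with the context
    uncond = [1.0, 1.0]  # logits predicted with the context dropped
    print(cfg_logits(cond, uncond, guidance_scale=3.0))
```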
Hi, many thanks for your great work.
I am trying to use the default script for training. I find that even if I use batch_size=1, training runs out of memory. I am wondering what might cause the problem. I'd appreciate any suggestions.