leloykun opened 3 months ago
Cool, you are right, we'll fix it
Thanks! <3
I'm also currently working on adding native support on Transformers for image generation (image-only & interleaved image-text) with Chameleon & Anole here: https://github.com/huggingface/transformers/pull/32013 . I'll add native support for this project too when I'm done with these two ^^.
I'm curious: how much of an effect did adding the CFG & the z-loss have?
Thank you so much for your support! Please feel free to reach out to us if you ever need any assistance.
Regarding CFG and z-loss:
The z-loss is extremely important for full finetuning. In our experience with the 7B model, without z-loss, the training process collapses EVERY TIME just after a few hundred iterations. Note that we have also tried logging the z-loss value without involving its gradient in training, and we find that when z-loss is included in training, its value typically stabilizes between 100 and 200. However, when it is not included, the value quickly escalates into the thousands, indicating a fundamental surge in the norm of the logits.
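For reference, here is a rough sketch of how a z-loss term can be added next to the standard cross-entropy loss, in the spirit of the squared log-partition penalty from PaLM. The coefficient and the -100 ignore-index convention are illustrative assumptions, not necessarily the exact values used in Lumina-mGPT:

```python
import torch
import torch.nn.functional as F

def ce_with_z_loss(logits: torch.Tensor, labels: torch.Tensor, z_coef: float = 1e-5):
    """Cross-entropy plus a z-loss penalty on the log-partition function."""
    vocab_size = logits.size(-1)
    ce = F.cross_entropy(
        logits.reshape(-1, vocab_size), labels.reshape(-1), ignore_index=-100
    )

    # z = logsumexp over the vocabulary; penalizing z^2 keeps the logit norm from drifting.
    z = torch.logsumexp(logits.float(), dim=-1)        # (batch, seq_len)
    mask = (labels != -100).float()
    z_loss = (z.pow(2) * mask).sum() / mask.sum().clamp(min=1)

    return ce + z_coef * z_loss, z_loss                # return z_loss separately for logging
```

Monitoring the second return value is what the "100 to 200 vs. thousands" comparison above refers to.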
As for CFG, it is not indispensable, but its impact is still significant. Lumina-mGPT can produce high-quality images without CFG, but using CFG significantly increases the probability of generating good samples, and it helps achieve a better balance between content richness and structural coherence.
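At inference time, CFG for an autoregressive model boils down to running a conditional and an unconditional forward pass and mixing the next-token logits before sampling. A minimal sketch (the function name and default guidance scale are illustrative, not the exact sampling code in this repo):

```python
import torch

def cfg_next_token_logits(cond_logits: torch.Tensor,
                          uncond_logits: torch.Tensor,
                          guidance_scale: float = 3.0) -> torch.Tensor:
    """Classifier-free guidance applied to next-token logits.

    cond_logits / uncond_logits have shape (batch, vocab_size) and come from the
    conditional and unconditional forward passes at the current decoding step.
    guidance_scale = 1.0 recovers plain conditional sampling.
    """
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)
```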
without z-loss, the training process collapses EVERY TIME just after a few hundred iterations
Interesting...
In my finetuning runs, I've observed massive loss and gradient spikes at the beginning, but they eventually stabilized after just a few iterations
But that might just be because I'm starting from a really small learning rate (1e-5) & using a large batch size (32 across 8 GPUs)? How about yours?
I'll run a couple of finetuning runs w/ my setup + the z-loss and report back 🫡
This is an experiment with lr=2e-5, batch size 512 across 16 GPUs (with FSDP and checkpointing), and one epoch takes 3692 iterations. We can see that the loss first drops and then rises. In some of the other experiments, the loss would reach inf. We also find that the stability seems to be related to the data distribution and task difficulty. For example, when we finetune Chameleon on fixed 512x512 images (which means the resolution is consistent with Chameleon pretraining), the procedure tends to be more stable than training with variable-aspect-ratio images.
Oh wow, that looks cursed
As for my case, I used DeepSpeed to do DDP & finetuned on 64x64 images (constant size) from an OOD dataset (with a lot of white background). I tried finetuning on the 512x512 version of the dataset, but my models ended up mode collapsing 😅
Perhaps the z-loss is indeed the missing piece
https://github.com/Alpha-VLLM/Lumina-mGPT/blob/c8e180aa20f0a5977bf168424f30aa2be58fad94/lumina_mgpt/model/modeling_xllmx_chameleon.py#L50
The mask should be calculated using the shifted labels (labels shifted one token to the left), as in ChameleonModelForConditionalGeneration.forward.
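Concretely, something like the following (variable names are hypothetical; the point is only that the mask has to be built from the shifted labels so it stays aligned with the shifted logits the loss is computed on):

```python
import torch

def shifted_loss_mask(logits: torch.Tensor, labels: torch.Tensor, ignore_index: int = -100):
    # logits[:, t] predicts labels[:, t + 1], so the loss is computed on shifted tensors.
    shift_logits = logits[:, :-1, :].contiguous()   # (batch, seq_len - 1, vocab_size)
    shift_labels = labels[:, 1:].contiguous()       # (batch, seq_len - 1)

    # Any per-token weighting / masking must therefore come from the *shifted* labels
    # so that it lines up with shift_logits.
    mask = shift_labels != ignore_index
    return shift_logits, shift_labels, mask
```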