vae bf16 training loss nan

PKU-YuanGroup / Open-Sora-Plan

This project aims to reproduce Sora (OpenAI's T2V model). We hope the open-source community will contribute to it.

MIT License

10.97k stars, 978 forks

vae bf16 training loss nan #265

Open lbwang2006 opened 2 months ago

lbwang2006 commented 2 months ago

When I train the VAE in bf16 with pytorch_lightning, the loss becomes NaN. How can I solve this?

LinB203 commented 2 months ago

Did you enable the GAN loss? We have also run into this; it happens after roughly 30-50k steps. It is not a serious problem: just resume training.

lbwang2006 commented 2 months ago

> Did you enable the GAN loss? We have also run into this; it happens after roughly 30-50k steps. It is not a serious problem: just resume training.

Yes, I enabled the GAN loss. The loss is NaN and does not recover. Is the only option to restart the training script from the latest good checkpoint?
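
As a rough illustration of "resume from the latest good checkpoint", the sketch below finds the first step with a non-finite loss and picks the newest checkpoint saved before it. The helper names, the `step=<N>.ckpt` file naming, and the logged values are all hypothetical and not taken from the Open-Sora-Plan code.

```python
import math
import re
from typing import Iterable, Optional, Tuple

# Hypothetical naming scheme: checkpoints are saved as "step=<N>.ckpt".
_STEP_RE = re.compile(r"step=(\d+)\.ckpt$")


def first_nan_step(losses: Iterable[Tuple[int, float]]) -> Optional[int]:
    """Return the first step whose logged loss is NaN or inf, else None."""
    for step, loss in losses:
        if not math.isfinite(loss):
            return step
    return None


def latest_good_checkpoint(paths: Iterable[str], bad_step: int) -> Optional[str]:
    """Pick the newest checkpoint saved strictly before `bad_step`."""
    best = None
    for path in paths:
        match = _STEP_RE.search(path)
        if match is None:
            continue  # ignore files that do not follow the assumed naming
        step = int(match.group(1))
        if step < bad_step and (best is None or step > best[0]):
            best = (step, path)
    return best[1] if best else None


# Made-up log and checkpoint list for illustration.
logged = [(29000, 0.41), (30000, 0.39), (31000, float("nan"))]
ckpts = ["ckpt/step=20000.ckpt", "ckpt/step=30000.ckpt", "ckpt/step=40000.ckpt"]

bad = first_nan_step(logged)
resume_from = latest_good_checkpoint(ckpts, bad)
print(bad, resume_from)
```

With PyTorch Lightning, the selected path can then be passed as `trainer.fit(model, ckpt_path=resume_from)`. If the loss turns NaN again shortly after resuming, an earlier checkpoint may be the safer choice.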

lbwang2006 commented 2 months ago

Also, is the GAN loss necessary if it so easily leads to a NaN loss?

qqingzheng commented 2 months ago

> Also, is the GAN loss necessary if it so easily leads to a NaN loss?

The GAN loss plays a crucial role in preserving high-frequency information and should not be omitted.

LinB203 commented 2 months ago

In v1.0.0 we didn't use the GAN loss. In v1.1.0 the VAE's capabilities will be vastly improved.

lbwang2006 commented 2 months ago

> In v1.0.0 we didn't use the GAN loss. In v1.1.0 the VAE's capabilities will be vastly improved.

In the config of the current causalvae, the loss type is opensora.models.ae.videobase.losses.LPIPSWithDiscriminator. Doesn't that mean the GAN loss has already been used?

qqingzheng commented 2 months ago

> > In v1.0.0 we didn't use the GAN loss. In v1.1.0 the VAE's capabilities will be vastly improved.
>
> In the config of the current causalvae, the loss type is opensora.models.ae.videobase.losses.LPIPSWithDiscriminator. Doesn't that mean the GAN loss has already been used?

Sorry about that. Because of an earlier code refactoring, the config.json file was added after the released causalvae had been trained. The released model was definitely trained without a GAN.

antonioo-c commented 2 months ago

Thanks for the great project. When will you release the new version of the training code?

LinB203 commented 2 months ago

> Thanks for the great project. When will you release the new version of the training code?

This month.

awei-6 commented 2 months ago

https://github.com/PKU-YuanGroup/Open-Sora-Plan/blob/bec0e8523840f34cd7e687cb6fe6fb92ba3f991c/opensora/models/ae/videobase/losses/perceptual_loss.py#L95C1-L95C88

The nll_grads value can easily exceed the precision that bf16 can represent, so I recommend training in float32 rather than with AMP.
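
To make the precision concern concrete, here is a small pure-Python emulation of bfloat16 rounding. It is only a sketch: `to_bf16` is a hypothetical helper, not part of the project or of PyTorch, and it does not reproduce the linked line. It illustrates the general limitation that bf16 keeps only 8 significant bits, so small differences and increments are rounded away.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even).

    bfloat16 is the upper 16 bits of an IEEE float32: 1 sign bit,
    8 exponent bits and 7 stored mantissa bits. This is an emulation
    for illustration only; NaN and inf inputs are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits about to be dropped.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Near 1.0 the spacing between bf16 values is 2**-7, so 1.001 collapses to 1.0.
print(to_bf16(1.001))

# Near 256 the spacing is 2.0, so 257 is not representable.
print(to_bf16(257.0))

# A running sum kept in bf16 stops growing once the increment is
# smaller than half the spacing between neighbouring values.
acc = 0.0
for _ in range(1000):
    acc = to_bf16(acc + 1.0)
print(acc)
```

The accumulator stalls at 256.0 even though 1000 ones were added. Quantities built from many small contributions, such as gradient norms, can be distorted in a similar way, which is consistent with the suggestion to run this part of training in float32.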

lbwang2006 commented 2 months ago

> In v1.0.0 we didn't use the GAN loss. In v1.1.0 the VAE's capabilities will be vastly improved.

But I found loss.discrimator in the v1.1.0 VAE weights...
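
One way to check which submodules a checkpoint contains is to group its parameter names by prefix. The sketch below does this on a made-up list of keys; the helper name and every key name are illustrative assumptions, and a real checkpoint will have different keys.

```python
from collections import Counter
from typing import Iterable


def count_by_prefix(keys: Iterable[str], depth: int = 2) -> Counter:
    """Count parameter names by their first `depth` dot-separated parts."""
    return Counter(".".join(key.split(".")[:depth]) for key in keys)


# Made-up key names for illustration; real checkpoints will differ.
example_keys = [
    "encoder.conv_in.weight",
    "encoder.conv_in.bias",
    "decoder.conv_out.weight",
    "loss.discriminator.main.0.weight",
    "loss.discriminator.main.0.bias",
    "loss.logvar",
]

counts = count_by_prefix(example_keys)
has_disc = any(key.startswith("loss.discriminator") for key in example_keys)
print(counts)
print(has_disc)
```

For a real file, the keys would come from something like `torch.load(path, map_location="cpu")`; PyTorch Lightning checkpoints usually nest the weights under a `"state_dict"` entry. Finding discriminator keys shows that the module existed when the checkpoint was saved; on its own it does not show how the loss was weighted during training.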