PKU-YuanGroup / Open-Sora-Plan

This project aims to reproduce Sora (OpenAI's T2V model); we hope the open-source community will contribute to this project.
MIT License

After fine-tuning the model, inference still produces noise. #396

Open wtjiang98 opened 2 months ago

wtjiang98 commented 2 months ago

I fine-tuned the 93x480p model with my own collected video dataset and added pose guidance as a control signal. Here's a graph of my training loss; the training seems to be going pretty well: image

However, when I run inference with the saved checkpoint, it produces a noisy result:

https://github.com/user-attachments/assets/69ef8ac9-e5fb-4731-bc84-921159a9be55

Then I tried simply decoding the latent obtained from the predicted noise during training, via the following code:

output_latent = noisy_model_input - model_pred
video = vae.decode(output_latent)

https://github.com/user-attachments/assets/cac7a519-0c85-448e-9873-9d296deefe75

The result looks normal. I am wondering where the problem is and how I can solve it. Thank you!
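
For reference, for an epsilon-prediction model the clean latent is usually reconstructed from the scheduler's noise schedule rather than by plain subtraction. A minimal sketch, assuming a diffusers DDPMScheduler with a schedule matching training and that model_pred is the predicted noise for the timesteps of that training step (variable names follow the snippet above and are illustrative):

from diffusers import DDPMScheduler

# Assumption: `noisy_model_input`, `model_pred`, `timesteps`, and `vae` come from
# the training step, the latents are 5D video latents (B, C, T, H, W), and the
# model predicts epsilon (noise).
scheduler = DDPMScheduler(num_train_timesteps=1000)
alphas_cumprod = scheduler.alphas_cumprod.to(noisy_model_input.device)
alpha_prod_t = alphas_cumprod[timesteps].view(-1, 1, 1, 1, 1)  # broadcast over (B, C, T, H, W)

# x0 = (x_t - sqrt(1 - a_t) * eps) / sqrt(a_t), the standard DDPM estimate of the clean latent
pred_clean_latent = (noisy_model_input - (1 - alpha_prod_t).sqrt() * model_pred) / alpha_prod_t.sqrt()
video = vae.decode(pred_clean_latent)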

BTW: my issue is quite similar to #394. It seems to be a common problem.

spacegoing commented 2 months ago

@wtjiang98 genius

wtjiang98 commented 2 months ago

I still don't know how to solve the problem. Should we modify the inference code?

spacegoing commented 2 months ago

Ah, I see. I thought you had solved the issue, LoL. May I ask where you added this part?

output_latent = noisy_model_input - model_pred
video = vae.decode(output_latent)

Since you can decode it without noise, why isn't the problem solved?

spacegoing commented 2 months ago

It's here https://github.com/PKU-YuanGroup/Open-Sora-Plan/blob/842435c016d358b6e6aa1ec788cbb593c5cdec32/opensora/train/train_t2v_diffusers.py#L631

LinB203 commented 2 months ago

For fine-tuning, we offer the following suggestions:

  1. Reduce the learning rate. We recommend a learning rate of 1e-5 to 1e-6 for fine-tuning.
  2. If additional modules have been added, load the pre-trained weights, zero-initialize the new modules, and run inference. This verifies that the initialization and the code are correct (see the sketch below).
  3. Always keep an eye on the loss curve. If there is a loss spike, the model has very likely collapsed; resume training from the nearest checkpoint. If training collapses frequently, consider increasing the batch size or reducing the learning rate further.
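
A minimal sketch of the check in suggestion 2, assuming a PyTorch backbone with an added control branch; `base_model`, `model_with_control`, and the forward signature are illustrative, not the repo's actual API:

import torch

@torch.no_grad()
def check_zero_init(base_model, model_with_control, latents, timesteps, text_emb, pose):
    # With the control branch zero-initialized and the pre-trained weights loaded,
    # the augmented model should reproduce the base model's prediction exactly
    # (up to numerical noise), so a clean video should come out immediately.
    ref = base_model(latents, timesteps, text_emb)
    out = model_with_control(latents, timesteps, text_emb, pose)
    assert torch.allclose(ref, out, atol=1e-5), "zero-init or weight loading is broken"
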
spacegoing commented 2 months ago

Hi Lin,

Many thanks for your reply.

My loss curve is very similar to the posted one: the loss decreases without any significant spike over the first few thousand steps. And decoding straight from the training script (https://github.com/PKU-YuanGroup/Open-Sora-Plan/blob/842435c016d358b6e6aa1ec788cbb593c5cdec32/opensora/train/train_t2v_diffusers.py#L631) outputs clear video without noise.

Yet inference from the saved checkpoint outputs noisy videos.

I noticed the model is trained with the DDPM scheduler but inference uses the Euler scheduler; could this cause the instability? How can I debug this further?

My loss curve FYI; the learning rate is --learning_rate=1e-4:

image
wtjiang98 commented 2 months ago

I also suspected it was a scheduler problem, but I changed the inference scheduler to DDPM and still get noisy videos...
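
For anyone running the same check, a minimal sketch of swapping the sampler in a diffusers-style pipeline; `pipe` is illustrative, and Open-Sora-Plan's own sample script may construct its scheduler elsewhere:

from diffusers import DDPMScheduler

# Rebuild the scheduler from the existing config so the noise schedule
# (betas, prediction_type, etc.) stays consistent with training; only the
# sampling algorithm changes from Euler to DDPM.
pipe.scheduler = DDPMScheduler.from_config(pipe.scheduler.config)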

spacegoing commented 2 months ago

Me too; switching to DDPM at inference didn't help for me either.

wtjiang98 commented 2 months ago

I tried loading the pre-trained weights and using zero init for inference, and it looks good, so the initialization and the code seem correct. But when I switch to my saved checkpoint, it still outputs noisy videos.

My question is: how can the loss curve look good and the validation during training look good, yet inference from the saved checkpoints still fail?

LinB203 commented 2 months ago

You can see a loss spike at around step 6.4k in your curve. Could you run inference with the 6k checkpoint?

LinB203 commented 2 months ago

Do you see any loss spikes? Maybe the learning rate is too high.

wtjiang98 commented 2 months ago

The following is the loss curve without any smoothing. There seems to be no loss spike, but the loss at the beginning is quite large. I will try changing my learning rate to 1e-5 or 1e-6.

image

spacegoing commented 2 months ago

Yeah, I tried checkpoints both before (500, 2500, 5000) and after (12500) the spike; they all output noisy videos.

My question is: how can the loss curve look good and the validation during training look good, yet inference from the saved checkpoints still fail?

And this is precisely my confusion as well.

LinB203 commented 2 months ago

For training from scratch, this is normal. The point of confusion for me is that if you are zero-initializing on top of the pre-trained weights, then you should see reasonable results very quickly, and they should not be noisy. And from your descriptions, the zero-init model loaded with the pre-trained weights can already be decoded directly into a normal video.

I think the zero-init model should have a loss comparable to the pre-trained model's loss, but apparently it does not: its starting loss shouldn't be at 1.x. In my experience, a loss that starts at 1.x is no different from training from scratch.
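
A minimal sketch of this sanity check, i.e. measuring the loss of the freshly initialized fine-tuning model on one batch before any optimizer step; all names (model, vae, scheduler, batch inputs) and the forward/encode interfaces are illustrative, not the repo's exact API:

import torch
import torch.nn.functional as F

@torch.no_grad()
def initial_loss(model, vae, scheduler, video, text_emb):
    # Encode one training batch, add noise at random timesteps, and compute the
    # epsilon-prediction loss the same way the training loop does.
    latents = vae.encode(video)  # interface simplified for illustration
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (latents.shape[0],), device=latents.device)
    noisy = scheduler.add_noise(latents, noise, t)
    pred = model(noisy, t, text_emb)
    return F.mse_loss(pred.float(), noise.float())

# If the pre-trained weights are loaded correctly, this should be roughly 0.0x,
# not ~1.x (which is about what a randomly initialized model gives against unit-variance noise).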

spacegoing commented 2 months ago

Many thanks for your reply!

Would you please elaborate on how to do "zero-init"? For now I'm not introducing any new modules into the model; I'm simply fine-tuning on my custom cartoon dataset.

LinB203 commented 2 months ago

Well, if you didn't add any new modules, the situation is even more unusual. The loss shouldn't be that high after loading the pre-trained weights. Please make sure your pre-trained weights are actually being loaded. Can you post your script?
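
A minimal sketch of one way to confirm the weights are really applied, assuming a plain PyTorch state dict; the checkpoint path and `model` are illustrative:

import torch

# Hypothetical path; adapt to your own checkpoint layout.
state_dict = torch.load("pretrained/93x480p/diffusion_pytorch_model.bin", map_location="cpu")

# strict=False reports what did NOT match instead of raising, so you can inspect it.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
# Any key in `missing` beyond your newly added modules means the pre-trained
# weights were not applied to those layers.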

spacegoing commented 2 months ago

Hi Lin,

Good news: after loading the pre-trained weights, my SFT works now! Thanks so much!

Is there any QR code / sponsor link that I can use to buy you a beer?

Thanks again for your help!

wtjiang98 commented 2 months ago

So your problem was that you didn't load the pre-trained model and trained directly on your custom cartoon dataset, and loading the pre-trained model solved it?

spacegoing commented 2 months ago

Yes, exactly, LoL. I'd also love to know what zero-init means, though...

wtjiang98 commented 2 months ago

When you add new modules and inputs (like a depth map or skeleton image) to the original model, zero-init requires you to set the weights and biases of the new modules' output layers to zero at the beginning of training. That way, the newly added inputs and modules have no effect at first, which makes fine-tuning more stable.
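
A minimal sketch of this pattern, assuming a ControlNet-style control branch in PyTorch; the module, shapes, and forward signature are illustrative:

import torch.nn as nn

class PoseControlBranch(nn.Module):
    # Illustrative control branch: encodes a pose/skeleton map and adds its
    # features to the backbone's hidden states.
    def __init__(self, in_channels: int, hidden_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(in_channels, hidden_dim, kernel_size=3, padding=1),
            nn.SiLU(),
        )
        self.proj_out = nn.Conv3d(hidden_dim, hidden_dim, kernel_size=1)
        # Zero-init: the branch contributes exactly nothing at step 0, so the
        # fine-tuned model starts out identical to the pre-trained one.
        nn.init.zeros_(self.proj_out.weight)
        nn.init.zeros_(self.proj_out.bias)

    def forward(self, hidden_states, pose):
        return hidden_states + self.proj_out(self.encoder(pose))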

I thought Lin's comment ("there is no difference between a loss starting at 1.x and training from scratch") was inspiring. I think I should check my zero-init.

What I have learned through all of this is that the loss curve reflects only very limited information when training a diffusion model.

wtjiang98 commented 2 months ago

It was the zero-init problem. After using the correct zero-init, the loss begins at 0.03 and the inference results are no longer noise. Many thanks to @LinB203 for the insightful suggestion.

1KE-JI commented 2 months ago

@spacegoing

Hi bro, may I ask what worked for you, and what hyperparameters you used to train 480p? When I lowered the learning rate to 1e-6, the inference was still noisy.

spacegoing commented 2 months ago

As Lin suggested (I guess you missed the training section in the README), there are two modes: loading from a checkpoint and pretrain mode.

When changing the dataset, you need to use the pretrain mode.

1KE-JI commented 2 months ago

Yes, I do use the --pretrained parameter to load the Open-Sora-Plan 93x480p checkpoint, and the training loss begins at 0.09. I want to ask what the "pretrain mode" is.

spacegoing commented 2 months ago

@1KE-JI That's exactly what I meant. In that case we have different problems. Maybe reopen this issue or create a new one with more details, and I'll help you debug.