dvlab-research / ControlNeXt

Controllable video and image Generation, SVD, Animate Anyone, ControlNet, ControlNeXt, LoRA
Apache License 2.0

Training Experiments and Insights #14

Closed BJQ123456 closed 2 months ago

BJQ123456 commented 2 months ago

I feel like there are a lot of errors in the paper.

Pbihao commented 2 months ago

Hello, thanks for your feedback!

Yes, the first arXiv paper was a bit rushed and lacks some refinement. Since this project is still under development, it might not fully align with the paper at this stage.

We will continue to refine the paper until the final version is ready.

We would greatly appreciate any help you can offer in correcting it!

BJQ123456 commented 2 months ago

Thanks for your reply. I have a few questions:

1. The text mentions "plug-and-play," but the base model was trained. Isn't this contradictory?

2. Is there no comparison with methods like ControlNet?

3. Formulas (8) and (9) seem incorrect.

4. Figure 5 does not explain what the three results represent.

5. Figure 8 is unclear.

6. Figure 9 does not explain the conditions of the experiment on the left side.

Pbihao commented 2 months ago

Hello, thanks for your questions. I think they are all good ones! I will share more details.

  1. One of the most important findings is that directly training the base model yields better performance than methods like LoRA, Adapter, and others. Even when we train the base model, we select only a small subset of the pre-trained parameters, so this does not conflict with the 'plug and play' concept. You can think of it as a specialized version of LoRA, only more direct and straightforward (a minimal sketch of this selective training appears at the end of this comment).

  1.1. We would also like to share more of our experience. As mentioned, we select only a small subset of parameters, and this works well for the SD1.5 and SDXL backbones: by training fewer than 100 million parameters, we still achieve excellent performance. However, this is not suitable for SD3 and SVD training. After SDXL, Stability faced significant legal risks due to the generation of highly realistic human images, so they stopped refining later models, such as SVD and SD3, on human-related data to avoid potential risks. To achieve optimal performance, it is therefore necessary to first continue training SVD or SD3 on human-related data to develop a robust backbone before fine-tuning. Of course, you can also combine the continual pretraining and the fine-tuning; that is why we directly provide the full SVD parameters. Although this may not be directly related to academia, it is crucial for achieving good performance.
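As a minimal sketch of what "training a small subset of the pre-trained parameters" can look like in practice, assuming a diffusers SD1.5 UNet. The selection rule (`attn1`) is a hypothetical placeholder, not the paper's exact choice:

```python
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

# Freeze the whole backbone first.
unet.requires_grad_(False)

# Unfreeze only a small named subset of the pre-trained weights.
trainable = []
for name, param in unet.named_parameters():
    if "attn1" in name:  # hypothetical selection rule, not the exact one used
        param.requires_grad = True
        trainable.append(param)

print(f"Training {sum(p.numel() for p in trainable) / 1e6:.1f}M parameters")
optimizer = torch.optim.AdamW(trainable, lr=1e-5)
```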
Pbihao commented 2 months ago

Since the paper has limited space, I would like to share additional experiences. Can I change the title to 'Training Experiments and Insights'? @BJQ123456

SVD-related

  1. Data. Due to privacy policies, we are unable to share the data. However, data quality is crucial, as many videos on the internet are highly compressed. It's important to focus on collecting high-quality data.
  2. Pose alignment (thanks to mimic). SVD performs poorly, especially with large motions, so it is important to avoid large movements and shifts. Please note that during preprocessing there is an alignment step between the reference image and the pose. This is crucial.
  3. Hands. Generating hands is a challenging problem in both video and image generation. To address this, we focus on the following strategies: a. Use clear and high-quality data, which is crucial for accurate generation. b. Since the hands occupy a relatively small area, we apply a larger weight to the loss function in this region to improve generation quality (a minimal sketch follows this list).
  4. Magic number. You will find that we adopt a magic number when adding the conditions. You can adjust this (see the second sketch at the end of this comment).
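For point 3b, here is a minimal sketch of region-weighted loss, assuming you already have a per-pixel hand mask (e.g. rasterized from hand keypoints). The mask source and the weight value are assumptions, not the exact training recipe:

```python
import torch.nn.functional as F

def weighted_diffusion_loss(model_pred, target, hand_mask, hand_weight=5.0):
    """Upweight the training loss inside the hand region.

    model_pred, target: (B, C, H, W) predicted and target noise.
    hand_mask: (B, 1, H, W) in [0, 1], 1 inside hand regions
               (e.g. rasterized from hand keypoints; assumed given).
    hand_weight: how much more the hand pixels count; a value to tune.
    """
    # If training in latent space, downsample the mask to latent resolution.
    mask = F.interpolate(hand_mask, size=model_pred.shape[-2:], mode="nearest")
    weight = 1.0 + (hand_weight - 1.0) * mask
    loss = F.mse_loss(model_pred, target, reduction="none")
    return (loss * weight).mean()
```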

We spent a lot of time finding these tips and are now sharing them all with you. We hope they help!
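And for point 4, the "magic number" is simply a scalar applied to the control features before they are merged into the backbone. A toy sketch follows; the value 0.2 and the names are illustrative, so check the released code for the actual value and injection logic:

```python
import torch

def inject_condition(hidden_states: torch.Tensor,
                     control_features: torch.Tensor,
                     cond_scale: float = 0.2) -> torch.Tensor:
    # Illustrative only: rescale the control features before adding them
    # to the backbone's hidden states. cond_scale is the "magic number";
    # the actual value lives in the released ControlNeXt code.
    return hidden_states + cond_scale * control_features
```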

Pbihao commented 2 months ago

2. We have compared the efficiency and training convergence. More detailed results will be added later.

3. These have been corrected.

4., 5., 6. Thanks, we will refine these details.

BJQ123456 commented 2 months ago

Thank you for your response; it was very helpful.

BJQ123456 commented 2 months ago

There's one point I still don't understand. If it's plug-and-play, does that mean that after training I can use it directly on other models? But during training a part of the base model was trained, so if I insert it into a new model, that part hasn't been trained and could lead to a performance drop, right? Thanks for your reply.

Pbihao commented 2 months ago

Yes, as demonstrated in our experiments with SD1.5 and SDXL, we trained on a single backbone and then ran experiments across various backbones. The results show that our method provides effective control on different backbones.

We also considered this concern and initially attempted to store the weight increments, the same as LoRA but without the low-rank compression. However, we eventually found that this step was unnecessary.
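For anyone who wants to try that increment idea anyway, here is a minimal sketch, assuming plain `state_dict`s and that you know which keys were trained. It mirrors LoRA's merge step, just with full-rank deltas:

```python
def extract_deltas(finetuned_sd, base_sd, trained_keys):
    """Store full-rank weight increments for the trained subset,
    like LoRA but without low-rank compression."""
    return {k: finetuned_sd[k] - base_sd[k] for k in trained_keys}

def apply_deltas(target_sd, deltas, alpha=1.0):
    """Add the stored increments onto another backbone's weights."""
    merged = dict(target_sd)
    for k, d in deltas.items():
        merged[k] = merged[k] + alpha * d
    return merged

# Usage: deltas = extract_deltas(ft.state_dict(), base.state_dict(), keys)
#        new_model.load_state_dict(apply_deltas(new_model.state_dict(), deltas))
```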

BJQ123456 commented 2 months ago

I get it, thanks

nighting0le01 commented 1 month ago

Hi, could the authors also shed some light on combining this with IP-Adapters? Will it cause any issues to make it work with IP-Adapters? Does it work well with pretrained IP-Adapters?