G-U-N / AnimateLCM

[SIGGRAPH ASIA 2024 TCS] AnimateLCM: Computation-Efficient Personalized Style Video Generation without Personalized Video Data
https://animatelcm.github.io
MIT License

Issue in using training AnimateLCM SVD #22

Open habibian opened 6 months ago

habibian commented 6 months ago

Thanks for the great work, also for releasing the training script train_svd_lcm.py.

I am trying to reproduce the results using the provided train_svd_lcm.py, but halfway through training (20,000 of 50,000 iterations) I don't see any improvement in either the loss value or the generation quality (training on a single A100 on WebVid2M).

Could you please confirm whether I should set the hyper-parameters as follows?

    accelerate launch train_svd_lcm.py \
        --pretrained_model_name_or_path=stabilityai/stable-video-diffusion-img2vid-xt \
        --per_gpu_batch_size=1 --gradient_accumulation_steps=1 \
        --max_train_steps=50000 \
        --width=576 \
        --height=320 \
        --checkpointing_steps=1000 --checkpoints_total_limit=1 \
        --learning_rate=1e-6 --lr_warmup_steps=1000 \
        --seed=123 \
        --adam_weight_decay=1e-3 \
        --mixed_precision="fp16" \
        --N=40 \
        --validation_steps=500 \
        --enable_xformers_memory_efficient_attention \
        --gradient_checkpointing \
        --output_dir="outputs"

In the current train_svd_lcm.py, the model is trained at 576x320 resolution, which is much lower than the standard SVD resolution, i.e., 1024x576. Wouldn't this cause a problem, given that normal (non-LCM) SVD struggles to generate lower-resolution videos?

Any input is much appreciated :)

G-U-N commented 6 months ago

Hi, thanks for the interest!

G-U-N commented 6 months ago

I would also say that the default hyper-parameters in the training script are not carefully tuned and are likely sub-optimal. For example, using EMA should generally increase the generation stability.
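To make the EMA remark concrete, here is a minimal sketch of an exponential-moving-average update over a flat list of parameter values. The function name `ema_update` and the decay value are assumptions for illustration only; they are not taken from train_svd_lcm.py.

```python
def ema_update(ema_params, online_params, decay=0.95):
    """Return updated EMA values: decay * ema + (1 - decay) * online.

    `decay` is a hypothetical value chosen for illustration, not the
    setting used in the training script.
    """
    if len(ema_params) != len(online_params):
        raise ValueError("parameter lists must have the same length")
    return [decay * e + (1.0 - decay) * p
            for e, p in zip(ema_params, online_params)]


# Toy example: the EMA copy moves only a small step towards the online weights.
ema = [0.0, 1.0]
online = [1.0, 1.0]
ema = ema_update(ema, online, decay=0.9)
```

A decay closer to 1 makes the averaged weights change more slowly, which is why an EMA copy tends to give smoother validation samples than the raw online weights.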

habibian commented 6 months ago

Thanks for the swift response :)

I have now switched to 4xA100 and unfortunately still see vague blobs, as in the attachment. I am curious to know at what iteration I should expect the generations to start looking like a video. :)

Thanks!

step_5000_val_img_7_2_1steps

G-U-N commented 6 months ago

The uploaded results seem abnormal. They should not flash like this with unnatural colours. Here's what I obtained when training at 576x320.

Training beginning, 0-iter, cfg = 1, inference step = 4 step_1_val_img_7_1_4steps

10k iter, cfg=1, inference step = 4 step_14500_val_img_7_1_4steps

The devices are 8 A800 GPUs, and the batch size is set to 8 without gradient accumulation.
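Since the setups in this thread differ in GPU count, it may help to compare effective batch sizes. The sketch below assumes the usual data-parallel arithmetic (per-GPU batch size × number of GPUs × gradient accumulation steps). Whether "batch size 8" above means 8 in total or 8 per GPU is not stated, so both readings are shown.

```python
def effective_batch_size(per_gpu_batch_size, num_gpus, gradient_accumulation_steps=1):
    """Number of samples contributing to one optimizer step under data parallelism."""
    return per_gpu_batch_size * num_gpus * gradient_accumulation_steps


# Settings from the launch command in the first post, on 1 and 4 GPUs.
single_a100 = effective_batch_size(per_gpu_batch_size=1, num_gpus=1)
four_a100 = effective_batch_size(per_gpu_batch_size=1, num_gpus=4)

# Two possible readings of "8 GPUs, batch size 8" (an assumption either way).
eight_gpus_total_8 = effective_batch_size(per_gpu_batch_size=1, num_gpus=8)
eight_gpus_per_gpu_8 = effective_batch_size(per_gpu_batch_size=8, num_gpus=8)
```

Under either reading, the single-GPU run sees far fewer samples per step, so matching the iteration count alone does not match the amount of data seen.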

G-U-N commented 6 months ago

I just found that the code at this line had a typo, and I fixed it. I hope it did not mislead you.

habibian commented 6 months ago

Amazing! It starts to look good after fixing the typo.

ThanQ :)

G-U-N commented 6 months ago

Awesome! Very glad to hear that : D.

habibian commented 6 months ago

Hey Fu-Yun,

After fixing the typo, I have been training the model on 8xA100s, which should be exactly like your setting then. Unfortunately, I still can't match your generations:

Training beginning, 0-iter, cfg = 1, inference step = 4 step_1_val_img_7_1_4steps

10k iter, cfg=1, inference step = 4 step_10000_val_img_7_1_4steps

20k iter, cfg=1, inference step = 4 step_20000_val_img_7_1_4steps

Any suggestion on why this is happening?

I suspect it might be from the data. Currently I am training on WebVid2M-train (results_2M_train.csv with 2.5M videos) without any particular subsampling (based on resolution, content, etc.). Could you please elaborate a bit on your training data?

Also, my dataloader does not do any particular transformation/augmentation except for normalizing pixel values to [-1, 1]. Would be great if you can share your WebVid dataloader if there is any particular detail missing.
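For reference, the pixel normalization mentioned above can be sketched as follows, assuming 8-bit input frames with values in [0, 255]. The helper names are hypothetical and this is not the released dataset.py.

```python
def to_unit_range(pixel):
    """Map an 8-bit pixel value in [0, 255] to a float in [-1, 1]."""
    if not 0 <= pixel <= 255:
        raise ValueError("expected an 8-bit pixel value")
    return pixel / 127.5 - 1.0


def from_unit_range(value):
    """Inverse mapping back to an 8-bit value, clipped to [0, 255]."""
    pixel = int(round((value + 1.0) * 127.5))
    return min(255, max(0, pixel))


# Round trip for a mid-grey pixel.
restored = from_unit_range(to_unit_range(128))
```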

Again, thanks a lot for your great contribution :)

G-U-N commented 6 months ago

Hey @habibian, just uploaded an example dataset.py.

In addition to that, I would recommend freezing all the convolutional layers when training, because convolutional layers seem to be more vulnerable to fine-tuning.

Hope this helps you get better performance.

habibian commented 6 months ago

Thanks for the response @G-U-N .

Regarding freezing the convolutional layers, do you mean the ones in the ResBlocks? Is it part of your implementation, or do I need to implement it?

Thanks!

G-U-N commented 6 months ago

Hi @habibian,

Yes, the ResBlocks. That was not implemented in the training script. But it should be easy to achieve that through modifying this line.

habibian commented 6 months ago

Hey @G-U-N ,

Thanks for the input. Following your suggestion, I kept the conv layers in the resblocks frozen during training as follows:

    for name, para in unet.named_parameters():
        # freeze resnet convs as suggested in https://github.com/G-U-N/AnimateLCM/issues/22#issuecomment-2094802365
        if 'conv' in name and not ('conv_in' in name or 'conv_out' in name):
            para.requires_grad = False
        else:
            para.requires_grad = True
            parameters_list.append(para)
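The name filter in the loop above can be checked in isolation before launching a run. The sketch below applies the same substring test to a handful of parameter names; the names are illustrative assumptions modelled on typical SVD UNet naming and are not dumped from the actual model.

```python
def is_frozen_conv(name):
    """Same condition as in the loop above: freeze every 'conv' parameter
    except those of conv_in and conv_out."""
    return "conv" in name and not ("conv_in" in name or "conv_out" in name)


# Hypothetical parameter names, for illustration only.
example_names = [
    "conv_in.weight",
    "down_blocks.0.resnets.0.spatial_res_block.conv1.weight",
    "down_blocks.0.resnets.0.spatial_res_block.norm1.weight",
    "down_blocks.0.downsamplers.0.conv.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_q.weight",
    "conv_norm_out.weight",
    "conv_out.weight",
]

frozen = [n for n in example_names if is_frozen_conv(n)]
```

Because this is a plain substring match, it also freezes any non-resblock parameter whose name happens to contain "conv" (for example a downsampler conv or a `conv_norm_out` layer, if the model has one), while the resblock norm layers stay trainable. Printing the frozen names for the real model is a cheap way to confirm the filter does what is intended.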

I actually observe some improvement in training with this modification:

Convs Frozen: 20k iter, cfg=1, inference step = 4 step_20000_val_img_7_1_4steps_frozen

All Finetuned: 20k iter, cfg=1, inference step = 4 step_20000_val_img_7_1_4steps

However, my trained models still have much lower quality than the SVD checkpoint that you have released. Here are some more test examples to give you an idea of how poor the quality of my replication is. So I wonder whether you trained the SVD checkpoint the way I am doing here, or whether there are some differences, e.g., in code, data, etc.

Thanks a lot for your guidance and support in replicating your excellent work :)

Convs Frozen: 20k iter, cfg=1, inference step = 4 step_20000_val_img_4_1_4steps step_20000_val_img_3_1_4steps step_20000_val_img_2_1_4steps step_20000_val_img_1_1_4steps

G-U-N commented 6 months ago

Hey @habibian, I would say there's not too much difference. The only difference is that I tried to freeze more weights at the beginning of training instead of fully fine-tuning. I didn't do much ablation on that due to my limited GPU resources.

What about trying this:

    for name, para in unet.named_parameters():
        if "transformer_block" in name and "temporal_transformer_block" not in name:
            para.requires_grad = True
            parameters_list.append(para)

Again, I would recommend logging the generated videos in resolution 1024 x 576. You will not get ideal results on low resolutions even if you train the model successfully.

LMK if you get better results.
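One detail of the filter suggested above is easy to miss: "transformer_block" is a substring of "temporal_transformer_block", so the second condition is what keeps the temporal blocks out. A small sketch with hypothetical parameter names (assumed for illustration, not taken from the model):

```python
def is_spatial_transformer_param(name):
    """Spatial transformer blocks only: exclude the temporal ones explicitly."""
    return "transformer_block" in name and "temporal_transformer_block" not in name


def is_temporal_transformer_param(name):
    """Temporal transformer blocks only."""
    return "temporal_transformer_block" in name


# Hypothetical names for illustration.
spatial = "down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_q.weight"
temporal = "down_blocks.0.attentions.0.temporal_transformer_blocks.0.attn1.to_q.weight"
resblock = "down_blocks.0.resnets.0.spatial_res_block.conv1.weight"

# Without the exclusion, the plain substring test would also match temporal blocks.
naive_matches_temporal = "transformer_block" in temporal
```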

G-U-N commented 6 months ago

Hi @habibian, just checking in to see if you have any updates. Hope everything is going well on your end!

habibian commented 6 months ago

Hey @G-U-N

Thanks for the suggestion and your great support here, much appreciated!

Following your last suggestion, instead of finetuning everything except the resblocks, I am now finetuning only the spatial_transformer_blocks, which actually improves the results as follows:

Finetuning all except resblocks: 20k iter, cfg=1, inference step = 4 step_20000_val_img_2_1_4steps

Finetuning spatial_transformer_blocks: 20k iter, cfg=1, inference step = 4 step_19000_val_img_2_1_4steps

Finetuning all except resblocks: 20k iter, cfg=1, inference step = 4 step_20000_val_img_3_1_4steps

Finetuning spatial_transformer_blocks: 20k iter, cfg=1, inference step = 4 step_19000_val_img_3_1_4steps

Finetuning all except resblocks: 20k iter, cfg=1, inference step = 4 step_20000_val_img_4_1_4steps

Finetuning spatial_transformer_blocks: 20k iter, cfg=1, inference step = 4 step_19000_val_img_4_1_4steps

Finetuning all except resblocks: 20k iter, cfg=1, inference step = 4 step_20000_val_img_7_1_4steps

Finetuning spatial_transformer_blocks: 20k iter, cfg=1, inference step = 4 step_19000_val_img_7_1_4steps

And, here are the 1024 x 576 generated videos using my trained checkpoint (compared to your released checkpoint):

Finetuning spatial_transformer_blocks: 20k iter, cfg=1, inference step = 4 000000

Your released AnimateLCM-SVD-xt-1.1 checkpoint: ? iter, cfg=1, inference step = 4 000000

Finetuning spatial_transformer_blocks: 20k iter, cfg=1, inference step = 4 000000

Your released AnimateLCM-SVD-xt-1.1 checkpoint: ? iter, cfg=1, inference step = 4 000000

As you can see, there is still a gap in generation quality, and I am not sure how it can be reduced. Was the released checkpoint trained for 50K iterations? Was any particular multi-stage training or lr scheduling involved?

Thanks :)

G-U-N commented 6 months ago

Hey @habibian. Very glad to see the improvement! And I really appreciate the detailed visual ablations.

I actually conducted the training in two stages.

Additionally, some more iterations at larger resolutions will help enhance the performance.

Hope this leads to better performance!

habibian commented 6 months ago

Hey @G-U-N ,

Great, thanks for the elaboration. I will follow this multi stage training and get back to you about results.

For that, could you please describe a bit the details of the large resolution training? More specifically:

Thanks!

G-U-N commented 6 months ago

@habibian

The details:

  • Training videos: bilinearly interpolated WebVid-2M. If you have another video dataset with larger resolution, that will be great.
  • Resolution: with only the spatial transformer blocks tuned, an 80 GB GPU should be able to train at 1024x576. With only the temporal transformer blocks tuned, an 80 GB GPU should be able to train at 768x448.
  • Iterations: 10k~30k
  • Learning rate: 1e-6

ersanliqiao commented 6 months ago

Hey @habibian. Very glad to see the improvement! And I really appreciate the detailed visual ablations.

I actually conducted the training in two-stage.

  • 30k iterations with only spatial transformer block tuned with learning rate 1e-6.
        if "temporal_transformer_block" not in name and "transformer_block" in name
  • 50k iterations with only temporal transformer block tuned with learning rate 3e-7. (The temporal weights of SVD is relatively large and vulnerable.)
        if "temporal_transformer_block" in name

Additionally, some more iterations on larger resolutions will help enhance the performance.

Hope this will make better performance!

Hi, I think that in stage 2 the unet should be initialized from the weights saved in stage 1, while the target unet and the teacher unet should be initialized from the Stability SVD-XT weights. Am I right? The code does not seem to support this, though.

G-U-N commented 6 months ago

Hey @ersanliqiao.

You should load the unet and target unet from your finetuned weight and initialize the teacher unet with stability weight.

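Try this at this line:

    from safetensors.torch import load_file

    # Load the fine-tuned weights from the previous stage onto the CPU.
    finetuned_weight = load_file("xxx.safetensors", device="cpu")
    # Initialize both the unet and the target unet from the fine-tuned weights;
    # the teacher unet keeps the original Stability weights.
    unet.load_state_dict(finetuned_weight)
    target_unet.load_state_dict(finetuned_weight)
    del finetuned_weight

ersanliqiao commented 6 months ago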

thank you!!

dreamyou070 commented 5 months ago

Hi @habibian, can I ask why you are trying to train the model? I am trying to use the AnimateLCM model, but have not yet checked whether training gives better results. Do you have any specific reason?

habibian commented 5 months ago

hi @dreamyou070

I needed to retrain AnimateLCM on a different UNet that runs faster than the standard SVD architecture.

haohang96 commented 5 months ago

Hi @G-U-N, thanks for your great open-source work

I have some questions about loss weighting when training svd-lcm (codes): loss = torch.mean(weights) * ...,

where the weights are defined here:

self.weights = (1/(self.sigmas[:-1] - self.sigmas[1:]))**0.1

This formulation seems a bit different from the representation of λn in the arXiv paper: $$\lambda_n = \left(1 - \delta \frac{n}{N}\right)^{\gamma}$$

I'd like to know if the formulation used in the code is based on any reference paper or if it is just a heuristic setting.
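To make the comparison concrete, here is a small sketch that evaluates both weightings side by side in plain Python. It assumes the sigma schedule is strictly decreasing; the sigma values and the choices of δ and γ are hypothetical and are not taken from the code or the paper.

```python
def code_weights(sigmas, power=0.1):
    """Weights as in the quoted line: (1 / (sigma_i - sigma_{i+1})) ** 0.1."""
    gaps = [a - b for a, b in zip(sigmas[:-1], sigmas[1:])]
    if any(g <= 0 for g in gaps):
        raise ValueError("sigmas must be strictly decreasing")
    return [(1.0 / g) ** power for g in gaps]


def paper_weight(n, N, delta, gamma):
    """lambda_n = (1 - delta * n / N) ** gamma, as written in the question."""
    return (1.0 - delta * n / N) ** gamma


# Hypothetical schedule: the gaps shrink as sigma decreases.
sigmas = [8.0, 4.0, 2.0, 1.0]
weights = code_weights(sigmas)

# Hypothetical delta and gamma, evaluated at the two ends of n in [0, N].
lambda_first = paper_weight(0, 40, delta=0.5, gamma=1.0)
lambda_last = paper_weight(40, 40, delta=0.5, gamma=1.0)
```

With this toy schedule the code's weights grow as the sigma gap shrinks, whereas the paper's form depends only on the step index n, which illustrates why the two are not the same function.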

G-U-N commented 5 months ago

Hey, @haohang96 . Yes, I would say the choice of weights is very heuristic and hard to give an explicit analysis. Most designs are heuristic and should be sub-optimal.

weleen commented 3 months ago

@habibian Hi, have you obtained results similar to the released AnimateLCM-svd-xt? I fine-tuned the spatial transformer layers for 30k iterations, and the results appear as blurry as what you've shown above.

bird-8014191_1280-576-1024-4-1-1-False halloween-4585684_1280-576-1024-4-1-1-False leaf-7260246_1280-576-1024-4-1-1-False squirrel-7985502_1280-576-1024-4-1-1-False woman-4549327_1280-576-1024-4-1-1-False

weleen commented 3 months ago

trainable parameters are set as follows:


    unet.requires_grad_(False)
    parameters_list = []

    # Customize the parameters that need to be trained; if necessary, you can uncomment them yourself.

    for name, para in unet.named_parameters():
        if args.training_stage == 1:
            # Stage 1: 30k iterations with only the spatial transformer blocks tuned, learning rate 1e-6.
            # With only the spatial transformer blocks tuned, an 80 GB GPU should be able to train at 1024x576.
            if "temporal_transformer_blocks" not in name and "transformer_blocks" in name:
                para.requires_grad = True
                parameters_list.append(para)
        elif args.training_stage == 2:
            # Stage 2: 50k iterations with only the temporal transformer blocks tuned, learning rate 3e-7.
            # (The temporal weights of SVD are relatively large and vulnerable.)
            # With only the temporal transformer blocks tuned, an 80 GB GPU should be able to train at 768x448.
            if "temporal_transformer_blocks" in name:
                para.requires_grad = True
                parameters_list.append(para)
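The per-stage settings discussed in this thread can also be kept in one place rather than in comments. The sketch below is only an illustration: `StageConfig`, `STAGES` and `get_stage_config` are hypothetical names, and the values are the ones quoted earlier in the thread (learning rate and iteration count per stage, plus the resolution an 80 GB GPU is said to handle for each block type).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    tuned_blocks: str          # which transformer blocks are trainable
    learning_rate: float
    max_train_steps: int
    max_resolution_80gb: tuple  # (width, height) quoted for an 80 GB GPU


# Values as quoted in the thread; the structure itself is an assumption.
STAGES = {
    1: StageConfig("spatial", 1e-6, 30_000, (1024, 576)),
    2: StageConfig("temporal", 3e-7, 50_000, (768, 448)),
}


def get_stage_config(training_stage):
    """Look up the settings for a training stage, failing loudly on bad input."""
    try:
        return STAGES[training_stage]
    except KeyError:
        raise ValueError(f"unknown training stage: {training_stage!r}") from None


stage2 = get_stage_config(2)
```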