Open habibian opened 6 months ago
Hi, thanks for the interest!
I would also say that the default hyper-parameters in the training script are not carefully tuned and are probably sub-optimal. For example, using EMA should generally improve generation stability.
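For readers unfamiliar with EMA, the update can be sketched as below. This is a minimal illustration in plain Python on a dict of floats, not the update used in train_svd_lcm.py; the function name `ema_update` and the decay value 0.95 are assumptions made for the example.

```python
def ema_update(ema_params, params, decay=0.95):
    """Blend the current weights into the EMA copy, in place.

    ema <- decay * ema + (1 - decay) * current
    `decay` is an illustrative value, not one taken from the training script.
    """
    for name, value in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params


# Toy example: a scalar "weight" jumps from 0.0 to 1.0 and stays there.
ema = {"w": 0.0}
for _ in range(3):
    ema_update(ema, {"w": 1.0})
```

The EMA copy moves only slowly towards the new value, which is why sampling from it tends to be less sensitive to noisy individual updates.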
Thanks for the swift response :)
I have now switched to 4xA100 and unfortunately still see vague blobs, as in the attachment. At what iteration should I expect the generations to start looking like a video? :)
Thanks!
The uploaded results look abnormal: the videos should not flash like this or show unnatural colours. Here is what I obtained when training at 576x320.
Training beginning, 0-iter, cfg = 1, inference step = 4
10k iter, cfg=1, inference step = 4
The devices are 8 A800 GPUs, and the batch size is set to 8 without gradient accumulation.
I just found the code at this line was a typo, and I fixed it. Just hope it did not mislead you.
Amazing! It starts to look good after fixing the typo.
ThanQ :)
Awesome! Very glad to hear that : D.
Hey Fu-Yun,
After fixing the typo, I have been training the model on 8xA100s, which should be exactly like your setting then. Unfortunately, I still can't match your generations:
Training beginning, 0-iter, cfg = 1, inference step = 4
10k iter, cfg=1, inference step = 4
20k iter, cfg=1, inference step = 4
Any suggestion on why this is happening?
I suspect it might be the data. Currently I am training on WebVid2M-train (results_2M_train.csv with 2.5M videos) without any subsampling (based on resolution, content, etc.). Could you please elaborate a bit on your training data?
Also, my dataloader does not do any particular transformation/augmentation except for normalizing pixel values to [-1, 1]. Would be great if you can share your WebVid dataloader if there is any particular detail missing.
Again, thanks a lot for your great contribution :)
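For concreteness, the pixel normalization mentioned above can be written as follows. This is a small sketch in plain Python, assuming 8-bit inputs in [0, 255]; the helper name `normalize_pixels` is hypothetical and not part of the repository.

```python
def normalize_pixels(values):
    """Map 8-bit pixel values in [0, 255] to floats in [-1, 1]."""
    return [v / 127.5 - 1.0 for v in values]


# The two extremes map to -1 and 1; mid-grey lands close to 0.
pixels = normalize_pixels([0, 128, 255])
```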
Hey @habibian, just uploaded an example dataset.py.
In addition to that, I would recommend freezing all the convolutional layers during training, because convolutional layers seem to be more vulnerable to fine-tuning.
Hope this will help for better performance.
Thanks for the response @G-U-N .
Regarding freezing the convolutional layers, do you mean the ones in the ResBlocks? Is it part of your implementation, or do I need to implement it?
Thanks!
Hi @habibian,
Yes, the ResBlocks. That was not implemented in the training script. But it should be easy to achieve that through modifying this line.
Hey @G-U-N ,
Thanks for the input. Following your suggestion, I kept the conv layers in the resblocks frozen during training as:
for name, para in unet.named_parameters():
    # freeze resnet convs as suggested in https://github.com/G-U-N/AnimateLCM/issues/22#issuecomment-2094802365
    if 'conv' in name and not ('conv_in' in name or 'conv_out' in name):
        para.requires_grad = False
    else:
        para.requires_grad = True
        parameters_list.append(para)
I actually observe some improvements in training with this modification as:
Convs Frozen: 20k iter, cfg=1, inference step = 4
All Finetuned: 20k iter, cfg=1, inference step = 4
However, my trained models still have much lower quality than the SVD checkpoint that you have released. Here are some more test examples to give you an idea of how poor the quality of my replications is. So I wonder whether you trained the SVD checkpoint the same way I am doing here, or whether there are some differences, e.g., in code, data, etc.?
Thanks a lot for your guidance and support in replicating your excellent work :)
Convs Frozen: 20k iter, cfg=1, inference step = 4
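The name filter used for freezing the convs can be checked offline on a few parameter names before launching a run. The sketch below mirrors that condition in a standalone function; the function name `is_frozen` and the parameter names in the list are hypothetical examples, not names read from the actual SVD UNet.

```python
def is_frozen(name):
    """Mirror of the filter above: freeze every conv except conv_in / conv_out."""
    return "conv" in name and not ("conv_in" in name or "conv_out" in name)


# Hypothetical parameter names, for illustration only.
names = [
    "conv_in.weight",
    "down_blocks.0.resnets.0.conv1.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_q.weight",
    "down_blocks.0.downsamplers.0.conv.weight",
    "conv_out.weight",
]
frozen = [n for n in names if is_frozen(n)]
trainable = [n for n in names if not is_frozen(n)]
```

Note that a substring test on "conv" matches any parameter whose name contains it, so under these example names the downsampler conv is frozen as well as the ResBlock conv.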
Hey @habibian, I would say there isn't much difference. The only difference is that I tried freezing more weights at the beginning of training instead of fully fine-tuning. I didn't do much ablation on that due to my limited GPU resources.
What about trying this:
for name, para in unet.named_parameters():
    if "transformer_block" in name and "temporal_transformer_block" not in name:
        para.requires_grad = True
        parameters_list.append(para)
Again, I would recommend logging the generated videos in resolution 1024 x 576. You will not get ideal results on low resolutions even if you train the model successfully.
LMK if you get better results.
Hi @habibian, just checking in to see if you have any updates. Hope everything is going well on your end!
Hey @G-U-N
Thanks for the suggestion and your great support here, much appreciated!
Following your last suggestion, instead of finetuning all except resblocks, I am now only finetuning spatial_transformer_blocks, which is actually improving the results as follows:
Finetuning all except resblocks: 20k iter, cfg=1, inference step = 4
Finetuning spatial_transformer_blocks: 20k iter, cfg=1, inference step = 4
Finetuning all except resblocks: 20k iter, cfg=1, inference step = 4
Finetuning spatial_transformer_blocks: 20k iter, cfg=1, inference step = 4
Finetuning all except resblocks: 20k iter, cfg=1, inference step = 4
Finetuning spatial_transformer_blocks: 20k iter, cfg=1, inference step = 4
Finetuning all except resblocks: 20k iter, cfg=1, inference step = 4
Finetuning spatial_transformer_blocks: 20k iter, cfg=1, inference step = 4
And, here are the 1024 x 576 generated videos using my trained checkpoint (compared to your released checkpoint):
Finetuning spatial_transformer_blocks: 20k iter, cfg=1, inference step = 4
Your released AnimateLCM-SVD-xt-1.1 checkpoint: ? iter, cfg=1, inference step = 4
Finetuning spatial_transformer_blocks: 20k iter, cfg=1, inference step = 4
Your released AnimateLCM-SVD-xt-1.1 checkpoint: ? iter, cfg=1, inference step = 4
As you can see, there is still a gap in generation quality, and I am not sure how to reduce it. Was the released checkpoint trained for 50K iterations? Was any particular multi-stage training or lr scheduling involved?
Thanks :)
Hey @G-U-N ,
Great, thanks for the elaboration. I will follow this multi stage training and get back to you about results.
For that, could you please describe a bit the details of the large resolution training? More specifically:
Thanks!
@habibian
The details:
- Training videos: bilinearly interpolated WebVid-2M. If you have another video dataset with a larger resolution, that would be great.
- Resolution: when tuning only the spatial transformer blocks, an 80 GB GPU should be able to train at 1024x576; when tuning only the temporal transformer blocks, an 80 GB GPU should be able to train at 768x448.
- Iterations: 10k~30k
- Learning rate: 1e-6
Hey @habibian. Very glad to see the improvement! And I really appreciate the detailed visual ablations.
I actually conducted the training in two stages.
- 30k iterations with only the spatial transformer blocks tuned, with learning rate 1e-6.
    if "temporal_transformer_block" not in name and "transformer_block" in name
- 50k iterations with only the temporal transformer blocks tuned, with learning rate 3e-7. (The temporal weights of SVD are relatively large and vulnerable.)
    if "temporal_transformer_block" in name
Additionally, some more iterations at larger resolutions will help enhance the performance.
Hope this will make better performance!
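The two-stage recipe can be summarised as a small selection helper. This is a sketch under stated assumptions rather than code from the repository: the `STAGES` table and the function `is_trainable` are hypothetical names, the learning rates and iteration counts are the ones quoted in this thread, and the parameter names are invented for illustration.

```python
# Settings quoted in this thread; the dict layout itself is an assumption.
STAGES = {
    1: {"lr": 1e-6, "iterations": 30_000},  # spatial transformer blocks only
    2: {"lr": 3e-7, "iterations": 50_000},  # temporal transformer blocks only
}


def is_trainable(name, stage):
    """Return True if the parameter `name` should be tuned in `stage`."""
    if stage == 1:
        # "transformer_block" is a substring of "temporal_transformer_block",
        # so the temporal blocks have to be excluded explicitly.
        return "temporal_transformer_block" not in name and "transformer_block" in name
    if stage == 2:
        return "temporal_transformer_block" in name
    raise ValueError(f"unknown stage: {stage}")


# Hypothetical parameter names, for illustration only.
spatial = "down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_q.weight"
temporal = "down_blocks.0.attentions.0.temporal_transformer_blocks.0.attn1.to_q.weight"
resnet = "down_blocks.0.resnets.0.conv1.weight"

stage1 = [n for n in (spatial, temporal, resnet) if is_trainable(n, 1)]
stage2 = [n for n in (spatial, temporal, resnet) if is_trainable(n, 2)]
```

In an actual run, the same predicate would be applied to `unet.named_parameters()` after calling `unet.requires_grad_(False)`, with the learning rate taken from the matching stage.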
Hi, I think that in stage 2 the unet should be initialized from the weights saved in stage 1, while the target unet and teacher unet should be initialized from Stability's SVD-XT. Am I right? The code does not seem to support this, though.
Hey @ersanliqiao.
You should load the unet and target unet from your finetuned weight and initialize the teacher unet with stability weight.
Try this at this line
from safetensors.torch import load_file
finetuned_weight = load_file("xxx.safetensors","cpu")
unet.load_state_dict(finetuned_weight)
target_unet.load_state_dict(finetuned_weight)
del finetuned_weight
thank you!!
Hi @habibian, can I ask why you are trying to train the model? I am trying to use the AnimateLCM model, but I have not yet checked whether training gives better results. Do you have any specific reason?
hi @dreamyou070
I needed to retrain AnimateLCM on a different UNet so that it runs faster than the standard SVD architecture.
Hi @G-U-N, thanks for your great open-source work
I have some questions about loss weighting when training svd-lcm (codes):
loss = torch.mean(weights) * ...
where the weights are defined here:
self.weights = (1/(self.sigmas[:-1] - self.sigmas[1:]))**0.1
This formulation seems a bit different from the expression for $\lambda_n$ in the arXiv paper: $$\lambda_n = \left(1 - \delta \frac{n}{N}\right)^{\gamma}$$
I'd like to know if the formulation used in the code is based on any reference paper or if it is just a heuristic setting.
Hey, @haohang96. Yes, I would say the choice of weights is very heuristic and hard to analyse explicitly. Most of the design choices are heuristic and probably sub-optimal.
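To get a feel for the weighting in the code, the expression `(1/(sigmas[:-1] - sigmas[1:]))**0.1` can be evaluated on an example noise schedule. The sketch below is plain Python and makes several assumptions that are not confirmed by the thread: a Karras-style descending schedule, the values `sigma_min=0.002`, `sigma_max=700.0`, `rho=7.0`, and 40 discretization points. The helper names are hypothetical.

```python
def karras_sigmas(n, sigma_min=0.002, sigma_max=700.0, rho=7.0):
    """Descending Karras-style noise levels (all values here are assumptions)."""
    max_inv = sigma_max ** (1.0 / rho)
    min_inv = sigma_min ** (1.0 / rho)
    return [(max_inv + i / (n - 1) * (min_inv - max_inv)) ** rho for i in range(n)]


def code_weights(sigmas, power=0.1):
    """Plain-Python version of (1 / (sigmas[:-1] - sigmas[1:])) ** power."""
    return [(1.0 / (hi - lo)) ** power for hi, lo in zip(sigmas[:-1], sigmas[1:])]


sigmas = karras_sigmas(40)
weights = code_weights(sigmas)
```

On this example schedule the gap between neighbouring sigmas shrinks as the noise level falls, so the weight grows towards the low-noise end, and the small exponent keeps the spread between the largest and smallest weight modest.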
@habibian Hi, have you obtained results similar to the released AnimateLCM-svd-xt? I fine-tuned the spatial transformer layers for 30k iterations, and the results appear as blurry as what you've shown above.
trainable parameters are set as follows:
unet.requires_grad_(False)
parameters_list = []
# Customize the parameters that need to be trained; if necessary, you can uncomment them yourself.
for name, para in unet.named_parameters():
    if args.training_stage == 1:
        # Stage 1: 30k iterations with only spatial transformer block tuned with learning rate 1e-6.
        # Only spatial transformer block, an 80 GB GPU should be able to train on resolution 1024x576.
        if "temporal_transformer_blocks" not in name and "transformer_blocks" in name:
            para.requires_grad = True
            parameters_list.append(para)
    elif args.training_stage == 2:
        # Stage 2: 50k iterations with only temporal transformer block tuned with learning rate 3e-7. (The temporal weights of SVD are relatively large and vulnerable.)
        # Only temporal transformer block, an 80 GB GPU should be able to train on resolution 768x448.
        if "temporal_transformer_blocks" in name:
            para.requires_grad = True
            parameters_list.append(para)
Thanks for the great work, and for releasing the training script train_svd_lcm.py. I am trying to reproduce the results using the provided train_svd_lcm.py, but after half of the training (20,000 / 50,000 iterations) I don't see any improvement in either the loss value or the generation quality (training on a single A100 on WebVid2M). Could you please confirm whether I should set the hyper-parameters as follows?
accelerate launch train_svd_lcm.py \
--pretrained_model_name_or_path=stabilityai/stable-video-diffusion-img2vid-xt \
--per_gpu_batch_size=1 --gradient_accumulation_steps=1 \
--max_train_steps=50000 \
--width=576 \
--height=320 \
--checkpointing_steps=1000 --checkpoints_total_limit=1 \
--learning_rate=1e-6 --lr_warmup_steps=1000 \
--seed=123 \
--adam_weight_decay=1e-3 \
--mixed_precision="fp16" \
--N=40 \
--validation_steps=500 \
--enable_xformers_memory_efficient_attention \
--gradient_checkpointing \
--output_dir="outputs"
In the current train_svd_lcm.py, the model is being trained at 576x320 resolution, which is much lower than the standard SVD resolution, i.e., 1024x576. Would this not cause a problem, since normal (non-LCM) SVD struggles to generate lower-resolution videos? Any input is much appreciated :)