Hey! As it currently stands, finetuning with this many sample frames is unfeasible.
Many SOTA implementations (video diffusion, Make-A-Video, etc.) sample roughly 16 frames. This is one of the reasons why the current models are trained at a resolution of 256x256, or use upscaling networks (like Imagen) to generate at 64x64 and then scale up as needed.
I would recommend either using xformers or PyTorch 2.0 to see if you can squeeze out a bit more performance while keeping the sample frames around 4-8. As the model was already trained on a fairly large domain, you should be fine using a low sample frame count, and the model will pick up on the temporal coherency.
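For reference, here's a minimal sketch of enabling either attention backend through the public diffusers API. The model ID is only an example, and in this repo the same thing is exposed through the `enable_xformers_memory_efficient_attention` flag in the training config rather than called directly:

```python
# Hedged sketch: enabling memory-efficient attention outside the training script.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
)

# Option 1: xformers memory-efficient attention (requires the xformers package).
pipe.enable_xformers_memory_efficient_attention()

# Option 2: on PyTorch 2.0+, recent diffusers versions route attention through
# torch.nn.functional.scaled_dot_product_attention by default, so upgrading
# torch (and diffusers) already gives you the fused-attention memory savings.
```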
As for the video example you've posted, I don't see why it wouldn't be able to. Maybe you could try and report back?
Hope that helps!
Hey, I'm currently running some experiments with training this model. Using Torch 2, a fresh xformers build, and 24 GB of VRAM, I can do n_sample_frames: 48 with 384x256 inputs (20x 2 sec), with trainable modules set to "attn1" and "attn2" and text_encoder training on. After getting the encoder hidden states I move the text encoder back to the CPU, and at frame preparation I also move the VAE back and forth. I'll update to the latest now and try some higher-res samples.
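For anyone wanting to replicate that offloading trick, here is a rough, self-contained sketch. It is not the repo's exact training code; it assumes the standard diffusers subfolder layout of the ModelScope weights, and the dummy frames stand in for real video data:

```python
# Hedged sketch: encode text and frames on the GPU, then push the encoders back to CPU
# so only the UNet (the part being finetuned) stays resident in VRAM.
import torch
from diffusers import AutoencoderKL
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "damo-vilab/text-to-video-ms-1.7b"  # example weights location
device = "cuda"

tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")

# 1) Encode the prompt on the GPU, then move the text encoder back to the CPU.
text_encoder.to(device)
ids = tokenizer(["a sample prompt"], padding="max_length", truncation=True,
                return_tensors="pt").input_ids
with torch.no_grad():
    encoder_hidden_states = text_encoder(ids.to(device))[0]
text_encoder.to("cpu")

# 2) Encode frames to latents on the GPU, then move the VAE back to the CPU.
frames = torch.randn(8, 3, 256, 384)  # dummy (f, c, h, w) frames in place of real video
vae.to(device)
with torch.no_grad():
    latents = vae.encode(frames.to(device)).latent_dist.sample() * vae.config.scaling_factor
vae.to("cpu")
torch.cuda.empty_cache()  # free as much VRAM as possible for the UNet forward/backward
```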
Does higher resolution need more VRAM?
After updating to Torch 2.0, I can't increase n_sample_frames higher than 24.
Here's my `--config` file:
pretrained_model_path: "\\weights\\text-to-video-ms-1.7b" # https://huggingface.co/damo-vilab/text-to-video-ms-1.7b/tree/main
output_dir: "weights"
train_text_encoder: False
train_data:
  json_path: "train_data.json"
  preprocessed: True
  width: 256
  height: 256
  sample_start_idx: 0
  sample_frame_rate: 1 # Process an image every `sample_frame_rate`
  n_sample_frames: 24
  use_random_start_idx: False
  shuffle_frames: False
  vid_data_key: "video_path"
  single_video_path: ""
  single_video_prompt: ""
validation_data:
  prompt: ""
  sample_preview: True
  num_frames: 100
  width: 256
  height: 256
  num_inference_steps: 20
  guidance_scale: 9
learning_rate: 5e-6
adam_weight_decay: 1e-2
train_batch_size: 1
max_train_steps: 50000
checkpointing_steps: 5000
validation_steps: 5000
trainable_modules:
  - "attn1"
  - "attn2"
seed: 64
mixed_precision: "fp16"
use_8bit_adam: False # This seems to be incompatible at the moment.
enable_xformers_memory_efficient_attention: True
> Hey, I'm currently running some experiments with training this model. Using Torch 2, a fresh xformers build, and 24 GB of VRAM, I can do n_sample_frames: 48 with 384x256 inputs (20x 2 sec), with trainable modules set to "attn1" and "attn2" and text_encoder training on. After getting the encoder hidden states I move the text encoder back to the CPU, and at frame preparation I also move the VAE back and forth. I'll update to the latest now and try some higher-res samples.
Can you share your --config file?
I was experimenting yesterday with an A100 80 GB, but I think something is wrong: the it/s never passed 2, and the maximum VRAM used was 41 GB. That looks too low compared to training DreamBooth on images, which can reach 15-30 it/s if I remember well. I was training HD with videos of 2 frames at 512x512, but maybe it's because I had just 10 of them? No matter what batch size I set, it doesn't speed up. I'll try again today with more videos. How could I use the GPU at maximum speed?
@sergiobr Training is going to be a bit slower due to the extra added temporal dimension.
Look at it this way. Before, we had this:
(2D UNet Latents): b c h w
Where b == batch, c == channels, h == height, w == width.
Now we have:
(3D UNet Latents): b c f h w
Where it's the same as above, but now we also have the frame information as f.
With these two in mind, we now have a temporal transformer for processing the temporal information (relations across time), so an added attention layer and a convolution layer (made up of four 3D convolution passes, with the last acting as an identity).
These two layers alone increase not only the amount of memory used but also the computation time. Remember that each transformer has two attention layers: one for self attention, and another for cross attention (the relation between image data and text).
If you increase the resolution and use two frames, that's the equivalent of running a batch size of 2 in terms of memory usage. If you increase the batch size and keep the frames the same, you could potentially be doing b * f, increasing VRAM usage. The more frames you add, the slower training will be, as there's more information to process.
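To make the scaling concrete, here is a toy shape comparison. The latent sizes are illustrative, not the model's exact values:

```python
# Toy illustration of why adding frames multiplies activation memory.
import torch

b, c, h, w = 1, 4, 32, 32                  # 2D UNet latents: b c h w (e.g. 256 / 8 = 32)
f = 24                                     # number of sampled frames

latents_2d = torch.randn(b, c, h, w)
latents_3d = torch.randn(b, c, f, h, w)    # 3D UNet latents: b c f h w

print(latents_2d.numel())  # 4,096 elements
print(latents_3d.numel())  # 98,304 elements -> 24x the activations through every
                           # spatial layer, plus the extra temporal attention and
                           # convolution layers on top.
```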
If you want to try reducing memory usage, you could finetune only the second-to-last layers on the encoder and decoder blocks, like so (see the sketch after the snippet):
trainable_modules:
  - "down_blocks.2.attentions"
  - "down_blocks.2.temp"
  - "up_blocks.2.attentions"
  - "up_blocks.2.temp"
I'm still working on memory optimizations (no guarantees, but making progress), but speed would come through either mini batching, preprocessing data (resizing videos to match input), or xformers / Scaled Dot Product Attention through Torch 2.0.
@ExponentialML Thank you very much for the detailed explanation. I'm very curious about the deep workings of neural networks and want to go deeper and understand them better.
> I was experimenting yesterday with an A100 80 GB, but I think something is wrong: the it/s never passed 2, and the maximum VRAM used was 41 GB. That looks too low compared to training DreamBooth on images, which can reach 15-30 it/s if I remember well. I was training HD with videos of 2 frames at 512x512, but maybe it's because I had just 10 of them? No matter what batch size I set, it doesn't speed up. I'll try again today with more videos. How could I use the GPU at maximum speed?
@sergiobr, Which service did you use?
Every 256x256 frame costs about 0.5 GB of VRAM. As a base, the model needs ~14 GB of VRAM:
| Frames (count) | VRAM consumption (GB) |
| --- | --- |
| 26 | 23.7 |
| 25 | 23.2 |
| 24 | 22.7 |
| 1 | 14.1 |
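As a rough rule of thumb based on the figures above (base ~14 GB plus roughly 0.4-0.5 GB per additional 256x256 frame), you can estimate VRAM before launching a run. This is only an estimate; actual usage depends on the config, resolution, and drivers:

```python
# Hedged back-of-the-envelope VRAM estimate derived from the measurements above.
def estimate_vram_gb(n_frames: int, base_gb: float = 14.1, per_frame_gb: float = 0.4) -> float:
    return base_gb + per_frame_gb * (n_frames - 1)

print(estimate_vram_gb(24))  # ~23.3 GB, close to the 22.7 GB measured above
```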
trainable_modules:
  - "down_blocks.2.attentions"
  - "down_blocks.2.temp"
  - "up_blocks.2.attentions"
  - "up_blocks.2.temp"
@sergiobr & @seel-channel I've updated the repository with gradient checkpointing support.
Can you pull the latest update and see if you get an improvement by adding `gradient_checkpointing: True` to your configs? You should see a drop in training speed but an improvement in VRAM usage.
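Under the hood this maps, roughly, onto the gradient checkpointing switch that diffusers models expose. A minimal sketch, assuming the standard ModelScope weights layout (the training script reads the YAML flag instead of calling this directly):

```python
# Hedged sketch: recompute activations during the backward pass to trade speed for VRAM.
import torch
from diffusers import UNet3DConditionModel

unet = UNet3DConditionModel.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", subfolder="unet", torch_dtype=torch.float32
)
unet.enable_gradient_checkpointing()  # what gradient_checkpointing: True turns on
```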
Tested on a 3090 Ti (24 GB VRAM):

| gradient_checkpointing | 24 frames | 50 frames | 75 frames | 100 frames |
| --- | --- | --- | --- | --- |
| True | 6 hours (13.6 GB) | 10.27 hours (20.3 GB) | 12.6 hours (23.4 GB) | out of VRAM |
| False | 5 hours (23.7 GB) | out of VRAM | out of VRAM | out of VRAM |
@seel-channel Glad to see the great improvement! 8-16 frames is more than enough context for most cases, so training with <= 16 GB VRAM should be the norm now with all attention layers unlocked. I'm not certain the model can retain that much information without using a different attention mechanism altogether for the temporal layers.
Convolution layers are still tricky to finetune, as is getting around the increased VRAM usage. I'll update the default configs to better match different use cases.
I'll keep this issue open as I think there's a bit more room to play with here, possibly for 12 GB VRAM users.
Is it a good idea to train the text encoder?
> Is it a good idea to train the text encoder?
I've tried it, but wasn't able to get good results like with the image models.
> I've tried it, but wasn't able to get good results like with the image models.
Does it improve the quality or not? Another question: what is offset noise?
> @sergiobr, Which service did you use?
I do use GCP
@seel-channel
I revisited your idea and discovered why text encoder training wasn't working for me initially. The data isn't transferred properly to the text encoder because there isn't a temporal attention layer in the CLIP encoder.
You can actually train the model on single frames / images if you only allow forward passes on the spatial layers (ones like in traditional stable diffusion), and skip the temporal ones entirely. This isn't to be confused with keeping the temporal layers frozen.
Even if you freeze them, the data is still passing through the temporal layers and won't train properly. This in turn allows for fine tuning on about 13GB of VRAM with plausible results (with the text encoder frozen). With it unfrozen, it seems to overfit very quickly.
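To make that freeze-vs-skip distinction concrete, here is a toy, self-contained PyTorch sketch. It is not this repo's code, just an illustration of why a frozen layer still transforms the data while a skipped layer does not:

```python
# Hedged toy example: freezing a block vs. skipping its forward pass entirely.
import torch
import torch.nn as nn

class ToyTemporalBlock(nn.Module):
    """Stand-in for a temporal layer with a residual connection."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.proj(x)  # residual: output shape == input shape

block = ToyTemporalBlock(dim=8)
x = torch.randn(2, 8)

# 1) Freezing: no gradients for the block's weights, but data still flows through it.
for p in block.parameters():
    p.requires_grad_(False)
frozen_out = block(x)   # x is still transformed by the (frozen) block

# 2) Skipping: the block is bypassed entirely, so spatial-only training never sees it.
skipped_out = x

print(torch.allclose(frozen_out, skipped_out))  # False: freezing != skipping
```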
Once I test it thoroughly I'll update the repository.
> @seel-channel
>
> I revisited your idea and discovered why text encoder training wasn't working for me initially. The data isn't transferred properly to the text encoder because there isn't a temporal attention layer in the CLIP encoder.
>
> You can actually train the model on single frames / images if you only allow forward passes on the spatial layers (ones like in traditional stable diffusion), and skip the temporal ones entirely. This isn't to be confused with keeping the temporal layers frozen.
>
> Even if you freeze them, the data is still passing through the temporal layers and won't train properly. This in turn allows for fine tuning on about 13 GB of VRAM with plausible results (with the text encoder frozen). With it unfrozen, it seems to overfit very quickly.
>
> Once I test it thoroughly I'll update the repository.
That's great!
> @seel-channel Glad to see the great improvement! 8-16 frames is more than enough context for most cases, so training with <= 16 GB VRAM should be the norm now with all attention layers unlocked. I'm not certain the model can retain that much information without using a different attention mechanism altogether for the temporal layers.
>
> Convolution layers are still tricky to finetune, as is getting around the increased VRAM usage. I'll update the default configs to better match different use cases.
>
> I'll keep this issue open as I think there's a bit more room to play with here, possibly for 12 GB VRAM users.
I'm trying to finetune with 16 frames, using the Tumblr TGIF dataset - I'm hoping to get rid of that tacky "Shutterstock" watermark!
What's the best config at the moment when using an A100? Thanks btw, this is great work! I'll make a PR for the TGIF dataset once I've got it working properly.
If you guys want to test the next version release (which includes image finetuning / text_encoder finetuning and VRAM optimizations), you can do so here: https://github.com/ExponentialML/Text-To-Video-Finetuning/pull/26.
Great. I'll test it. Thank you.
@ExponentialML About this:

# The rate at which your frames are sampled. 'folder' samples FPS like, 'json' and 'single_video' act as frame skip.
sample_frame_rate: 30

Is there a reason why it can't be standardized? If I understand correctly, if I don't want to skip frames when using JSON file mode, I need to set it to 0? Also, when using JSON, will the code only read the frames described with a prompt in the file, or will it use all frames for training?
> @ExponentialML About this:
>
> # The rate at which your frames are sampled. 'folder' samples FPS like, 'json' and 'single_video' act as frame skip.
> sample_frame_rate: 30
>
> Is there a reason why it can't be standardized? If I understand correctly, if I don't want to skip frames when using JSON file mode, I need to set it to 0? Also, when using JSON, will the code only read the frames described with a prompt in the file, or will it use all frames for training?
If you want to use all the frames, you should set sample_frame_rate: 1, not zero. Also, you need to describe every frame that will be used in frames_data.json; it really is every frame, so all of your frames need a description.
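For clarity, here is a small sketch of what the frame-skip behaviour amounts to. It is illustrative only, not the repo's exact dataset code, and the function name is made up:

```python
# Hedged sketch: sample_frame_rate acts as a stride over the available frames,
# and n_sample_frames caps how many of the strided frames are kept.
def sample_frame_indices(total_frames: int, n_sample_frames: int,
                         sample_start_idx: int = 0, sample_frame_rate: int = 1) -> list[int]:
    indices = list(range(sample_start_idx, total_frames, sample_frame_rate))
    return indices[:n_sample_frames]

print(sample_frame_indices(100, 8, sample_frame_rate=1))   # [0, 1, 2, ..., 7] - every frame
print(sample_frame_indices(100, 8, sample_frame_rate=30))  # [0, 30, 60, 90] - frames skipped
```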
Thank you @seel-channel, I think I got it now. Sometimes I get errors and cannot start training, with messages about tensor sizes. I've also been cutting my videos to the same number of frames; I don't know if that's needed as well.
@sergiobr Please comment on the PR so any conversations are easy to track, thanks!
I recommend just using the video dataset with captions for now, using "folder" only for quicker testing (you can test on small datasets of 5 or so).
Also, as @seel-channel said, you can just set the n_sample_frames to "1". I may separate the parameters, as they behave differently for each dataset.
I'll close this as I feel the VRAM optimizations are more than sufficient, especially with LoRA training. If this is still an issue, feel free to ping me for a re-open, or start a discussion about further optimizations.
I want to train it with n_sample_frames: 100, with 100 videos.
I'm using a 3090 Ti and the max n_sample_frames is 24
Last question: can Text-To-Video-Finetuning recreate this video (same number of frames, maintaining the camera movement)?
https://user-images.githubusercontent.com/65832922/227836408-eb86a27c-b359-4a5b-900a-7f181f795cf4.mp4