Closertodeath opened 3 weeks ago
Same problem
@yunkchen I am having the same problem too
Please use DeepSpeed
@bubbliiiing one more question
What do you mean by training at 512x512, and that the token length for a 512x512, 49-frame video is 13,312?
I am assuming you mean a 512x512 image and not the latent? Or do you mean the image/video resolution, h and w? Because given the model's patch size p = 2, (512 / 2) * (512 / 2) * 49 is not 13,312.
Does this setting allow me to train on any resolution?
enable_bucket is used to enable bucket training. When enabled, the model does not center-crop the images and videos; instead, it trains on the entire images and videos after grouping them into buckets by resolution.
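Roughly, the grouping works like this (a minimal sketch; the bucket list and helper name are made up for illustration, not the repo's actual code):

```python
# Minimal sketch of resolution-bucket assignment (illustrative only).
# Instead of center-cropping everything to one size, each sample is matched
# to the bucket whose aspect ratio is closest, then resized to that bucket.

def nearest_bucket(h, w, buckets):
    aspect = w / h
    return min(buckets, key=lambda b: abs(b[1] / b[0] - aspect))

buckets = [(512, 512), (448, 576), (576, 448), (384, 672), (672, 384)]  # assumed bucket set
print(nearest_bucket(720, 1280, buckets))  # (384, 672): closest aspect ratio to 16:9
```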
So without this setting, I am guessing it more or less crops.
training_with_video_token_length specifies training the model according to token length. The token length for a video with dimensions 512x512 and 49 frames is 13,312.
At 512x512 resolution, the number of video frames is 49;
At 768x768 resolution, the number of video frames is 21;
At 1024x1024 resolution, the number of video frames is 9;
These resolutions, combined with their corresponding frame counts, allow the model to generate videos of different sizes.
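You can sanity-check that these combinations keep the token length roughly constant (a quick script, assuming the VAE downsamples space by 8x and time by 4x, and the transformer patch size is 2):

```python
# Check that the listed resolution/frame combinations give a roughly constant
# token length (assumes VAE spatial downsample 8, temporal downsample 4, patch size 2).
for h, w, frames in [(512, 512, 49), (768, 768, 21), (1024, 1024, 9)]:
    tokens = (h // 8 // 2) * (w // 8 // 2) * ((frames - 1) // 4 + 1)
    print(f"{h}x{w}x{frames}: {tokens} tokens")
# 512x512x49: 13312 tokens
# 768x768x21: 13824 tokens
# 1024x1024x9: 12288 tokens
```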
How does image-to-video training work? I think it's effectively just the same as training text-to-video; or can it be conditioned on inputs?
> Please use DeepSpeed
@bubbliiiing @yunkchen When I use DeepSpeed, the VRAM usage increases by 10 GB, as I noted in the OP. Is there something wrong with the default zero_stage2_config?
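For reference, a typical ZeRO stage-2 config looks roughly like this (an illustrative sketch, not the repo's actual zero_stage2_config; optimizer CPU offload is one knob that usually trades speed for VRAM):

```python
# Illustrative DeepSpeed ZeRO stage-2 config (not the repo's actual defaults).
# Offloading optimizer state to CPU is a common way to reduce GPU VRAM.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

with open("zero_stage2_offload.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```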
@Closertodeath I've encountered the same issue as well. Has it been resolved?
@log26 Nope.
> > Please use DeepSpeed
>
> @bubbliiiing @yunkchen When I use DeepSpeed, the VRAM usage increases by 10 GB, as I noted in the OP. Is there something wrong with the default zero_stage2_config?
Can you show me your parameters? I haven't encountered this issue.
I didn't understand the question: "How does image-to-video training work? I think it's effectively just the same as training text-to-video; or can it be conditioned on inputs?"
The token length of 512x512x49 is calculated as follows: (512 / 8 (VAE) / 2 (patch)) × (512 / 8 (VAE) / 2 (patch)) × ((49 - 1) / 4 + 1) (VAE temporal) = 32 × 32 × 13 = 13,312.
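Spelled out step by step (the 8 is the VAE's spatial compression factor and the 4 its temporal compression, with the first frame kept; this is my reading of the formula above):

```python
# Step-by-step token length for a 512x512, 49-frame video, following the
# formula above: VAE compresses space by 8x and time by 4x (first frame kept),
# then the transformer patchifies with patch size 2.
spatial = 512 // 8 // 2           # 512 px -> 64 latents -> 32 patches per side
temporal = (49 - 1) // 4 + 1      # 49 frames -> 13 latent frames
print(spatial * spatial * temporal)  # 32 * 32 * 13 = 13312
```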
> The token length of 512x512x49 is calculated as follows: (512 / 8 (VAE) / 2 (patch)) × (512 / 8 (VAE) / 2 (patch)) × ((49 - 1) / 4 + 1) (VAE temporal) = 32 × 32 × 13 = 13,312.
Where does the 8 for the VAE come from? And the ((49 - 1) / 4 + 1) (VAE) term?
> I didn't understand the question: "How does image-to-video training work? I think it's effectively just the same as training text-to-video; or can it be conditioned on inputs?"
I guess I don't understand how image-to-video training works. For image-to-video, are the first and last images given some noise, and then the frames in between denoised?
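For what it's worth, my current guess at the general "inpaint"-style conditioning pattern (purely an assumption on my part, not confirmed for this repo): the reference frames and a mask are concatenated to the noisy latents along the channel dimension, so the denoising loop itself stays the same as text-to-video.

```python
# Sketch of one common inpaint-style image-to-video conditioning scheme
# (my assumption of the general pattern, not necessarily this repo's code).
import torch

b, c, t, h, w = 1, 16, 13, 32, 32           # latent video dims (batch, channels, frames, h, w)
noisy_latents = torch.randn(b, c, t, h, w)  # latents being denoised

ref_latents = torch.zeros(b, c, t, h, w)    # reference video, zeroed except given frames
ref_latents[:, :, 0] = torch.randn(b, c, h, w)  # encoded first frame as the condition
mask = torch.zeros(b, 1, t, h, w)           # 1 = frame is given, 0 = to be generated
mask[:, :, 0] = 1.0

# Condition by channel-concatenation; the text-to-video loss is unchanged.
model_input = torch.cat([noisy_latents, ref_latents, mask], dim=1)
print(model_input.shape)  # torch.Size([1, 33, 13, 32, 32])
```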
@bubbliiiing
Oh I get it now thank you~~~
Hello, I've been facing issues with finetuning the 5b-inpaint model on an H100. Using DeepSpeed with your provided config causes the trainer to require 90 GB of VRAM. Not using DeepSpeed drops the VRAM usage by quite a bit, but the trainer then spikes by a further 10-12+ GB every other step.

Could you please help me debug the issue? I've tried different datasets and config settings, and I find myself hitting a wall. So far this is the config that's gotten me the farthest with finetuning, but I still OOM around the 8th step. Is it not possible to finetune the 5b model on 80 GB of VRAM?
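In case it helps reproduce, this is roughly how I'm tracking the per-step spike, just the standard torch.cuda peak counters around each step (train_step here is a stand-in for the real iteration):

```python
# Log peak VRAM per step to localize the spike (train_step is a stand-in
# for the real forward/backward/optimizer iteration).
import torch

def train_step(step):
    x = torch.randn(1024, 1024, device="cuda", requires_grad=True)
    loss = (x @ x).sum()
    loss.backward()
    return loss.item()

for step in range(10):
    torch.cuda.reset_peak_memory_stats()
    train_step(step)
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"step {step}: peak {peak_gb:.2f} GB")
```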