Closed ImBadAtNames2019 closed 1 year ago
I'm using the exact same dataset that I used with the previous version of this repo, and it worked before with no problem.
I added import glob at the top of the dataset.py file and now I get this error:
33 Attention layers using Scaled Dot Product Attention.
Lora successfully injected into UNet3DConditionModel.
Lora successfully injected into CLIPTextModel.
Traceback (most recent call last):
File "/content/Text-To-Video-Finetuning/train.py", line 915, in
I'm running it on Google Colab, if that matters.
I tried adding from glob import glob instead of import glob at the top of both the dataset.py and train.py scripts, and now I get this error instead:
Lora successfully injected into UNet3DConditionModel.
Lora successfully injected into CLIPTextModel.
Caching Latents.: 0% 0/29 [00:00<?, ?it/s]
Traceback (most recent call last)
/content/Text-To-Video-Finetuning/train.py:915 in
Maybe it's because I'm not using conda? I can't make conda work on Google Colab.
Nope, I have no idea what to do; I can't use this at all now.
Now I get this error:
2023-04-09 17:03:29.790548: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py:249: FutureWarning: logging_dir is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use project_dir instead.
warnings.warn(
04/09/2023 17:03:31 - INFO - main - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16
{'variance_type'} was not found in config. Values will be initialized to default values.
/usr/local/lib/python3.9/dist-packages/transformers/modeling_utils.py:402: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(checkpoint_file, framework="pt") as f:
/usr/local/lib/python3.9/dist-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.get(instance, owner)()
/usr/local/lib/python3.9/dist-packages/torch/storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
storage = cls(wrap_storage=untyped_storage)
/usr/local/lib/python3.9/dist-packages/safetensors/torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(filename, framework="pt", device=device) as f:
{'mid_block_scale_factor', 'downsample_padding'} was not found in config. Values will be initialized to default values.
33 Attention layers using Scaled Dot Product Attention.
Lora successfully injected into UNet3DConditionModel.
Lora successfully injected into CLIPTextModel.
Caching Latents.: 0% 0/29 [00:01<?, ?it/s]
Traceback (most recent call last)
/content/Text-To-Video-Finetuning/train.py:914 in
I just pushed a quick fix. Can you check to see if it works?
I'm testing now, give me one second.
Nope, now I get this error. I'm using the default config file; I only changed the location of the model and the location of the folder containing the videos.
2023-04-09 19:42:03.849781: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py:249: FutureWarning: logging_dir is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use project_dir instead.
warnings.warn(
04/09/2023 19:42:05 - INFO - main - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16
{'variance_type'} was not found in config. Values will be initialized to default values.
/usr/local/lib/python3.9/dist-packages/transformers/modeling_utils.py:402: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(checkpoint_file, framework="pt") as f:
/usr/local/lib/python3.9/dist-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.get(instance, owner)()
/usr/local/lib/python3.9/dist-packages/torch/storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
storage = cls(wrap_storage=untyped_storage)
/usr/local/lib/python3.9/dist-packages/safetensors/torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(filename, framework="pt", device=device) as f:
{'mid_block_scale_factor', 'downsample_padding'} was not found in config. Values will be initialized to default values.
33 Attention layers using Scaled Dot Product Attention.
Lora successfully injected into UNet3DConditionModel.
Lora successfully injected into CLIPTextModel.
Caching Latents.: 0% 0/29 [00:00<?, ?it/s]
Traceback (most recent call last)
/content/Text-To-Video-Finetuning/train.py:914 in
@ImBadAtNames2019 I pushed another fix. Try again please.
I apologize for the inconvenience, as I'm unable to test at the moment, but the following fix should work.
No worries.
Testing right now.
Nope, I even tried changing videos and I still get these errors. I get 2 different errors: sometimes the one I showed you above, and sometimes this one:
2023-04-09 20:49:11.225804: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py:249: FutureWarning: logging_dir is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use project_dir instead.
warnings.warn(
04/09/2023 20:49:13 - INFO - main - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16
{'variance_type'} was not found in config. Values will be initialized to default values.
/usr/local/lib/python3.9/dist-packages/transformers/modeling_utils.py:402: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(checkpoint_file, framework="pt") as f:
/usr/local/lib/python3.9/dist-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.get(instance, owner)()
/usr/local/lib/python3.9/dist-packages/torch/storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
storage = cls(wrap_storage=untyped_storage)
/usr/local/lib/python3.9/dist-packages/safetensors/torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(filename, framework="pt", device=device) as f:
{'downsample_padding', 'mid_block_scale_factor'} was not found in config. Values will be initialized to default values.
33 Attention layers using Scaled Dot Product Attention.
Lora successfully injected into UNet3DConditionModel.
Lora successfully injected into CLIPTextModel.
Caching Latents.: 0% 0/2 [00:00<?, ?it/s]
Traceback (most recent call last)
/content/Text-To-Video-Finetuning/train.py:914 in
Maybe because I'm running it on Google Colab?
I'm not using Colab, but I encountered the following same(?) error: ZeroDivisionError: integer division or modulo by zero
In my case, I removed 'clip_path' items from the JSON file generated after preprocessing, and this allowed me to start the training successfully. I haven't finished the training yet, but it has progressed up to 1500 steps.
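The workaround above amounts to dropping every 'clip_path' entry from the JSON produced by preprocessing. A minimal sketch of doing that in bulk, assuming nothing about the JSON layout beyond the key name; the helper `strip_key` and the sample structure are hypothetical and not the preprocessor's actual schema:

```python
import json


def strip_key(node, key="clip_path"):
    """Return a copy of a parsed JSON value with every `key` entry removed."""
    if isinstance(node, dict):
        return {k: strip_key(v, key) for k, v in node.items() if k != key}
    if isinstance(node, list):
        return [strip_key(item, key) for item in node]
    return node


# Hypothetical sample; the real file comes from the preprocessing step.
sample = json.loads(
    '{"data": [{"video_path": "a.mp4", "clip_path": "clips/a_0.mp4", "num_frames": 48}]}'
)
cleaned = strip_key(sample)
print(json.dumps(cleaned, indent=2))
```

Writing `cleaned` back with `json.dump` to a new file keeps the original JSON available for comparison.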
@ImBadAtNames2019 Should be fixed now.
@bruefire Could you please post the error log if possible?
@ExponentialML
No problem. Here is the log (sorry for the ugly path):
(venv) (base) PS E:\userdata\Documents\program\project\github\Text-To-Video-Finetuning> python train.py --config .\configs\v2\my_train_config.yaml
E:\userdata\Documents\program\project\github\Text-To-Video-Finetuning\venv\lib\site-packages\torchvision\io\image.py:13: UserWarning: Failed to load image Python extension: [WinError 127] The specified procedure could not be found.
warn(f"Failed to load image Python extension: {e}")
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 1.13.1+cu117 with CUDA 1107 (you have 2.1.0.dev20230409+cu117)
Python 3.9.13 (you have 3.9.13)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
E:\userdata\Documents\program\project\github\Text-To-Video-Finetuning\venv\lib\site-packages\accelerate\accelerator.py:249: FutureWarning: logging_dir is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use project_dir instead.
warnings.warn(
E:\userdata\Documents\program\project\github\Text-To-Video-Finetuning\venv\lib\site-packages\accelerate\accelerator.py:359: UserWarning: log_with=tensorboard was passed but no supported trackers are currently installed.
warnings.warn(f"log_with={log_with} was passed but no supported trackers are currently installed.")
04/10/2023 07:02:22 - INFO - main - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16
{'variance_type'} was not found in config. Values will be initialized to default values.
E:\userdata\Documents\program\project\github\Text-To-Video-Finetuning\venv\lib\site-packages\transformers\modeling_utils.py:402: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(checkpoint_file, framework="pt") as f:
{'downsample_padding', 'mid_block_scale_factor'} was not found in config. Values will be initialized to default values.
33 Attention layers using Scaled Dot Product Attention.
Lora successfully injected into UNet3DConditionModel.
Lora successfully injected into CLIPTextModel.
Loading JSON from ./json/anime-v2.json
Caching Latents.: 1%|▏ | 40/4064 [00:07<11:48, 5.68it/s]
Traceback (most recent call last):
File "E:\userdata\Documents\program\project\github\Text-To-Video-Finetuning\train.py", line 914, in
@bruefire Interesting. Could you check to see if that specific video file is corrupt or plays at all? It seems everything goes well up until the 40th clip. If it is, I can implement some checks to ensure we can get past corrupt videos.
@ExponentialML Ok. but I have work to do, so I'll check it once I get back.
Sorry, I went off to sleep. I tested it again and got the same error (ZeroDivisionError: integer division or modulo by zero), but then I tried changing the video folder dataset for the second time, and this time it worked. So the problem now seems to be my dataset, but it's the exact same dataset that I used in the previous version, and it worked. I'm checking which specific videos are causing the problem.
I have no idea why the videos in my dataset are causing this problem; all of them are. I even tried processing the videos that work in HandBrake and DaVinci (just like I did with the videos of my dataset that are causing this problem) and everything works just fine. I don't know; I will rebuild my dataset from zero and let's see what happens.
OK, I kind of figured it out. My dataset is made of short GIFs in mp4 format at 10 fps; some of them don't even last a second. Decreasing the fps value to 10 and setting n_sample_frames to 2 in the config file fixed the issue for me. But why do I have to set it so low? If I set it higher than 2 I get the same error. How is it sampling frames? Is it sampling 1 frame every 10? Or is it sampling 2 frames one after another?
I think the problem was caused by setting it to sample more frames than there actually are, but I didn't get this error in the previous version.
I'm trying to sample more than 2 frames but it just won't let me. If I set the fps lower than 10 or n_sample_frames higher than 2 I get this error: RecursionError: maximum recursion depth exceeded in comparison
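A RecursionError here is consistent with a loader that, on finding a clip too short, calls itself again with a random index: if every clip is too short, the retries never end. A minimal sketch of a bounded alternative, assuming only the per-clip frame counts are known; `pick_usable_clip` and its parameters are hypothetical, not the repo's code:

```python
import random


def pick_usable_clip(effective_lengths, n_sample_frames, start_index, max_tries=100, rng=random):
    """Return the index of a clip with enough frames, retrying a bounded number of times."""
    index = start_index
    for _ in range(max_tries):
        if effective_lengths[index] >= n_sample_frames:
            return index
        # Clip too short: try another one at random instead of recursing.
        index = rng.randrange(len(effective_lengths))
    raise ValueError(
        f"no clip with at least {n_sample_frames} frames found after {max_tries} tries"
    )
```

With a loop and an explicit limit, a dataset made only of too-short clips fails with a readable message instead of exhausting the recursion depth.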
I'm losing my mind.
God, it's finally working. I just had to loop each video in the dataset until it reached 2 seconds in length, then I set the fps to 10 and n_sample_frames to 8.
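Looping short clips up to a minimum length, as described above, could be scripted with ffmpeg's `-stream_loop` input option. This is only a sketch: it assumes ffmpeg is installed and that each clip's duration is already known, and `build_loop_command` and the file names are hypothetical.

```python
import math


def build_loop_command(src, dst, duration_s, target_s=2.0):
    """Build an ffmpeg command that repeats `src` until it lasts `target_s` seconds."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    plays = max(1, math.ceil(target_s / duration_s))
    return [
        "ffmpeg", "-y",
        "-stream_loop", str(plays - 1),  # extra repetitions of the input
        "-i", src,
        "-t", str(target_s),             # trim the output to the target length
        dst,
    ]


cmd = build_loop_command("clip.mp4", "clip_looped.mp4", duration_s=0.8)
print(" ".join(cmd))
# Run it with subprocess.run(cmd, check=True) once ffmpeg is available.
```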
Nope, there is something wrong with the way it's sampling the frames; this is what's causing the problem. The movement in the output is completely wrong.
I checked, and it seems that there are no damaged files, including the clipped videos. But I noticed that an error occurs when there is a 'data.frame_index=n-1' item with 'num_frames=n' in the JSON.
@bruefire @ImBadAtNames2019 sorry, I think this is happening due to some assumptions in the VideoFolder dataset.
I've made it throw a clearer error if the videos are too short and also guarded against dividing by zero for low frame-rate videos.
I'm not quite sure it's solved both of your issues (as it seems like decord's VideoReader might be incorrectly reading the frame-rate of short GIFs?), but could you give this pull request a try and share your results?
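A check of the kind described here might look like the sketch below. It is not the code from the pull request: `check_video_length` and its parameters are hypothetical, and the frame count and native frame rate are assumed to come from the video reader.

```python
def check_video_length(path, num_frames, native_fps, target_fps, n_sample_frames):
    """Raise a descriptive error if a video cannot supply n_sample_frames at target_fps."""
    # Keep the stride at 1 or more so a low native frame rate cannot divide by zero.
    every_nth_frame = max(1, round(native_fps / target_fps))
    effective_length = num_frames // every_nth_frame
    if effective_length < n_sample_frames:
        raise ValueError(
            f"{path}: only {effective_length} frames available at {target_fps} fps, "
            f"but n_sample_frames={n_sample_frames}; lower n_sample_frames or use longer clips"
        )
    return effective_length
```

Running such a check once over the dataset at start-up reports every offending file by name before any training time is spent.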
I have no idea how to use a pull request. Can I just replace the lines of code modified in the "files changed" tab?
Yeah you can just replace those lines or use git:
git fetch origin pull/49/head:videofolder-fix
git checkout videofolder-fix
@JCBrouwer Thank you, but it didn't work well for me. I encountered the same ZeroDivisionError.
Yes, now it's not giving me any error, but I'm still not sure if it's sampling the frames correctly. For example, if the video is 10 fps, and I set the fps value in the config file to 10 and n_sample_frames to 4, is it going to sample 4 frames in order, one after another (without skipping any frame), from a random part of the video? And if I do the same thing again but this time set the fps value to 5, is it still going to sample 4 frames, but skipping one frame in between each time (one frame yes, one frame no)? Did I get this right?
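The behaviour asked about here can be sketched as follows, assuming a stride of round(native_fps / fps) and a random start chosen so that the whole window fits inside the video. `sample_frame_indices` is a hypothetical helper for illustration, not the repo's function.

```python
import random


def sample_frame_indices(num_frames, native_fps, fps, n_sample_frames, rng=random):
    """Pick n_sample_frames indices spaced by the fps ratio, from a random start."""
    stride = max(1, round(native_fps / fps))
    span = (n_sample_frames - 1) * stride + 1  # frames covered by the window
    if span > num_frames:
        raise ValueError("video too short for the requested sampling")
    start = rng.randint(0, num_frames - span)
    return [start + i * stride for i in range(n_sample_frames)]


rng = random.Random(0)
print(sample_frame_indices(30, 10, 10, 4, rng))  # four consecutive frames
print(sample_frame_indices(30, 10, 5, 4, rng))   # every second frame
```

Under these assumptions a 10 fps clip sampled at fps 10 yields neighbouring frames, and at fps 5 it yields every other frame, which matches the description in the question.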
@bruefire ahh you're using the JSON dataset, the fix won't affect that. Seems to me that you're somehow loading in a video that has a length of 0.
@ImBadAtNames2019, yes your description is what the video loader should be doing. The fact that you were getting this error, though, makes me a little suspicious:
  504         native_fps = vr.get_avg_fps()
  505         every_nth_frame = round(native_fps / self.fps)
  506
❱ 507         effective_length = len(vr) // every_nth_frame
  508
  509         if effective_length < self.n_sample_frames:
  510             return self.getitem(random.randint(0, len(self.video_f
ZeroDivisionError: integer division or modulo by zero
Were you trying with an fps config that was higher than your 10 fps videos? Otherwise I think maybe vr.get_avg_fps() might have been returning a wrong value.
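The arithmetic behind that question: with the stride computed as round(native_fps / self.fps), a configured fps well above the video's native rate rounds to 0, and the floor division on line 507 then fails. A small worked example, with `frame_stride` as a hypothetical stand-in for that one line:

```python
def frame_stride(native_fps, target_fps):
    """Stride computed the way the traceback shows: round(native_fps / target_fps)."""
    return round(native_fps / target_fps)


for target in (5, 10, 20, 24):
    print(f"native 10 fps, config fps {target}: stride {frame_stride(10, target)}")
```

Python rounds halves to the nearest even integer, so even a config fps of 20 against a 10 fps video gives round(0.5) == 0 here, not 1.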
In the beginning, yes, I had the fps set to the default value of 24, and the videos in the dataset were at 10 fps. But then I changed the fps value to 10 and I was still getting problems: it wasn't letting me sample more than 2 frames. So then I increased the length of the videos to 2 seconds by looping them (they were GIFs originally); after that I was able to sample 12 frames (more than that would give me errors), but the movement in the video output (after fine-tuning) was completely wrong. I wrote everything that happened above. Now I'm at 72% progress with the updated script, using 10 fps and n_sample_frames of 8; let's see if I get better results.
I tested the fine-tuned model and there is no difference compared to the stock one, as if it wasn't fine-tuned at all. My config file is below. I don't know what I'm doing wrong; I didn't have these problems with the previous version. My dataset is a folder containing 29 mp4 videos (720x720 resolution, 10 fps, 1-3 seconds long, none shorter than 1 second); each video has its own txt file (named like the video, in the same folder) containing the prompt. I'm using a 40 GB NVIDIA A100 rented on Google Colab. @JCBrouwer
# Pretrained diffusers model path.
pretrained_model_path: "/content/drive/MyDrive/models/model_scope_diffusers" #https://huggingface.co/damo-vilab/text-to-video-ms-1.7b/tree/main
# The folder where your training outputs will be placed.
output_dir: "./outputs"
# You can train multiple datasets at once. They will be joined together for training.
# Simply remove the line you don't need, or keep them all for mixed training.
# 'image': A folder of images and captions (.txt)
# 'folder': A folder of videos and captions (.txt)
# 'json': The JSON file created with automatic BLIP2 captions using https://github.com/ExponentialML/Video-BLIP2-Preprocessor
# 'single_video': A single video file.mp4 and text prompt
dataset_types:
- 'folder'
# Adds offset noise to training. See https://www.crosslabs.org/blog/diffusion-with-offset-noise
offset_noise_strength: 0.1
use_offset_noise: False
# When True, this extends all items in all enabled datasets to the highest length.
# For example, if you have 200 videos and 10 images, 10 images will be duplicated to the length of 200.
extend_dataset: False
# Caches the latents (Frames-Image -> VAE -> Latent) to a HDD or SSD.
# The latents will be saved under your training folder, and loaded automatically for training.
# This both saves memory and speeds up training and takes very little disk space.
cache_latents: True
# If you have cached latents set to `True` and have a directory of cached latents,
# you can skip the caching process and load previously saved ones.
cached_latent_dir: null #/path/to/cached_latents
# Train the text encoder. Leave at false to use LoRA only (Recommended).
train_text_encoder: False
# https://github.com/cloneofsimo/lora
# Use LoRA to train extra layers whilst saving memory. It trains both a LoRA & the model itself.
# This works slightly different than vanilla LoRA and DOES NOT save a separate file.
# It is simply used as a mechanism for saving memory by keeping layers frozen and training the residual.
# Use LoRA for the UNET model.
use_unet_lora: True
# Use LoRA for the Text Encoder.
use_text_lora: True
# The modules to use for LoRA. Different from 'trainable_modules'.
unet_lora_modules:
- "ResnetBlock2D"
# The modules to use for LoRA. Different from `trainable_text_modules`.
text_encoder_lora_modules:
- "CLIPEncoderLayer"
# The rank for LoRA training. With ModelScope, the maximum should be 1024.
# VRAM increases with higher rank, lower when decreased.
lora_rank: 16
# Training data parameters
train_data:
  # The width and height in which you want your training data to be resized to.
  width: 384
  height: 384
  # This will find the closest aspect ratio to your input width and height.
  # For example, 512x512 width and height with a video of resolution 1280x720 will be resized to 512x256
  use_bucketing: True
  # The start frame index where your videos should start (Leave this at one for json and folder based training).
  sample_start_idx: 1
  # Used for 'folder'. The rate at which your frames are sampled. Does nothing for 'json' and 'single_video' dataset.
  fps: 10
  # For 'single_video' and 'json'. The number of frames to "step" (1,2,3,4) (frame_step=2) -> (1,3,5,7, ...).
  frame_step: 5
  # The number of frames to sample. The higher this number, the higher the VRAM (acts similar to batch size).
  n_sample_frames: 8
  # 'single_video'
  single_video_path: "path/to/single/video.mp4"
  # The prompt when using a single video file
  single_video_prompt: ""
  # Fallback prompt if caption cannot be read. Enabled for 'image' and 'folder'.
  fallback_prompt: ''
  # 'folder'
  path: "/content/drive/MyDrive/Datasets/dataset_1"
  # 'json'
  json_path: 'path/to/train/json/'
  # 'image'
  image_dir: 'path/to/image/directory'
  # The prompt for all image files. Leave blank to use caption files (.txt)
  single_img_prompt: ""
# Validation data parameters.
validation_data:
  # A custom prompt that is different from your training dataset.
  prompt: "anime girl dancing"
  # Whether or not to sample preview during training (Requires more VRAM).
  sample_preview: True
  # The number of frames to sample during validation.
  num_frames: 16
  # Height and width of validation sample.
  width: 384
  height: 384
  # Number of inference steps when generating the video.
  num_inference_steps: 25
  # CFG scale
  guidance_scale: 9
# Learning rate for AdamW
learning_rate: 5e-6
# Weight decay. Higher = more regularization. Lower = closer to dataset.
adam_weight_decay: 1e-2
# Optimizer parameters for the UNET. Overrides base learning rate parameters.
extra_unet_params: null
#learning_rate: 1e-5
#adam_weight_decay: 1e-4
# Optimizer parameters for the Text Encoder. Overrides base learning rate parameters.
extra_text_encoder_params: null
#learning_rate: 5e-6
#adam_weight_decay: 0.2
# How many batches to train. Not to be confused with video frames.
train_batch_size: 1
# Maximum number of train steps. Model is saved after training.
max_train_steps: 2500
# Saves a model every nth step.
checkpointing_steps: 25000
# How many steps to do for validation if sample_preview is enabled.
validation_steps: 100
# Which modules we want to unfreeze for the UNET. Advanced usage.
trainable_modules:
# If you want to ignore temporal attention entirely, remove "attn1-2" and replace with ".attentions"
# This is for self attention. Activates for spatial and temporal dimensions if n_sample_frames > 1
- "attn1"
# This is for cross attention (image & text data). Activates for spatial and temporal dimensions if n_sample_frames > 1
- "attn2"
# Convolution networks that hold temporal information. Activates for spatial and temporal dimensions if n_sample_frames > 1
- 'temp_conv'
# Which modules we want to unfreeze for the Text Encoder. Advanced usage.
trainable_text_modules:
- "all"
# Seed for validation.
seed: 64
# Whether or not we want to use mixed precision with accelerate
mixed_precision: "fp16"
# This seems to be incompatible at the moment.
use_8bit_adam: False
# Trades VRAM usage for speed. You lose roughly 20% of training speed, but save a lot of VRAM.
# If you need to save more VRAM, it can also be enabled for the text encoder, but reduces speed x2.
gradient_checkpointing: False
text_encoder_gradient_checkpointing: False
# Xformers must be installed for best memory savings and performance (< Pytorch 2.0)
enable_xformers_memory_efficient_attention: False
# Use scaled dot product attention (Only available with >= Torch 2.0)
enable_torch_2_attn: True
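For a 'folder' dataset laid out as described before this config (each .mp4 with a same-named .txt caption next to it), a quick pre-flight check can list the videos whose caption is missing and which would therefore use `fallback_prompt`. A sketch using only the standard library; `find_missing_captions` is a hypothetical helper and the demo folder is throwaway:

```python
from pathlib import Path
import tempfile


def find_missing_captions(folder):
    """Return the .mp4 files in `folder` that have no same-named .txt caption."""
    videos = sorted(Path(folder).glob("*.mp4"))
    return [video for video in videos if not video.with_suffix(".txt").is_file()]


# Demo on a temporary folder with one captioned and one uncaptioned clip.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.mp4", "a.txt", "b.mp4"):
        (Path(tmp) / name).touch()
    missing_names = [video.name for video in find_missing_captions(tmp)]
    print(missing_names)
```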
Ok @ImBadAtNames2019 I think the issues you were running into earlier should now more clearly fail with an error about the video files being too short. Judging by when you run into errors I'd hazard a guess that your shortest video is about 1.2 seconds long.
Regarding fine-tuning not being very effective, I'd suggest trying to raise your learning rate and training for longer than 2500 steps. For me a learning rate of 1e-5 and weight_decay of 0 starts to give clearly tuned results after ~5000 steps.
Regarding the error you're running into @bruefire, it's probably going wrong in this function when the BLIP2 frame is too close to the end of the video to sample a full n_sample_frames at the frame_step.
If so, any idea what a good fix would be @ExponentialML ?
I will try fine-tuning it with 5k steps, but I doubt it will make any difference. The output of the model fine-tuned for 2500 steps is identical to that of the stock model. Are you sure my config file above is OK? Maybe I didn't configure it properly. In the previous version the output was completely different even with fewer than 2500 steps.
It's tricky, but my recommendation would be to just return 1 frame. That way it will be trained on both the text encoder and attention layers, and the full frame videos will go to the temporal dimension. If all else fails, skip into the next batch or grab a fallback frame when the dataset is instantiated.
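The fallback suggested here could look like the sketch below: sample n_sample_frames at frame_step starting from the annotated frame, and return a single frame when that window would run past the end of the video. The parameter names mirror the fields and config keys mentioned in this thread; the function `json_clip_indices` itself is hypothetical.

```python
def json_clip_indices(frame_index, num_frames, n_sample_frames, frame_step):
    """Indices to read from frame_index, or a single frame if the window does not fit."""
    if not 0 <= frame_index < num_frames:
        raise ValueError("frame_index outside the video")
    last = frame_index + (n_sample_frames - 1) * frame_step
    if last >= num_frames:
        return [frame_index]  # too close to the end: fall back to one frame
    return list(range(frame_index, last + 1, frame_step))


print(json_clip_indices(0, 48, 8, 5))   # full window of 8 frames
print(json_clip_indices(47, 48, 8, 5))  # frame_index = n-1 with num_frames = n
```

The second call reproduces the frame_index = n-1 case reported earlier and returns one frame instead of failing.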
AttributeError: 'DDPMScheduler' object has no attribute 'prediction_type' Steps: 0% 0/10000 [00:00<?, ?it/s]
That's it, I give up; this new version is making me want to jump off the balcony. I will just wait for the VideoCrafter implementation.
After I run the script with train_config.yaml, I get the error below:
2023-04-09 13:40:38.702636: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py:249: FutureWarning: logging_dir is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use project_dir instead.
warnings.warn(
04/09/2023 13:40:40 - INFO - main - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16
{'variance_type'} was not found in config. Values will be initialized to default values.
/usr/local/lib/python3.9/dist-packages/transformers/modeling_utils.py:402: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(checkpoint_file, framework="pt") as f:
/usr/local/lib/python3.9/dist-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.get(instance, owner)()
/usr/local/lib/python3.9/dist-packages/torch/storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
storage = cls(wrap_storage=untyped_storage)
/usr/local/lib/python3.9/dist-packages/safetensors/torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(filename, framework="pt", device=device) as f:
{'mid_block_scale_factor', 'downsample_padding'} was not found in config. Values will be initialized to default values.
33 Attention layers using Scaled Dot Product Attention.
Lora successfully injected into UNet3DConditionModel.
Lora successfully injected into CLIPTextModel.
Non-existant JSON path. Skipping.
Traceback (most recent call last)

/content/Text-To-Video-Finetuning/train.py:915 in

  912     parser.add_argument("--config", type=str, default="./configs/my_co
  913     args = parser.parse_args()
  914
❱ 915     main(OmegaConf.load(args.config))
  916

/content/Text-To-Video-Finetuning/train.py:582 in main

  579     )
  580
  581     # Get the training dataset based on types (json, single_video, ima
❱ 582     train_datasets = get_train_dataset(dataset_types, train_data, toke
  583
  584     # Extend datasets that are less than the greatest one. This allows
  585     attrs = ['train_data', 'frames', 'image_dir', 'video_files']

/content/Text-To-Video-Finetuning/train.py:86 in get_train_dataset

   83     for DataSet in [VideoJsonDataset, SingleVideoDataset, ImageDataset
   84         for dataset in dataset_types:
   85             if dataset == DataSet.getname():
❱  86                 train_datasets.append(DataSet(train_data, tokenizer=
   87
   88     if len(train_datasets) > 0:
   89         return train_datasets

/content/Text-To-Video-Finetuning/utils/dataset.py:487 in init

  484
  485         self.fallback_prompt = fallback_prompt
  486
❱ 487         self.video_files = glob(f"{path}/*.mp4")
  488
  489         self.width = width
  490         self.height = height
NameError: name 'glob' is not defined
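The NameError means dataset.py calls glob(...) without that name being imported. Note that import glob binds the module, so the call would have to be glob.glob(...), whereas from glob import glob binds the function that line 487 calls directly. A minimal, self-contained illustration (the temporary folder and file names exist only for the demo):

```python
import os
import tempfile
from glob import glob  # binds the function, so glob(...) is callable directly

with tempfile.TemporaryDirectory() as path:
    for name in ("a.mp4", "b.mp4", "notes.txt"):
        open(os.path.join(path, name), "w").close()
    # Same call shape as dataset.py line 487.
    video_files = sorted(os.path.basename(p) for p in glob(f"{path}/*.mp4"))

print(video_files)
```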