Could you paste the full error stack trace?
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling transformers.utils.move_cache()
.
0it [00:00, ?it/s]
Initializing the conversion map
home/miniconda3/envs/torch200/lib/python3.10/site-packages/accelerate/accelerator.py:371: UserWarning: log_with=tensorboard was passed but no supported trackers are currently installed.
  warnings.warn(f"log_with={log_with} was passed but no supported trackers are currently installed.")
01/09/2024 10:59:06 - INFO - main - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16
{'variance_type'} was not found in config. Values will be initialized to default values.
motion mask True, motion_strength True
All model checkpoint weights were used when initializing UNet3DConditionModel.
All the weights of UNet3DConditionModel were initialized from the model checkpoint at output/latent/animate_anything_512_v1.02.
If your task is similar to the task the model of the checkpoint was trained on, you can already use UNet3DConditionModel for predictions without further training.
33 Attention layers using Scaled Dot Product Attention.
Loading JSON from home/Video-BLIP2-Preprocessor/train_data/my_videos.json
Non-existant JSON path. Skipping.
Non-existant JSON path. Skipping.
Could not process extra train datasets due to an error : [Errno 2] No such file or directory: '/webvid/webvid/data/40K.json'
01/09/2024 10:59:35 - INFO - main - Running training
01/09/2024 10:59:35 - INFO - main - Num examples = 260
01/09/2024 10:59:35 - INFO - main - Num Epochs = 152
01/09/2024 10:59:35 - INFO - main - Instantaneous batch size per device = 8
01/09/2024 10:59:35 - INFO - main - Total train batch size (w. parallel, distributed & accumulation) = 8
01/09/2024 10:59:35 - INFO - main - Gradient Accumulation steps = 1
01/09/2024 10:59:35 - INFO - main - Total optimization steps = 5000
Steps: 0%| | 0/5000 [00:00<?, ?it/s]
not trainable []
4100 params have been unfrozen for training.
Steps: 1%|▉ | 32/5000 [15:28<35:07:07, 25.45s/it, lr=5e-6, step_loss=0.197]
Traceback (most recent call last):
  File "home/animate-anything/train.py", line 1188, in
I extracted 10 videos from the WebVid10M dataset to create a demo dataset, and I processed them using the Video-BLIP2-Preprocessor. It's worth mentioning that I did not specify the clip_frame_data parameter during the Video-BLIP2-Preprocessor processing, but I did specify the video_blip parameter in the animate-anything module.
After processing, I've encountered unexpected behavior during training. I'm wondering if the issue might be related to the Video-BLIP2-Preprocessor processing or if there's something else I might be overlooking.
It seems that something is wrong with your video dataset; some of the training videos may be corrupt. I suggest printing the training video paths during training to find the corrupt video.
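For example, a quick standalone check along these lines can flag unreadable files before training (a rough sketch; the JSON layout and key names are assumptions about the Video-BLIP2-Preprocessor output, not code from this repo):

```python
# Rough sketch: try to decode the first frame of every video listed in the
# preprocessor JSON and report files that cannot be read. The "data" and
# "video_path" keys are assumptions about the JSON layout.
import json
import cv2

with open("train_data/my_videos.json") as f:
    entries = json.load(f).get("data", [])

for entry in entries:
    path = entry.get("video_path", "")
    cap = cv2.VideoCapture(path)
    ok, _ = cap.read()          # fails for missing or corrupt files
    cap.release()
    if not ok:
        print("possibly corrupt or missing video:", path)
```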
I have pinpointed the underlying issue. In my case, I built a demo dataset from a restricted subset of WebVid10M containing only 10 videos. At the 32nd iteration the data loader reached the final, partial batch of the dataset, which held only the last 2 samples. That is a problem because the shape of the uncond_input tensor is tied to train_batch_size (set to 8 in the config file), so the tensor shapes no longer match: https://github.com/alibaba/animate-anything/blob/43c7e1bb4ecc79f9477edb834b45d5eb5aedeedb/train.py#L783-L784
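A minimal sketch of the failure mode (illustrative dummy tensors only; the real shapes in train.py differ):

```python
import torch

train_batch_size = 8   # from the config file
last_batch_size = 2    # samples left in the final, partial batch

# uncond_input is sized from train_batch_size, while the batch actually
# delivered by the dataloader is smaller (dummy shapes for illustration).
uncond_input = torch.zeros(train_batch_size, 77, 1024)
cond_input = torch.zeros(last_batch_size, 77, 1024)

try:
    _ = uncond_input + cond_input   # batch dims 8 vs 2 cannot broadcast
except RuntimeError as e:
    print("tensor shape mismatch:", e)
```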
To fix this, I set the drop_last parameter of the train_dataloader to True, which resolved the issue in my case: https://github.com/alibaba/animate-anything/blob/43c7e1bb4ecc79f9477edb834b45d5eb5aedeedb/train.py#L666-L671
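For reference, a self-contained sketch of the change (a dummy dataset stands in for the real train_dataset, which comes from the YAML config):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

train_batch_size = 8
train_dataset = TensorDataset(torch.arange(260))   # 260 dummy samples, like "Num examples = 260"

train_dataloader = DataLoader(
    train_dataset,
    batch_size=train_batch_size,
    shuffle=True,
    drop_last=True,   # discard the final, partial batch so every batch matches train_batch_size
)

for (batch,) in train_dataloader:
    assert batch.shape[0] == train_batch_size   # a partial batch is never seen
```

With drop_last=True the leftover samples are simply skipped in a given epoch, which is harmless here because the data is reshuffled into new full batches on the next epoch.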
Thank you for your contributions to the AIGC community. I've encountered an issue while training with the train_mask_motion.yaml configuration file. I modified the training and testing datasets in the configuration file and started training with the command:
python train.py --config ./example/train_mask_motion.yaml
However, after training for 32 iterations, I encountered the following error in https://github.com/alibaba/animate-anything/blob/9e6098abcea894155eaab17c1f5573d0d11c3410/models/unet_3d_blocks.py#L41-L52. I find it puzzling that a tensor shape mismatch error occurs midway through training. I would appreciate any insights or guidance you can provide to help me understand and resolve this issue.
Thank you once again for your assistance!