D-Mad opened this issue 1 month ago
./train_text_to_video_lora.sh
Running command: accelerate launch --config_file accelerate_configs/uncompiled_2.yaml --gpu_ids 0,1 training/cogvideox_text_to_video_lora.py --pretrained_model_name_or_path THUDM/CogVideoX-5b --data_root /home/dev_ml/cogvideox-factory/video-dataset-disney --caption_column prompt.txt --video_column videos.txt --id_token BW_STYLE --height_buckets 480 --width_buckets 720 --frame_buckets 49 --dataloader_num_workers 8 --pin_memory --validation_prompt "BW_STYLE A black and white animated scene unfolds with an anthropomorphic goat surrounded by musical notes and symbols, suggesting a playful environment. Mickey Mouse appears, leaning forward in curiosity as the goat remains still. The goat then engages with Mickey, who bends down to converse or react. The dynamics shift as Mickey grabs the goat, potentially in surprise or playfulness, amidst a minimalistic background. The scene captures the evolving relationship between the two characters in a whimsical, animated setting, emphasizing their interactions and emotions:::BW_STYLE A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance" --validation_prompt_separator ::: --num_validation_videos 1 --validation_epochs 10 --seed 42 --rank 128 --lora_alpha 128 --mixed_precision bf16 --output_dir /home/dev_ml/cogvideox-factory/cogvideox-loraoptimizer_adam__steps_3000lr-schedule_cosine_with_restarts__learning-rate_1e-4/ --max_num_frames 49 --train_batch_size 1 --max_train_steps 3000 --checkpointing_steps 1000 --gradient_accumulation_steps 1 --gradient_checkpointing --learning_rate 1e-4 --lr_scheduler cosine_with_restarts --lr_warmup_steps 400 --lr_num_cycles 1 --enable_slicing --enable_tiling --optimizer adam --beta1 0.9 --beta2 0.95 --weight_decay 0.001 --max_grad_norm 1.0 --allow_tf32 --enable_model_cpu_offload --report_to wandb --nccl_timeout 1800
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Downloading shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 8858.09it/s]
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 10686.12it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00, 5.47s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00, 5.46s/it]
Fetching 2 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 3232.60it/s]
{'use_learned_positional_embeddings'} was not found in config. Values will be initialized to default values.
Fetching 2 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 3766.78it/s]
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Tracking run with wandb version 0.18.3
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
===== Memory before training =====
memory_allocated=20.153 GB
max_memory_allocated=20.153 GB
max_memory_reserved=20.514 GB
Running training
Num trainable parameters = 132120576
Num examples = 69
Num epochs = 44
Instantaneous batch size per device = 1
Total train batch size (w. parallel, distributed & accumulation) = 2
Gradient accumulation steps = 1
Total optimization steps = 3000
Steps: 0%| | 0/3000 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/dev_ml/cogvideox-factory/training/cogvideox_text_to_video_lora.py", line 924, in
Failures:
Have you run prepare_dataset.py before running training? If you don't run it, it is not possible to train in under 24 GB, because you end up loading the text encoder and VAE, and VAE encode/decode can take an additional ~5 GB on top of the model weights.
If you prepare the dataset by precomputing latents and prompt embeddings first, you should be able to reproduce the memory numbers we report.
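For reference, the gist of the precomputation is: encode each prompt once with the T5 text encoder and each video once with the VAE, save the resulting tensors, and train purely from that cache so neither model has to be loaded during training. A minimal sketch of the idea (this is not the repo's prepare_dataset.py; the 226-token max length, the saved-tensor layout, and the `cache_example` helper are illustrative assumptions):

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel
from diffusers import AutoencoderKLCogVideoX

model_id = "THUDM/CogVideoX-5b"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = T5EncoderModel.from_pretrained(
    model_id, subfolder="text_encoder", torch_dtype=torch.bfloat16
).to(device)
vae = AutoencoderKLCogVideoX.from_pretrained(
    model_id, subfolder="vae", torch_dtype=torch.bfloat16
).to(device)
vae.enable_slicing()  # same memory-saving flags as in the training command
vae.enable_tiling()

@torch.no_grad()
def cache_example(prompt: str, video: torch.Tensor, out_path: str):
    # video: [1, C, F, H, W] in [-1, 1], e.g. [1, 3, 49, 480, 720]
    tokens = tokenizer(
        prompt, padding="max_length", max_length=226,
        truncation=True, return_tensors="pt",
    ).to(device)
    prompt_embeds = text_encoder(tokens.input_ids)[0]  # last hidden state
    latents = vae.encode(video.to(device, dtype=vae.dtype)).latent_dist.sample()
    # scale latents as the diffusion model expects (whether the real script does
    # this at caching time or at load time is a detail to check)
    latents = latents * vae.config.scaling_factor
    torch.save({"prompt_embeds": prompt_embeds.cpu(), "latents": latents.cpu()}, out_path)
```

With the embeddings and latents cached on disk, the training step only needs the transformer plus the LoRA parameters in GPU memory.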
Hi @a-r-r-o-w,
I also ran into OOM while finetuning the I2V model. However, I hit the OOM even when running prepare_dataset.py on a 24 GB VRAM GPU: 18.5 GB was already allocated after moving the T5 text encoder to the device, which doesn't seem reasonable to me. How can I fit it on my device? Thanks a lot!
Did it happen in RAM or in VRAM?
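For what it's worth, ~18.5 GB is roughly what the T5-XXL encoder's ~4.7B parameters occupy in fp32 (4 bytes per parameter ≈ 19 GB), so the number itself is expected if the encoder is loaded in full precision; loading it in bf16 roughly halves that. A minimal sketch of loading just the encoder in bf16 with the standard transformers API (whether prepare_dataset.py already does this is an assumption to verify):

```python
import torch
from transformers import T5EncoderModel

# ~4.7B params: ~19 GB in fp32, ~9.5 GB in bf16
text_encoder = T5EncoderModel.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="text_encoder", torch_dtype=torch.bfloat16
).to("cuda")

# ... run the prompt-encoding pass, then free it before loading the VAE:
del text_encoder
torch.cuda.empty_cache()
```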
System Info
CUDA 11.8, 2x RTX 3090, Ubuntu 22.04 LTS, PyTorch 2.4
Information
Reproduction
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/dev_ml/cogvideox-factory/wandb/offline-run-20241011_154425-t76nveyh
wandb: Find logs at: wandb/offline-run-20241011_154425-t76nveyh/logs
[rank0]:I1011 15:44:57.956000 124307873129088 torch/_dynamo/utils.py:335] TorchDynamo compilation metrics:
[rank0]:I1011 15:44:57.956000 124307873129088 torch/_dynamo/utils.py:335] Function, Runtimes (s)
[rank0]:V1011 15:44:57.956000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats constrain_symbol_range: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.956000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats evaluate_expr: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _simplify_floor_div: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _maybe_guard_rel: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _find: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats has_hint: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats size_hint: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats simplify: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _update_divisible: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats replace: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _maybe_evaluate_static: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.958000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats get_implications: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.958000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats get_axioms: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.958000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats safe_expand: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
[rank0]:V1011 15:44:57.958000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats uninteresting_files: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
W1011 15:45:01.515000 129677780091520 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 177223 closing signal SIGTERM
E1011 15:45:02.282000 129677780091520 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 177222) of binary: /home/dev_ml/cogvideox-factory/venv/bin/python3.10
Traceback (most recent call last):
File "/home/dev_ml/cogvideox-factory/venv/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
multi_gpu_launcher(args)
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
distrib_run.run(args)
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
training/cogvideox_text_to_video_lora.py FAILED
Failures: