D-Mad opened this issue 1 month ago
./train_text_to_video_lora.sh
Running command: accelerate launch --config_file accelerate_configs/uncompiled_2.yaml --gpu_ids 0,1 training/cogvideox_text_to_video_lora.py --pretrained_model_name_or_path THUDM/CogVideoX-5b --data_root /home/dev_ml/cogvideox-factory/video-dataset-disney --caption_column prompt.txt --video_column videos.txt --id_token BW_STYLE --height_buckets 480 --width_buckets 720 --frame_buckets 49 --dataloader_num_workers 8 --pin_memory --validation_prompt "BW_STYLE A black and white animated scene unfolds with an anthropomorphic goat surrounded by musical notes and symbols, suggesting a playful environment. Mickey Mouse appears, leaning forward in curiosity as the goat remains still. The goat then engages with Mickey, who bends down to converse or react. The dynamics shift as Mickey grabs the goat, potentially in surprise or playfulness, amidst a minimalistic background. The scene captures the evolving relationship between the two characters in a whimsical, animated setting, emphasizing their interactions and emotions:::BW_STYLE A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance" --validation_prompt_separator ::: --num_validation_videos 1 --validation_epochs 10 --seed 42 --rank 128 --lora_alpha 128 --mixed_precision bf16 --output_dir /home/dev_ml/cogvideox-factory/cogvideox-loraoptimizer_adam__steps_3000lr-schedule_cosine_with_restarts__learning-rate_1e-4/ --max_num_frames 49 --train_batch_size 1 --max_train_steps 3000 --checkpointing_steps 1000 --gradient_accumulation_steps 1 --gradient_checkpointing --learning_rate 1e-4 --lr_scheduler cosine_with_restarts --lr_warmup_steps 400 --lr_num_cycles 1 --enable_slicing --enable_tiling --optimizer adam --beta1 0.9 --beta2 0.95 --weight_decay 0.001 --max_grad_norm 1.0 --allow_tf32 --enable_model_cpu_offload --report_to wandb --nccl_timeout 1800
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Downloading shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 8858.09it/s]
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 10686.12it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00, 5.47s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00, 5.46s/it]
Fetching 2 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 3232.60it/s]
{'use_learned_positional_embeddings'} was not found in config. Values will be initialized to default values.
Fetching 2 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 3766.78it/s]
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Tracking run with wandb version 0.18.3
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
===== Memory before training =====
memory_allocated=20.153 GB
max_memory_allocated=20.153 GB
max_memory_reserved=20.514 GB
Running training
Num trainable parameters = 132120576
Num examples = 69
Num epochs = 44
Instantaneous batch size per device = 1
Total train batch size (w. parallel, distributed & accumulation) = 2
Gradient accumulation steps = 1
Total optimization steps = 3000
Steps: 0%| | 0/3000 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/dev_ml/cogvideox-factory/training/cogvideox_text_to_video_lora.py", line 924, in
Failures:
Have you run prepare_dataset.py before running training? If you don't run it, it is not possible to train in under 24 GB, because you end up loading the text encoder and VAE, and VAE encode/decode can take an additional ~5 GB on top of the model weights.
If you prepare the dataset by precomputing latents and prompt embeddings first, you should be able to reproduce the memory numbers we report.
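For reference, the gist of the precomputation is: encode each prompt once with the T5 text encoder and each video once with the VAE, save the resulting tensors, and train purely from that cache so neither model has to be loaded during training. A minimal sketch of the idea (this is not the repo's prepare_dataset.py; the 226-token max length, the saved-tensor layout, and the `cache_example` helper are illustrative assumptions):

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel
from diffusers import AutoencoderKLCogVideoX

model_id = "THUDM/CogVideoX-5b"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = T5EncoderModel.from_pretrained(
    model_id, subfolder="text_encoder", torch_dtype=torch.bfloat16
).to(device)
vae = AutoencoderKLCogVideoX.from_pretrained(
    model_id, subfolder="vae", torch_dtype=torch.bfloat16
).to(device)
vae.enable_slicing()  # same memory-saving flags as in the training command
vae.enable_tiling()

@torch.no_grad()
def cache_example(prompt: str, video: torch.Tensor, out_path: str):
    # video: [1, C, F, H, W] in [-1, 1], e.g. [1, 3, 49, 480, 720]
    tokens = tokenizer(
        prompt, padding="max_length", max_length=226,
        truncation=True, return_tensors="pt",
    ).to(device)
    prompt_embeds = text_encoder(tokens.input_ids)[0]  # last hidden state
    latents = vae.encode(video.to(device, dtype=vae.dtype)).latent_dist.sample()
    # scale latents as the diffusion model expects (whether the real script does
    # this at caching time or at load time is a detail to check)
    latents = latents * vae.config.scaling_factor
    torch.save({"prompt_embeds": prompt_embeds.cpu(), "latents": latents.cpu()}, out_path)
```

With the embeddings and latents cached on disk, the training step only needs the transformer plus the LoRA parameters in GPU memory.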
Hi @a-r-r-o-w,
I also ran into OOM while finetuning the I2V model. However, I hit the OOM even when running prepare_dataset.py on a 24 GB VRAM GPU: 18.5 GB was already allocated after moving the T5 text encoder to the device, which doesn't seem reasonable to me. How can I fit it on my device? Thanks a lot!
Did it happen in RAM or in VRAM?
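For what it's worth, ~18.5 GB is roughly what the T5-XXL encoder's ~4.7B parameters occupy in fp32 (4 bytes per parameter ≈ 19 GB), so the number itself is expected if the encoder is loaded in full precision; loading it in bf16 roughly halves that. A minimal sketch of loading just the encoder in bf16 with the standard transformers API (whether prepare_dataset.py already does this is an assumption to verify):

```python
import torch
from transformers import T5EncoderModel

# ~4.7B params: ~19 GB in fp32, ~9.5 GB in bf16
text_encoder = T5EncoderModel.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="text_encoder", torch_dtype=torch.bfloat16
).to("cuda")

# ... run the prompt-encoding pass, then free it before loading the VAE:
del text_encoder
torch.cuda.empty_cache()
```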
System Info
CUDA 11.8, 2x RTX 3090, Ubuntu 22.04 LTS, PyTorch 2.4
Information
Reproduction
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/dev_ml/cogvideox-factory/wandb/offline-run-20241011_154425-t76nveyh
wandb: Find logs at: wandb/offline-run-20241011_154425-t76nveyh/logs
[rank0]:I1011 15:44:57.956000 124307873129088 torch/_dynamo/utils.py:335] TorchDynamo compilation metrics:
[rank0]:I1011 15:44:57.956000 124307873129088 torch/_dynamo/utils.py:335] Function, Runtimes (s)
[rank0]:V1011 15:44:57.956000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats constrain_symbol_range: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.956000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats evaluate_expr: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _simplify_floor_div: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _maybe_guard_rel: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _find: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats has_hint: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats size_hint: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats simplify: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _update_divisible: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats replace: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _maybe_evaluate_static: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.958000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats get_implications: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.958000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats get_axioms: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.958000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats safe_expand: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
[rank0]:V1011 15:44:57.958000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats uninteresting_files: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
W1011 15:45:01.515000 129677780091520 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 177223 closing signal SIGTERM
E1011 15:45:02.282000 129677780091520 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 177222) of binary: /home/dev_ml/cogvideox-factory/venv/bin/python3.10
Traceback (most recent call last):
File "/home/dev_ml/cogvideox-factory/venv/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
multi_gpu_launcher(args)
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
distrib_run.run(args)
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
training/cogvideox_text_to_video_lora.py FAILED
Failures: