a-r-r-o-w / cogvideox-factory

Memory optimized finetuning scripts for CogVideoX & Mochi using TorchAO and DeepSpeed
Apache License 2.0

ncclRemoteError: multi-GPU fine-tuning fails with a communication error #16

Closed · glide-the closed this issue 1 month ago

glide-the commented 1 month ago

GPU count: 8
GPU type: NVIDIA A800-SXM4-80GB

NCCL debug settings

export NCCL_DEBUG=INFO
export NCCL_IB_TIMEOUT=1000

nm04-a800-node083:1235788:1236546 [4] transport/net_ib.cc:1698 NCCL WARN NET/IB : Got completion from peer 10.76.228.50<46608> with status=12 opcode=129 len=47104 vendor err 129 (Recv) localGid fe80::966d:aeff:fec6:c6c2 remoteGidsfe80::966d:aeff:fec6:a34a
nm04-a800-node083:1235788:1236546 [4] NCCL INFO transport/net.cc:1298 -> 6
nm04-a800-node083:1235788:1236546 [4] NCCL INFO proxy.cc:694 -> 6
nm04-a800-node083:1235788:1236546 [4] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]

nm04-a800-node083:1235788:1236546 [4] transport/net_ib.cc:1698 NCCL WARN NET/IB : Got completion from peer 10.76.228.50<46608> with status=5 opcode=129 len=4 vendor err 244 (Recv) localGid fe80::966d:aeff:fec6:c6c2 remoteGidsfe80::966d:aeff:fec6:a34a
nm04-a800-node083:1235788:1236546 [4] NCCL INFO transport/net.cc:1298 -> 6
nm04-a800-node083:1235788:1236546 [4] NCCL INFO proxy.cc:694 -> 6
nm04-a800-node083:1235788:1236546 [4] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]

nm04-a800-node083:1235784:1236544 [0] transport/net_ib.cc:1698 NCCL WARN NET/IB : Got completion from peer 10.76.228.50<36502> with status=12 opcode=129 len=47104 vendor err 129 (Recv) localGid fe80::966d:aeff:fec6:a34a remoteGidsfe80::966d:aeff:fec6:c6c2
nm04-a800-node083:1235784:1236544 [0] NCCL INFO transport/net.cc:1298 -> 6
nm04-a800-node083:1235784:1236544 [0] NCCL INFO proxy.cc:694 -> 6
nm04-a800-node083:1235784:1236544 [0] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]

nm04-a800-node083:1235784:1236544 [0] transport/net_ib.cc:1698 NCCL WARN NET/IB : Got completion from peer 10.76.228.50<36502> with status=5 opcode=129 len=0 vendor err 244 (Recv) localGid fe80::966d:aeff:fec6:a34a remoteGidsfe80::966d:aeff:fec6:c6c2
nm04-a800-node083:1235784:1236544 [0] NCCL INFO transport/net.cc:1298 -> 6
nm04-a800-node083:1235784:1236544 [0] NCCL INFO proxy.cc:694 -> 6
nm04-a800-node083:1235784:1236544 [0] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]
[rank4]:[E1010 17:45:20.898486666 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 4] Exception (either an error or timeout) detected by watchdog at work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.

nm04-a800-node083:1235784:1236331 [0] NCCL INFO comm 0x11b8cc420 rank 0 nranks 8 cudaDev 0 busId 23000 - Abort COMPLETE
[rank0]:[E1010 17:45:21.480754615 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E1010 17:45:21.480764067 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E1010 17:45:21.480804680 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.20.5
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
NET/IB : Got completion from peer 10.76.228.50<36502> with status=5 opcode=129 len=0 vendor err 244 (Recv) localGid fe80::966d:aeff:fec6:a34a remoteGidsfe80::966d:aeff:fec6:c6c2
Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1892 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f4559f77f86 in /mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x220 (0x7f450c1ca1e0 in /mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7c (0x7f450c1ca42c in /mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x213 (0x7f450c1d1313 in /mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f450c1d371c in /mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xdbbf4 (0x7f455b867bf4 in /mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/bin/../lib/libstdc++.so.6)
frame #6: <unknown function> + 0x94ac3 (0x7f455f1e6ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #7: <unknown function> + 0x126850 (0x7f455f278850 in /lib/x86_64-linux-gnu/libc.so.6)

Training script


#!/bin/bash
export NCCL_DEBUG=INFO
export NCCL_IB_TIMEOUT=1000

export TORCH_LOGS="+dynamo,recompiles,graph_breaks"
export TORCHDYNAMO_VERBOSE=1
export WANDB_MODE="online"
export NCCL_P2P_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0
export WANDB_API_KEY=

GPU_IDS="0,1,2,3,4,5,6,7"

# Training Configurations
# Experiment with as many hyperparameters as you want!
LEARNING_RATES=("1e-3")
LR_SCHEDULES=("cosine_with_restarts")
OPTIMIZERS=("adamw")
MAX_TRAIN_STEPS=("318000") 

# Multi-GPU uncompiled training
ACCELERATE_CONFIG_FILE="accelerate_configs/uncompiled_2.yaml"

# Absolute path to where the data is located. Make sure to have read the README for how to prepare data.
# This example assumes you downloaded an already prepared dataset from HF CLI as follows:
#   huggingface-cli download --repo-type dataset Wild-Heart/Disney-VideoGeneration-Dataset --local-dir /path/to/my/datasets/disney-dataset
DATA_ROOT="/mnt/ceph/develop/jiawei/lora_dataset/Dance-VideoGeneration-Dataset-encoded-2048"
CAPTION_COLUMN="prompts.txt"
VIDEO_COLUMN="videos.txt"

# Launch experiments with different hyperparameters
for learning_rate in "${LEARNING_RATES[@]}"; do
  for lr_schedule in "${LR_SCHEDULES[@]}"; do
    for optimizer in "${OPTIMIZERS[@]}"; do
      for steps in "${MAX_TRAIN_STEPS[@]}"; do
        output_dir="/mnt/ceph/develop/jiawei/model_checkpoint/cogvideox-lora_t2v_train1000_optimizer_${optimizer}__steps_${steps}__lr-schedule_${lr_schedule}__learning-rate_${learning_rate}/"

        cmd="accelerate launch  --gpu_ids $GPU_IDS  --config_file $ACCELERATE_CONFIG_FILE training/cogvideox_text_to_video_lora.py \
          --pretrained_model_name_or_path /mnt/ceph/develop/jiawei/model_checkpoint/CogVideoX-2b-base \
          --load_tensors \
          --video_reshape_mode center \
          --data_root $DATA_ROOT \
          --caption_column $CAPTION_COLUMN \
          --video_column $VIDEO_COLUMN \
          --height_buckets 480 \
          --width_buckets 720 \
          --frame_buckets 49 \
          --dataloader_num_workers 8 \
          --pin_memory \
          --id_token \"奶糖,\" \
          --validation_prompt \"奶糖, A young girl in a white blouse and navy skirt stands in a sunlit park, smiling and holding up two fingers. She's surrounded by trees and a pathway, with dappled sunlight casting shadows. A young woman in a school uniform stands on a tree-lined path, surprised, with hands raised. In the park, a woman in a white blouse with a navy collar raises her hands in a playful 'V' shape, surrounded by lush greenery and sunlight.:::奶糖, A young woman with long dark hair tied into ponytails stands in a cozy, warmly lit room, smiling gently at the camera. She takes a selfie, her hair styled in loose waves, with a playful expression. The background is a plain, light-colored wall, emphasizing her features.\" \
          --validation_prompt_separator ::: \
          --num_validation_videos 1 \
          --validation_epochs 10 \
          --seed 42 \
          --rank 128 \
          --lora_alpha 1 \
          --mixed_precision bf16 \
          --output_dir $output_dir \
          --max_num_frames 49 \
          --train_batch_size 1 \
          --max_train_steps $steps \
          --checkpointing_steps 1000 \
          --gradient_accumulation_steps 1 \
          --gradient_checkpointing \
          --learning_rate $learning_rate \
          --lr_scheduler $lr_schedule \
          --lr_warmup_steps 400 \
          --lr_num_cycles 1 \
          --enable_slicing \
          --enable_tiling \
          --optimizer $optimizer \
          --beta1 0.9 \
          --beta2 0.95 \
          --weight_decay 0.001 \
          --max_grad_norm 1.0 \
          --allow_tf32 \
          --resume_from_checkpoint latest \
          --report_to wandb \
          --tracker_name cogvideox-lora_t2v_train1000_optimizer_${optimizer}__steps_${steps}__lr-schedule_${lr_schedule}__learning-rate_${learning_rate} \
          --nccl_timeout 10000"

        echo "Running command: $cmd"
        eval $cmd
        echo -ne "-------------------- Finished executing script --------------------\n\n"
      done
    done
  done
done

uncompiled_2.yaml


compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: 0,1,2,3,4,5,6,7
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
glide-the commented 1 month ago

The same data fine-tunes without issue using this diffusers example script: https://github.com/huggingface/diffusers/tree/main/examples/cogvideo

#!/bin/bash

export MODEL_PATH="/mnt/ceph/develop/jiawei/model_checkpoint/CogVideoX-5b-I2V"
export CACHE_PATH="$HOME/.cache"  # a quoted "~" would not expand
export OUTPUT_PATH="/mnt/ceph/develop/jiawei/model_checkpoint/hf_cogvideox_imglora_test"
export VAL_IMAGE1="/mnt/ceph/develop/jiawei/diffusers_fork_zmf/examples/cogvideo/frame0.jpg"
export VAL_IMAGE2="/mnt/ceph/develop/jiawei/diffusers_fork_zmf/examples/cogvideo/frame30.jpg"
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
GPU_IDS="0,1,2,3,4,5,6,7"
WANDB_PROJECT=DiffUsers_CogVideoX_IMAGE_test

# If you are not using 8 GPUs, change num_processes in the accelerate config to match your GPU count
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True WANDB_API_KEY= accelerate launch --main_process_port 29501 --gpu_ids $GPU_IDS   --config_file /mnt/ceph/develop/jiawei/diffusers_fork_zmf/examples/cogvideo/lora_image_k7.yaml  \
  train_cogvideox_image_to_video_lora.py \
  --gradient_checkpointing \
  --pretrained_model_name_or_path $MODEL_PATH \
  --cache_dir $CACHE_PATH \
  --enable_tiling \
  --enable_slicing \
  --instance_data_root /mnt/ceph/develop/jiawei/lora_dataset/Dance-VideoGeneration-Dataset \
  --caption_column prompts.txt \
  --video_column videos.txt \
  --id_token 奶糖, \
  --validation_prompt "奶糖, A young girl in a white blouse and navy skirt stands in a sunlit park, smiling and holding up two fingers. She's surrounded by trees and a pathway, with dappled sunlight casting shadows. A young woman in a school uniform stands on a tree-lined path, surprised, with hands raised. In the park, a woman in a white blouse with a navy collar raises her hands in a playful 'V' shape, surrounded by lush greenery and sunlight.:::奶糖, A young woman with long dark hair tied into ponytails stands in a cozy, warmly lit room, smiling gently at the camera. She takes a selfie, her hair styled in loose waves, with a playful expression. The background is a plain, light-colored wall, emphasizing her features." \
  --validation_images  "$VAL_IMAGE1:::$VAL_IMAGE2" \
  --validation_prompt_separator ::: \
  --num_validation_videos 1 \
  --validation_epochs 5 \
  --seed 42 \
  --rank 128 \
  --lora_alpha 32 \
  --mixed_precision bf16 \
  --output_dir $OUTPUT_PATH \
  --height 480 \
  --width 720 \
  --fps 8 \
  --max_num_frames 49 \
  --skip_frames_start 0 \
  --skip_frames_end 0 \
  --train_batch_size 1 \
  --num_train_epochs 150 \
  --checkpointing_steps 1000 \
  --gradient_accumulation_steps 1 \
  --learning_rate 1e-3 \
  --lr_scheduler cosine_with_restarts \
  --lr_warmup_steps 200 \
  --lr_num_cycles 1 \
  --enable_slicing \
  --enable_tiling \
  --gradient_checkpointing \
  --optimizer AdamW \
  --adam_beta1 0.9 \
  --adam_beta2 0.95 \
  --max_grad_norm 1.0 \
  --resume_from_checkpoint latest \
  --report_to wandb --tracker_name $WANDB_PROJECT

lora_image_k7.yaml

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: 0,1,2,3,5,6,7
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 7
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
a-r-r-o-w commented 1 month ago

Does the error happen during validation/testing? If so, it might be because of a low NCCL timeout. You could increase it during `Accelerator` initialization using `--nccl_timeout 1800`. From a quick look at the codebase, I don't think the timeout environment variables are considered by accelerate (so you need to set the timeout using `InitProcessGroupKwargs()`).

Might be relevant: this and this.
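
For reference, the suggestion above amounts to passing a longer timeout to the process group when the `Accelerator` is created. A minimal sketch, assuming the 1800-second value suggested above and leaving out the other `Accelerator` arguments the training script passes:

from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Raise the NCCL collective timeout so long-running steps (e.g. validation)
# do not trip the process-group watchdog.
init_kwargs = InitProcessGroupKwargs(backend="nccl", timeout=timedelta(seconds=1800))
accelerator = Accelerator(kwargs_handlers=[init_kwargs])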

glide-the commented 1 month ago

No, this happens at the beginning of training. I raised the timeout to 100000 in the code, but other than waiting a long time at the start, nothing changed:

from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import (
    DistributedDataParallelKwargs,
    InitProcessGroupKwargs,
    ProjectConfiguration,
)

accelerator_project_config = ProjectConfiguration(project_dir=args.output_dir, logging_dir=logging_dir)
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
init_process_group_kwargs = InitProcessGroupKwargs(backend="nccl", timeout=timedelta(seconds=1000000))
accelerator = Accelerator(
    gradient_accumulation_steps=args.gradient_accumulation_steps,
    mixed_precision=args.mixed_precision,
    log_with=args.report_to,
    project_config=accelerator_project_config,
    kwargs_handlers=[ddp_kwargs, init_process_group_kwargs],
)
Running command: accelerate launch --gpu_ids 0,1,2,3,4,5,6,7 --config_file accelerate_configs/uncompiled_2.yaml training/cogvideox_text_to_video_lora.py \
  --pretrained_model_name_or_path /mnt/ceph/develop/jiawei/model_checkpoint/CogVideoX-2b-base \
  --data_root /mnt/ceph/develop/jiawei/lora_dataset/Dance-VideoGeneration-Dataset-encoded-2048 \
  --caption_column prompts.txt \
  --video_column videos.txt \
  --load_tensors \
  --video_reshape_mode center \
  --height_buckets 480 \
  --width_buckets 720 \
  --frame_buckets 49 \
  --dataloader_num_workers 8 \
  --pin_memory \
  --id_token "奶糖," \
  --validation_prompt "奶糖, A young girl in a white blouse and navy skirt stands in a sunlit park, smiling and holding up two fingers. She's surrounded by trees and a pathway, with dappled sunlight casting shadows. A young woman in a school uniform stands on a tree-lined path, surprised, with hands raised. In the park, a woman in a white blouse with a navy collar raises her hands in a playful 'V' shape, surrounded by lush greenery and sunlight.:::奶糖, A young woman with long dark hair tied into ponytails stands in a cozy, warmly lit room, smiling gently at the camera. She takes a selfie, her hair styled in loose waves, with a playful expression. The background is a plain, light-colored wall, emphasizing her features." \
  --validation_prompt_separator ::: \
  --num_validation_videos 1 \
  --validation_epochs 10 \
  --seed 42 \
  --rank 128 \
  --lora_alpha 1 \
  --mixed_precision bf16 \
  --output_dir /mnt/ceph/develop/jiawei/model_checkpoint/cogvideox-lora_t2v_nccltest_optimizer_adamw__steps_318000__lr-schedule_cosine_with_restarts__learning-rate_1e-3/ \
  --max_num_frames 49 \
  --train_batch_size 1 \
  --max_train_steps 318000 \
  --checkpointing_steps 1000 \
  --gradient_accumulation_steps 1 \
  --gradient_checkpointing \
  --learning_rate 1e-3 \
  --lr_scheduler cosine_with_restarts \
  --lr_warmup_steps 400 \
  --lr_num_cycles 1 \
  --enable_slicing \
  --enable_tiling \
  --optimizer adamw \
  --beta1 0.9 \
  --beta2 0.95 \
  --weight_decay 0.001 \
  --max_grad_norm 1.0 \
  --allow_tf32 \
  --resume_from_checkpoint latest \
  --report_to wandb \
  --tracker_name cogvideox-lora_t2v_nccltest_optimizer_adamw__steps_318000__lr-schedule_cosine_with_restarts__learning-rate_1e-3 \
  --nccl_timeout 100000
[W1010 20:01:27.480336609 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W1010 20:01:27.484495631 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W1010 20:01:27.485932375 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W1010 20:01:27.486361502 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W1010 20:01:27.487214245 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W1010 20:01:27.487283135 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W1010 20:01:27.488506520 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W1010 20:01:27.489500649 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:07<00:00,  3.63s/it]
{'use_learned_positional_embeddings'} was not found in config. Values will be initialized to default values.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:07<00:00,  3.60s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:07<00:00,  3.58s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:07<00:00,  3.59s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:07<00:00,  3.60s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:07<00:00,  3.62s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:07<00:00,  3.64s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:07<00:00,  3.67s/it]
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: dmeck. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.18.3
wandb: Run data is saved locally in /mnt/ceph/develop/jiawei/cogvideox-distillation/wandb/run-20241010_200158-y81bdphb
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run deft-sunset-3
wandb: ⭐️ View project at https://wandb.ai/dmeck/cogvideox-lora_t2v_nccltest_optimizer_adamw__steps_318000__lr-schedule_cosine_with_restarts__learning-rate_1e-3
wandb: 🚀 View run at https://wandb.ai/dmeck/cogvideox-lora_t2v_nccltest_optimizer_adamw__steps_318000__lr-schedule_cosine_with_restarts__learning-rate_1e-3/runs/y81bdphb
===== Memory before training =====
memory_allocated=12.717 GB
max_memory_allocated=12.717 GB
max_memory_reserved=12.727 GB
***** Running training *****
  Num trainable parameters = 58982400
  Num examples = 30
  Num epochs = 10600
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient accumulation steps = 1
  Total optimization steps = 318000
Checkpoint 'latest' does not exist. Starting a new training run.
Steps:   0%|                                                                                 | 0/318000 [00:00<?, ?it/s][rank4]:[W1010 20:02:05.963532430 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
[rank7]:[W1010 20:02:05.010916401 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank0]:[W1010 20:02:05.031537133 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
[rank5]:[W1010 20:02:05.055048320 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
[rank6]:[W1010 20:02:05.092652994 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
[rank1]:[W1010 20:02:05.132175889 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
[rank2]:[W1010 20:02:05.281313264 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
[rank3]:[W1010 20:02:09.321965683 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
[rank0]:[E1010 20:02:31.443078129 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank4]:[E1010 20:02:31.642538131 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 4] Exception (either an error or timeout) detected by watchdog at work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank4]:[E1010 20:02:32.405027945 ProcessGroupNCCL.cpp:621] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank4]:[E1010 20:02:32.405047125 ProcessGroupNCCL.cpp:627] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E1010 20:02:32.405101238 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 4] Process group watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.20.5
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
NET/IB : Got completion from peer 10.76.228.50<26062> with status=5 opcode=129 len=47104 vendor err 244 (Recv) localGid fe80::966d:aeff:fec6:c6c2 remoteGidsfe80::966d:aeff:fec6:a34a
Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1892 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f19d7977f86 in /mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x220 (0x7f1989bca1e0 in /mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7c (0x7f1989bca42c in /mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x213 (0x7f1989bd1313 in /mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f1989bd371c in /mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xdbbf4 (0x7f19d920bbf4 in /mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/bin/../lib/libstdc++.so.6)
frame #6: <unknown function> + 0x94ac3 (0x7f19dcb8aac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #7: <unknown function> + 0x126850 (0x7f19dcc1c850 in /lib/x86_64-linux-gnu/libc.so.6)

W1010 20:02:32.799000 140505171003200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1343064 closing signal SIGTERM
W1010 20:02:32.800000 140505171003200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1343065 closing signal SIGTERM
W1010 20:02:32.801000 140505171003200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1343066 closing signal SIGTERM
W1010 20:02:32.801000 140505171003200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1343067 closing signal SIGTERM
W1010 20:02:32.802000 140505171003200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1343069 closing signal SIGTERM
W1010 20:02:32.802000 140505171003200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1343070 closing signal SIGTERM
W1010 20:02:32.803000 140505171003200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1343071 closing signal SIGTERM
E1010 20:02:34.124000 140505171003200 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 4 (pid: 1343068) of binary: /mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/bin/python
Traceback (most recent call last):
  File "/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
    multi_gpu_launcher(args)
  File "/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
    distrib_run.run(args)
  File "/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
training/cogvideox_text_to_video_lora.py FAILED
--------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-10_20:02:32
  host      : nm04-a800-node083
  rank      : 4 (local_rank: 4)
  exitcode  : -6 (pid: 1343068)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 1343068
========================================================
-------------------- Finished executing script --------------------
glide-the commented 1 month ago

Training runs normally with these environment variables commented out:

#!/bin/bash 
export TORCH_LOGS="+dynamo,recompiles,graph_breaks"
# export TORCHDYNAMO_VERBOSE=1
# export WANDB_MODE="online"
# export NCCL_P2P_DISABLE=1
# export TORCH_NCCL_ENABLE_MONITORING=0
export WANDB_API_KEY=
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True 
glide-the commented 1 month ago

export NCCL_P2P_DISABLE=1

To be precise, it is this line that must be commented out. Presumably, with P2P disabled, NCCL routes intra-node traffic over the NET/IB transport instead of NVLink/PCIe peer-to-peer, which is exactly where the failing completions in the logs above originate.
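
As a quick sanity check (a minimal diagnostic sketch, not from the thread), you can confirm that the GPUs on the node have peer-to-peer access; if they do, `NCCL_P2P_DISABLE=1` only forces traffic onto the slower transport that is failing here:

import torch

# Report any GPU pair on this node that lacks P2P access. On an
# A800-SXM4 machine all pairs should normally have access, in which
# case disabling P2P serves no purpose and merely reroutes NCCL traffic.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j and not torch.cuda.can_device_access_peer(i, j):
            print(f"GPU {i} -> GPU {j}: no P2P access")
print(f"Checked P2P access across {n} GPUs")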