Closed by glide-the 1 month ago
The same data can be used for fine-tuning with this diffusers example script: https://github.com/huggingface/diffusers/tree/main/examples/cogvideo
#!/bin/bash
export MODEL_PATH="/mnt/ceph/develop/jiawei/model_checkpoint/CogVideoX-5b-I2V"
export CACHE_PATH="$HOME/.cache"
export OUTPUT_PATH="/mnt/ceph/develop/jiawei/model_checkpoint/hf_cogvideox_imglora_test"
export VAL_IMAGE1="/mnt/ceph/develop/jiawei/diffusers_fork_zmf/examples/cogvideo/frame0.jpg"
export VAL_IMAGE2="/mnt/ceph/develop/jiawei/diffusers_fork_zmf/examples/cogvideo/frame30.jpg"
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
GPU_IDS="0,1,2,3,4,5,6,7"
WANDB_PROJECT=DiffUsers_CogVideoX_IMAGE_test
# If you are not using 8 GPUs, change num_processes in `accelerate_config_machine_single.yaml` to match your GPU count
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True WANDB_API_KEY= accelerate launch --main_process_port 29501 --gpu_ids $GPU_IDS --config_file /mnt/ceph/develop/jiawei/diffusers_fork_zmf/examples/cogvideo/lora_image_k7.yaml \
train_cogvideox_image_to_video_lora.py \
--gradient_checkpointing \
--pretrained_model_name_or_path $MODEL_PATH \
--cache_dir $CACHE_PATH \
--enable_tiling \
--enable_slicing \
--instance_data_root /mnt/ceph/develop/jiawei/lora_dataset/Dance-VideoGeneration-Dataset \
--caption_column prompts.txt \
--video_column videos.txt \
--id_token 奶糖, \
--validation_prompt "奶糖, A young girl in a white blouse and navy skirt stands in a sunlit park, smiling and holding up two fingers. She's surrounded by trees and a pathway, with dappled sunlight casting shadows. A young woman in a school uniform stands on a tree-lined path, surprised, with hands raised. In the park, a woman in a white blouse with a navy collar raises her hands in a playful 'V' shape, surrounded by lush greenery and sunlight.:::奶糖, A young woman with long dark hair tied into ponytails stands in a cozy, warmly lit room, smiling gently at the camera. She takes a selfie, her hair styled in loose waves, with a playful expression. The background is a plain, light-colored wall, emphasizing her features." \
--validation_images "$VAL_IMAGE1:::$VAL_IMAGE2" \
--validation_prompt_separator ::: \
--num_validation_videos 1 \
--validation_epochs 5 \
--seed 42 \
--rank 128 \
--lora_alpha 32 \
--mixed_precision bf16 \
--output_dir $OUTPUT_PATH \
--height 480 \
--width 720 \
--fps 8 \
--max_num_frames 49 \
--skip_frames_start 0 \
--skip_frames_end 0 \
--train_batch_size 1 \
--num_train_epochs 150 \
--checkpointing_steps 1000 \
--gradient_accumulation_steps 1 \
--learning_rate 1e-3 \
--lr_scheduler cosine_with_restarts \
--lr_warmup_steps 200 \
--lr_num_cycles 1 \
--enable_slicing \
--enable_tiling \
--gradient_checkpointing \
--optimizer AdamW \
--adam_beta1 0.9 \
--adam_beta2 0.95 \
--max_grad_norm 1.0 \
--resume_from_checkpoint latest \
--report_to wandb --tracker_name $WANDB_PROJECT
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: 0,1,2,3,5,6,7
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 7
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
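The config above lists seven GPU ids with `num_processes: 7`, while the launch script earlier passes eight ids through `--gpu_ids`. A minimal sketch of a consistency check, assuming a flat `key: value` config like the one shown; `parse_flat_config` and `check_gpu_count` are hypothetical helper names, and the sketch does not decide whether the command-line value or the file value wins at launch time:

```python
def parse_flat_config(text):
    """Parse simple 'key: value' lines (enough for a flat config, not full YAML)."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        config[key.strip()] = value.strip().strip("'\"")
    return config


def check_gpu_count(config, cli_gpu_ids=None):
    """Return (number of GPU ids, num_processes, whether they match).

    cli_gpu_ids is the optional string passed as --gpu_ids on the command line.
    A value such as "all" is not handled by this sketch.
    """
    gpu_ids = cli_gpu_ids if cli_gpu_ids is not None else config.get("gpu_ids", "")
    ids = [g.strip() for g in gpu_ids.split(",") if g.strip()]
    num_processes = int(config["num_processes"])
    return len(ids), num_processes, len(ids) == num_processes


# Values copied from the config shown above.
config = parse_flat_config("gpu_ids: 0,1,2,3,5,6,7\nnum_processes: 7\n")
print(check_gpu_count(config))
print(check_gpu_count(config, cli_gpu_ids="0,1,2,3,4,5,6,7"))
```

The second call flags the mismatch between the eight ids given on the command line and the seven processes in the file, which is worth ruling out before looking at NCCL itself.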
Does the error happen during validation/testing? If so, it might be because of a low NCCL timeout. You could increase it at Accelerator initialization using --nccl_timeout 1800. From a quick look at the codebase, I don't think accelerate honors the timeout environment variables, so you need to set the timeout through InitProcessGroupKwargs().
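As a rough sketch of that suggestion, a command-line value in seconds can be turned into the `timedelta` that `InitProcessGroupKwargs` expects. The helper names and the default of 600 seconds are assumptions made for illustration, not taken from the training script:

```python
import argparse
from datetime import timedelta


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Mirrors the --nccl_timeout option mentioned above; the default is an arbitrary assumption.
    parser.add_argument("--nccl_timeout", type=int, default=600,
                        help="NCCL collective timeout in seconds")
    return parser.parse_args(argv)


def nccl_timeout_from_args(args):
    """Convert the CLI value (seconds) to a timedelta, rejecting non-positive values."""
    if args.nccl_timeout <= 0:
        raise ValueError("--nccl_timeout must be a positive number of seconds")
    return timedelta(seconds=args.nccl_timeout)


def build_kwargs_handlers(timeout):
    """Build the handler list to pass as Accelerator(kwargs_handlers=...)."""
    # Imported lazily so the helpers above can be tried without accelerate installed.
    from accelerate.utils import InitProcessGroupKwargs
    return [InitProcessGroupKwargs(backend="nccl", timeout=timeout)]


args = parse_args(["--nccl_timeout", "1800"])
timeout = nccl_timeout_from_args(args)
print(f"NCCL timeout: {timeout}")
```

A longer timeout only helps when ranks are slow rather than broken; if a rank has already failed, the remaining ranks simply wait longer before the same error surfaces.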
No, this happens at the beginning of training. I changed the timeout in the code to 100000, but apart from a long wait at the start, nothing else changed.
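from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import DistributedDataParallelKwargs, InitProcessGroupKwargs, ProjectConfiguration

# `args` and `logging_dir` come from the surrounding training script.
accelerator_project_config = ProjectConfiguration(project_dir=args.output_dir, logging_dir=logging_dir)
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
# Raise the NCCL process-group timeout far above the default.
init_process_group_kwargs = InitProcessGroupKwargs(backend="nccl", timeout=timedelta(seconds=1000000))
accelerator = Accelerator(
    gradient_accumulation_steps=args.gradient_accumulation_steps,
    mixed_precision=args.mixed_precision,
    log_with=args.report_to,
    project_config=accelerator_project_config,
    kwargs_handlers=[ddp_kwargs, init_process_group_kwargs],
)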
Running command: accelerate launch --gpu_ids 0,1,2,3,4,5,6,7 --config_file accelerate_configs/uncompiled_2.yaml training/cogvideox_text_to_video_lora.py --pretrained_model_name_or_path /mnt/ceph/develop/jiawei/model_checkpoint/CogVideoX-2b-base --data_root /mnt/ceph/develop/jiawei/lora_dataset/Dance-VideoGeneration-Dataset-encoded-2048 --caption_column prompts.txt --video_column videos.txt --load_tensors --video_reshape_mode center --height_buckets 480 --width_buckets 720 --frame_buckets 49 --dataloader_num_workers 8 --pin_memory --id_token "奶糖," --validation_prompt "奶糖, A young girl in a white blouse and navy skirt stands in a sunlit park, smiling and holding up two fingers. She's surrounded by trees and a pathway, with dappled sunlight casting shadows. A young woman in a school uniform stands on a tree-lined path, surprised, with hands raised. In the park, a woman in a white blouse with a navy collar raises her hands in a playful 'V' shape, surrounded by lush greenery and sunlight.:::奶糖, A young woman with long dark hair tied into ponytails stands in a cozy, warmly lit room, smiling gently at the camera. She takes a selfie, her hair styled in loose waves, with a playful expression. The background is a plain, light-colored wall, emphasizing her features." 
--validation_prompt_separator ::: --num_validation_videos 1 --validation_epochs 10 --seed 42 --rank 128 --lora_alpha 1 --mixed_precision bf16 --output_dir /mnt/ceph/develop/jiawei/model_checkpoint/cogvideox-lora_t2v_nccltest_optimizer_adamw__steps_318000__lr-schedule_cosine_with_restarts__learning-rate_1e-3/ --max_num_frames 49 --train_batch_size 1 --max_train_steps 318000 --checkpointing_steps 1000 --gradient_accumulation_steps 1 --gradient_checkpointing --learning_rate 1e-3 --lr_scheduler cosine_with_restarts --lr_warmup_steps 400 --lr_num_cycles 1 --enable_slicing --enable_tiling --optimizer adamw --beta1 0.9 --beta2 0.95 --weight_decay 0.001 --max_grad_norm 1.0 --allow_tf32 --resume_from_checkpoint latest --report_to wandb --tracker_name cogvideox-lora_t2v_nccltest_optimizer_adamw__steps_318000__lr-schedule_cosine_with_restarts__learning-rate_1e-3 --nccl_timeout 100000
[W1010 20:01:27.480336609 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W1010 20:01:27.484495631 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W1010 20:01:27.485932375 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W1010 20:01:27.486361502 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W1010 20:01:27.487214245 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W1010 20:01:27.487283135 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W1010 20:01:27.488506520 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W1010 20:01:27.489500649 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:07<00:00, 3.63s/it]
{'use_learned_positional_embeddings'} was not found in config. Values will be initialized to default values.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:07<00:00, 3.60s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:07<00:00, 3.58s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:07<00:00, 3.59s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:07<00:00, 3.60s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:07<00:00, 3.62s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:07<00:00, 3.64s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:07<00:00, 3.67s/it]
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: dmeck. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.18.3
wandb: Run data is saved locally in /mnt/ceph/develop/jiawei/cogvideox-distillation/wandb/run-20241010_200158-y81bdphb
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run deft-sunset-3
wandb: ⭐️ View project at https://wandb.ai/dmeck/cogvideox-lora_t2v_nccltest_optimizer_adamw__steps_318000__lr-schedule_cosine_with_restarts__learning-rate_1e-3
wandb: 🚀 View run at https://wandb.ai/dmeck/cogvideox-lora_t2v_nccltest_optimizer_adamw__steps_318000__lr-schedule_cosine_with_restarts__learning-rate_1e-3/runs/y81bdphb
===== Memory before training =====
memory_allocated=12.717 GB
max_memory_allocated=12.717 GB
max_memory_reserved=12.727 GB
***** Running training *****
Num trainable parameters = 58982400
Num examples = 30
Num epochs = 10600
Instantaneous batch size per device = 1
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient accumulation steps = 1
Total optimization steps = 318000
Checkpoint 'latest' does not exist. Starting a new training run.
Steps: 0%| | 0/318000 [00:00<?, ?it/s][rank4]:[W1010 20:02:05.963532430 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
[rank7]:[W1010 20:02:05.010916401 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank0]:[W1010 20:02:05.031537133 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
[rank5]:[W1010 20:02:05.055048320 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
[rank6]:[W1010 20:02:05.092652994 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
[rank1]:[W1010 20:02:05.132175889 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
[rank2]:[W1010 20:02:05.281313264 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
[rank3]:[W1010 20:02:09.321965683 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
[rank0]:[E1010 20:02:31.443078129 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank4]:[E1010 20:02:31.642538131 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 4] Exception (either an error or timeout) detected by watchdog at work: 24, last enqueued NCCL work: 24, last completed NCCL work: 23.
[rank4]:[E1010 20:02:32.405027945 ProcessGroupNCCL.cpp:621] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank4]:[E1010 20:02:32.405047125 ProcessGroupNCCL.cpp:627] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E1010 20:02:32.405101238 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 4] Process group watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.20.5
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
NET/IB : Got completion from peer 10.76.228.50<26062> with status=5 opcode=129 len=47104 vendor err 244 (Recv) localGid fe80::966d:aeff:fec6:c6c2 remoteGidsfe80::966d:aeff:fec6:a34a
Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1892 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f19d7977f86 in /mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x220 (0x7f1989bca1e0 in /mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7c (0x7f1989bca42c in /mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x213 (0x7f1989bd1313 in /mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f1989bd371c in /mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xdbbf4 (0x7f19d920bbf4 in /mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/bin/../lib/libstdc++.so.6)
frame #6: <unknown function> + 0x94ac3 (0x7f19dcb8aac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #7: <unknown function> + 0x126850 (0x7f19dcc1c850 in /lib/x86_64-linux-gnu/libc.so.6)
W1010 20:02:32.799000 140505171003200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1343064 closing signal SIGTERM
W1010 20:02:32.800000 140505171003200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1343065 closing signal SIGTERM
W1010 20:02:32.801000 140505171003200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1343066 closing signal SIGTERM
W1010 20:02:32.801000 140505171003200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1343067 closing signal SIGTERM
W1010 20:02:32.802000 140505171003200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1343069 closing signal SIGTERM
W1010 20:02:32.802000 140505171003200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1343070 closing signal SIGTERM
W1010 20:02:32.803000 140505171003200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1343071 closing signal SIGTERM
E1010 20:02:34.124000 140505171003200 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 4 (pid: 1343068) of binary: /mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/bin/python
Traceback (most recent call last):
File "/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
multi_gpu_launcher(args)
File "/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
distrib_run.run(args)
File "/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/mnt/ceph/develop/jiawei/conda_env/cogvidex_distillation/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
training/cogvideox_text_to_video_lora.py FAILED
--------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-10-10_20:02:32
host : nm04-a800-node083
rank : 4 (local_rank: 4)
exitcode : -6 (pid: 1343068)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 1343068
========================================================
-------------------- Finished executing script --------------------
With these environment variables commented out, training runs normally.
#!/bin/bash
export TORCH_LOGS="+dynamo,recompiles,graph_breaks"
# export TORCHDYNAMO_VERBOSE=1
# export WANDB_MODE="online"
# export NCCL_P2P_DISABLE=1
# export TORCH_NCCL_ENABLE_MONITORING=0
export WANDB_API_KEY=
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export NCCL_P2P_DISABLE=1
To be precise, comment out the line above (export NCCL_P2P_DISABLE=1).
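The same change can be made from a Python launcher instead of editing the shell script. This is a minimal sketch under that assumption: `build_launch_env` is a hypothetical helper that drops `NCCL_P2P_DISABLE` from the environment handed to the child process and optionally turns on NCCL's own logging through `NCCL_DEBUG=INFO`:

```python
import os


def build_launch_env(base_env=None, debug=True):
    """Return a copy of the environment suitable for launching training.

    NCCL_P2P_DISABLE is removed, matching the report above that training
    ran once that export was commented out.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.pop("NCCL_P2P_DISABLE", None)
    if debug:
        # NCCL_DEBUG=INFO makes NCCL log which transport each connection uses.
        env["NCCL_DEBUG"] = "INFO"
    return env


env = build_launch_env({"NCCL_P2P_DISABLE": "1", "PATH": "/usr/bin"})
print(sorted(env))

# The resulting dict can be passed to a child process, for example:
#   subprocess.run(["accelerate", "launch", ...], env=env, check=True)
```

With the debug output enabled, the NCCL log lines show whether ranks on the same node talk over P2P, shared memory, or NET/IB, which helps confirm why the failure in the trace above involved the InfiniBand path.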
GPU count | 8
GPU type | NVIDIA A800-SXM4-80GB
enable nccl
train script
uncompiled_2.yaml