THUDM / CogVideo

text and image to video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Apache License 2.0

diffusers fine-tuning weird error occurred #541

Open AlphaNext opened 3 days ago

AlphaNext commented 3 days ago

System Info

Python 3.10.12, torch 2.4.0+cu121, CUDA 12.2, accelerate 1.1.1

Information

Reproduction

Log errors:

[W1122 13:45:52.579715884 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W1122 13:45:52.653884504 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W1122 13:45:52.665469534 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W1122 13:45:52.351059829 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
11/22/2024 13:45:54 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: bf16

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]11/22/2024 13:45:56 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 2
Local process index: 2
Device: cuda:2

Mixed precision type: bf16

11/22/2024 13:45:57 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 3
Local process index: 3
Device: cuda:3

Mixed precision type: bf16

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]11/22/2024 13:45:58 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 1
Local process index: 1
Device: cuda:1

Mixed precision type: bf16

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|█████     | 1/2 [00:19<00:19, 19.00s/it]
Loading checkpoint shards:  50%|█████     | 1/2 [00:22<00:22, 22.96s/it]
Loading checkpoint shards:  50%|█████     | 1/2 [00:20<00:20, 20.55s/it]
Loading checkpoint shards:  50%|█████     | 1/2 [00:20<00:20, 20.63s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:58<00:00, 31.11s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:58<00:00, 29.30s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [01:00<00:00, 31.75s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [01:00<00:00, 30.07s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [01:02<00:00, 32.75s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [01:02<00:00, 31.28s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [01:00<00:00, 31.77s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [01:00<00:00, 30.10s/it]
{'use_learned_positional_embeddings'} was not found in config. Values will be initialized to default values.
W1122 14:00:09.668000 140708476823360 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 31662 closing signal SIGTERM
W1122 14:00:09.669000 140708476823360 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 31663 closing signal SIGTERM
W1122 14:00:09.676000 140708476823360 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 31664 closing signal SIGTERM
E1122 14:00:26.050000 140708476823360 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -9) local_rank: 0 (pid: 31661) of binary: /opt/conda/bin/python3.10
Traceback (most recent call last):
  File "/opt/conda/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
    multi_gpu_launcher(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
    distrib_run.run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
train_cogvideox_lora.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-22_14:00:09
  host      : xxxxxxxxxxxxxxx
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 31661)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 31661
======================================================
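
Note on the failure mode: the root-cause block shows rank 0 (PID 31661) exiting with code -9 (SIGKILL) while the other ranks are sent SIGTERM, and there is no Python traceback from the training script itself. A SIGKILL of this kind most commonly comes from outside the process, e.g. the kernel OOM killer reclaiming host memory while each of the four ranks loads its own copy of the model. A minimal check, assuming shell access on the training node (these commands are an illustration, not part of the original report):

```bash
# Look for OOM-killer entries around the time the run died (may require root).
dmesg -T | grep -i -E "out of memory|killed process" | tail -n 20

# Watch host RAM and swap while the four ranks load checkpoint shards in parallel.
free -h
```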

Expected behavior

The LoRA fine-tuning run should complete on 4 GPUs instead of being killed; please help resolve this error.

zRzRzRzRzRzRzR commented 3 days ago

It seems that this is not due to the model but to a torch error. Are you doing distributed training?

AlphaNext commented 2 days ago

> It seems that this is not due to the model but to a torch error. Are you doing distributed training?

Yes, a single node with 4 GPUs, using the scripts finetune_single_rank.sh and accelerate_config_machine_single.yaml.

num_processes has been changed, and the default distributed_type: DEEPSPEED is kept.
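
For context, a minimal sketch of what a single-node, 4-GPU accelerate config of this kind could look like (hypothetical values; the repository's actual accelerate_config_machine_single.yaml may differ):

```yaml
# Hypothetical sketch, not the repository's actual accelerate_config_machine_single.yaml.
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero_stage: 2
mixed_precision: bf16
num_machines: 1
num_processes: 4
machine_rank: 0
main_training_function: main
use_cpu: false
```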

The run command is:

CUDA_VISIBLE_DEVICES="0,1,2,3" accelerate launch --config_file accelerate_config_machine_single.yaml --multi_gpu \
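
The command above is cut off at the line continuation; purely as an illustration of its general shape (the real arguments are the ones set in finetune_single_rank.sh):

```bash
# Illustrative shape only; the actual training-script arguments live in finetune_single_rank.sh.
CUDA_VISIBLE_DEVICES="0,1,2,3" accelerate launch \
  --config_file accelerate_config_machine_single.yaml \
  --multi_gpu --num_processes 4 \
  train_cogvideox_lora.py "$@"   # inside the script, the remaining arguments are forwarded here
```

One detail worth noting: the log reports Distributed environment: MULTI_GPU with the nccl backend, so the --multi_gpu flag on the command line appears to take precedence over the distributed_type: DEEPSPEED set in the config file.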