THUDM / CogVideo

text and image to video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Apache License 2.0

diffusers finetune error #377

Closed: syyxsxx closed this issue 3 weeks ago

syyxsxx commented 1 month ago

System Info / 系統信息

h100 cuda-12.2

Information / 问题信息

Reproduction / 复现过程

Single H100, running finetune_single_rank.sh with --multi_gpu removed. Running the official example produces the error below; it happens regardless of whether DeepSpeed is enabled in the accelerate config.

/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (2.2.3) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
[2024-09-28 18:15:08,691] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (2.2.3) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
[2024-09-28 18:15:12,968] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-28 18:15:14,125] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-09-28 18:15:14,126] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[W928 18:15:14.666303639 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
09/28/2024 18:15:14 - INFO - __main__ - Distributed environment: DEEPSPEED  Backend: nccl
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: bf16
ds_config: {'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 1, 'zero_optimization': {'stage': 2, 'offload_optimizer': {'device': 'none', 'nvme_path': None}, 'offload_param': {'device': 'none', 'nvme_path': None}, 'stage3_gather_16bit_weights_on_model_save': False}, 'gradient_clipping': 1.0, 'steps_per_print': inf, 'bf16': {'enabled': True}, 'fp16': {'enabled': False}}

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Downloading shards: 100%|████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 5127.51it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 2/2 [00:19<00:00,  9.59s/it]
Fetching 2 files: 100%|██████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 8924.05it/s]
{'use_learned_positional_embeddings'} was not found in config. Values will be initialized to default values.
####################
False
[rank0]: Traceback (most recent call last):
[rank0]:   File "train_cogvideox_lora.py", line 1546, in <module>
[rank0]:     main(args)
[rank0]:   File "train_cogvideox_lora.py", line 1241, in main
[rank0]:     transformer, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
[rank0]:   File "/root/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1318, in prepare
[rank0]:     result = self._prepare_deepspeed(*args)
[rank0]:   File "/root/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1719, in _prepare_deepspeed
[rank0]:     raise ValueError(
[rank0]: ValueError: Either specify a scheduler in the config file or pass in the `lr_scheduler_callable` parameter when using `accelerate.utils.DummyScheduler`.
E0928 18:20:41.941045 140103679866688 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 15802) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/root/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/.local/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/root/.local/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
    deepspeed_launcher(args)
  File "/root/.local/lib/python3.8/site-packages/accelerate/commands/launch.py", line 852, in deepspeed_launcher
    distrib_run.run(args)
  File "/root/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/root/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train_cogvideox_lora.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-28_18:20:41
  host      : notebook-b36518d1-ff84-47d8-97b6-dad2416411bd-0.notebook-b36518d1-ff84-47d8-97b6-dad2416411bd.colossal-ai.svc.cluster.local
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 15802)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Expected behavior / 期待表现

Hoping someone can help resolve this issue.

zRzRzRzRzRzRzR commented 1 month ago

Did you install the diffusers library from source? Also, --multi_gpu must be kept; otherwise you need to add local_machine and set the GPU id.
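A minimal sketch of the suggested launch, for readers hitting the same error: keep `--multi_gpu` even on a single card and pin the device with `CUDA_VISIBLE_DEVICES`. The config-file name and the extra flags below are assumptions based on the repo's finetune_single_rank.sh, not something confirmed in this thread:

```shell
# Hypothetical sketch: pin a single GPU but still launch with --multi_gpu,
# as the maintainer suggests. Paths and flags are placeholders.
export CUDA_VISIBLE_DEVICES=0

accelerate launch --multi_gpu \
  --config_file accelerate_config_machine_single.yaml \
  train_cogvideox_lora.py \
  --pretrained_model_name_or_path THUDM/CogVideoX-2b
```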

syyxsxx commented 1 month ago

> Did you install the diffusers library from source? Also, --multi_gpu must be kept; otherwise you need to add local_machine and set the GPU id.

@zRzRzRzRzRzRzR I have installed the diffusers library from source, and I already use CUDA_VISIBLE_DEVICES to set the GPU id. What do you mean by adding local_machine? P.S. thanks for your reply.

trinath3 commented 1 month ago

I believe it has something to do with the accelerate version.

foreverpiano commented 1 month ago

@syyxsxx does it work right now?

zRzRzRzRzRzRzR commented 1 month ago

Can you try with accelerate version 1.0.0?

foreverpiano commented 1 month ago

@zRzRzRzRzRzRzR can you upload the `pip freeze` output of a runnable environment?

syyxsxx commented 3 weeks ago

Hi guys, as @zRzRzRzRzRzRzR said, I used CUDA_VISIBLE_DEVICES to set the GPU id and added --multi_gpu back; it works for me now. Thanks @zRzRzRzRzRzRzR, I will close the issue.
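For readers who cannot use `--multi_gpu` and hit the same `ValueError` from `accelerator.prepare`, the message points at a second workaround: declare a scheduler in the DeepSpeed config so accelerate does not fall back to `DummyScheduler` without a callable. A sketch of such an entry, using DeepSpeed's documented `WarmupLR` scheduler (the warmup values here are placeholders, not values from this thread):

```json
{
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 1e-4,
      "warmup_num_steps": 100
    }
  }
}
```

This fragment would be merged into the DeepSpeed JSON config that the `ds_config` in the log above is built from; the thread itself was resolved by restoring `--multi_gpu` instead.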