InternLM / xtuner

An efficient, flexible and full-featured toolkit for fine-tuning LLM (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)
https://xtuner.readthedocs.io/zh-cn/latest/
Apache License 2.0

Error when running the pre-training script after the environment is set up #447

Closed · sxk000 closed this issue 3 months ago

sxk000 commented 4 months ago

First of all, thanks to the Shanghai AI Laboratory and its members for sharing the InternLM models, the code framework, and their technical experience!

The machine is a single node with multiple GPUs: 8x A800, CentOS 7, CUDA 12.2.2, cuDNN 8.9.7.29.

Operating System: CentOS Linux 7 (Core)
CPE OS Name: cpe:/o:centos:centos:7
Kernel: Linux 3.10.0-1160.el7.x86_64
Architecture: x86-64

CUDA installer: cuda_12.2.2_535.104.05_linux.run
cuDNN archive: cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar

The exact steps were as follows (a consolidated sketch follows the list):

1. pip install xtuner automatically installed xtuner 0.1.14, along with dependencies such as torch 2.2.1 and nvidia-cudnn-cu12.
2. pip install deepspeed automatically installed version 0.13.5.
3. pip install flash-attn failed to build; the build error log recommended this wheel: https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.6/flash_attn-2.5.6+cu121torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
4. After downloading the wheel from step 3 (flash_attn-2.5.6+cu121torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl), it installed successfully.
5. Ran the pre-training command: NPROC_PER_NODE=8 xtuner train internlm2_chat_20b_full_finetune_custom_dataset_e1.py --deepspeed deepspeed_zero3
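For reference, a consolidated sketch of the setup sequence above, assuming Python 3.10, CUDA 12.x, and the exact wheel URL from step 3; resolved package versions may differ on other systems:

# 1-2. Install xtuner (pulls in torch 2.2.1 etc.) and deepspeed
pip install xtuner
pip install deepspeed

# 3-4. flash-attn fails to build from source here, so install the prebuilt wheel
#      recommended by the build error (cu121 / torch 2.2 / cp310)
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.6/flash_attn-2.5.6+cu121torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

# 5. Launch single-node, 8-GPU pre-training with ZeRO-3
NPROC_PER_NODE=8 xtuner train internlm2_chat_20b_full_finetune_custom_dataset_e1.py --deepspeed deepspeed_zero3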

Installed dependencies: requirements.txt. Full error log: 报错日志.log

Part of the error log is shown below:

Shanghai, the bustling metropolis of China, is not only known for its modern skyline and vibrant nightlife but also for its numerous scenic spots that offer a perfect blend of tradition, culture, and natural beauty. Here are five must-

03/06 09:44:30 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
03/06 09:44:30 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
03/06 09:44:30 - mmengine - INFO - Checkpoints will be saved to /apply/app/llm/intern/work_dirs/internlm2_chat_20b_full_finetune_custom_dataset_e1.
/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/tmp/tmpzxt81eyz/main.c: In function ‘list_to_cuuint64_array’:
/tmp/tmpzxt81eyz/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
/tmp/tmpzxt81eyz/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code
/tmp/tmpzxt81eyz/main.c: In function ‘list_to_cuuint32_array’:
/tmp/tmpzxt81eyz/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
/tmp/tmpb9yhj06s/main.c: In function ‘list_to_cuuint64_array’:
/tmp/tmpyv0re5u3/main.c: In function ‘list_to_cuuint64_array’:
/tmp/tmpb9yhj06s/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
/tmp/tmpyv0re5u3/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
/tmp/tmpb9yhj06s/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code
/tmp/tmpyv0re5u3/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code
/tmp/tmpb9yhj06s/main.c: In function ‘list_to_cuuint32_array’:
/tmp/tmpyv0re5u3/main.c: In function ‘list_to_cuuint32_array’:
/tmp/tmpb9yhj06s/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
/tmp/tmpyv0re5u3/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
Traceback (most recent call last):
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/xtuner/tools/train.py", line 307, in <module>
    main()
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/xtuner/tools/train.py", line 303, in main
    runner.train()
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1200, in train
    model = self.train_loop.run()  # type: ignore
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/mmengine/runner/loops.py", line 286, in run
    self.run_iter(data_batch)
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/mmengine/runner/loops.py", line 309, in run_iter
Traceback (most recent call last):
      File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/xtuner/tools/train.py", line 307, in <module>
outputs = self.runner.model.train_step(
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 133, in train_step
    main()
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/xtuner/tools/train.py", line 303, in main
    losses = self._run_forward(data, mode='loss')
      File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 176, in _run_forward
    return self._call_impl(*args, **kwargs)  File "/root/.cache/huggingface/modules/transformers_modules/internlm2-chat-20b/modeling_internlm2.py", line 920, in custom_forward
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1561, in _call_impl
    self.utils = CudaUtils()
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/triton/runtime/driver.py", line 47, in __init__
        so = _build("cuda_utils", src_path, tmpdir)outputs = run_function(*args)
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/triton/runtime/driver.py", line 154, in _initialize_obj
    self._obj = self._init_fn()
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/triton/runtime/driver.py", line 187, in initialize_driver
    return CudaDriver()
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/triton/runtime/driver.py", line 77, in __init__
    self.utils = CudaUtils()
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/triton/runtime/driver.py", line 47, in __init__
    so = _build("cuda_utils", src_path, tmpdir)
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/triton/common/build.py", line 106, in _build
    ret = subprocess.check_call(cc_cmd)
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmp7zd5mbk5/main.c', '-O3', '-I/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/triton/common/../third_party/cuda/include', '-I/root/miniconda3/envs/p310xtuner3/include/python3.10', '-I/tmp/tmp7zd5mbk5', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmp7zd5mbk5/cuda_utils.cpython-310-x86_64-linux-gnu.so', '-L/lib64', '-L/lib', '-L/lib64', '-L/lib']' returned non-zero exit status 1.
[2024-03-06 09:44:35,390] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 5698) of binary: /root/miniconda3/envs/p310xtuner3/bin/python3.10
Traceback (most recent call last):
  File "/root/miniconda3/envs/p310xtuner3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/xtuner/tools/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-06_09:44:35
  host      : h3c-1st-mysteel-algorithm3
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 5699)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
...
[0]:
  time      : 2024-03-06_09:44:35
  host      : h3c-1st-mysteel-algorithm3
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 5698)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

How should this problem be solved?

Thanks!

HIT-cwh commented 4 months ago

Sorry for the inconvenience. Could you downgrade Triton to 2.1.0 (pip install triton==2.1.0) and try again? If you still run into problems, feel free to contact us.
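A minimal sketch of the suggested downgrade, with an added check (not part of the original suggestion) that the environment now picks up Triton 2.1.0:

pip install triton==2.1.0

# Confirm the version Python actually imports
python -c "import triton; print(triton.__version__)"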

sxk000 commented 4 months ago

Sorry for the inconvenience. Could you downgrade Triton to 2.1.0 (pip install triton==2.1.0) and try again? If you still run into problems, feel free to contact us.

After downgrading, training works normally! Thank you very much!

pip install triton==2.1.0

Installing collected packages: triton
  Attempting uninstall: triton
    Found existing installation: triton 2.2.0
    Uninstalling triton-2.2.0:
      Successfully uninstalled triton-2.2.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torch 2.2.1 requires triton==2.2.0; platform_system == "Linux" and platform_machine == "x86_64" and python_version < "3.12", but you have triton 2.1.0 which is incompatible.
Successfully installed triton-2.1.0

Although the downgrade reported an incompatibility with torch 2.2.1 during installation, the training script runs normally, the acceleration options still work, and there are no more errors!
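A sketch, not from this thread, of one way to snapshot the now-working environment so the deliberate triton pin is not lost on a later reinstall; pip check will keep flagging the torch/triton mismatch, which, as described above, is expected and harmless here:

# Snapshot the working environment, including triton==2.1.0
pip freeze > requirements-lock.txt

# pip check still reports "torch 2.2.1 requires triton==2.2.0"; per this thread
# that conflict is expected and training is unaffected
pip check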

Thanks again!

sxk000 commented 3 months ago

@LZHgrla

Hi!

Our pre-training script is based on: https://github.com/InternLM/xtuner/blob/main/examples/demo_data/pretrain/config.py

We fine-tune with full parameters; the specific changes are:


#######################################################################
#                      PART 2  Model & Tokenizer                      #
#######################################################################
tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    padding_side='right')

# model = dict(
#     type=SupervisedFinetune,
#     llm=dict(
#         type=AutoModelForCausalLM.from_pretrained,
#         pretrained_model_name_or_path=pretrained_model_name_or_path,
#         trust_remote_code=True,
#         torch_dtype=torch.float16,
#         quantization_config=dict(
#             type=BitsAndBytesConfig,
#             load_in_4bit=True,
#             load_in_8bit=False,
#             llm_int8_threshold=6.0,
#             llm_int8_has_fp16_weight=False,
#             bnb_4bit_compute_dtype=torch.float16,
#             bnb_4bit_use_double_quant=True,
#             bnb_4bit_quant_type='nf4')),
#     lora=dict(
#         type=LoraConfig,
#         r=64,
#         lora_alpha=16,
#         lora_dropout=0.1,
#         bias='none',
#         task_type='CAUSAL_LM'))

model = dict(
    type=SupervisedFinetune,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.float16))

Run command: NPROC_PER_NODE=8 xtuner train pt_config.py --deepspeed deepspeed_zero3

It runs normally, but training terminates abruptly partway through. Part of the log is shown below:


03/12 20:23:18 - mmengine - INFO - Iter(train) [  20/5004]  lr: 1.9999e-04  eta: 1 day, 0:45:34  time: 17.9112  data_time: 0.0149  memory: 69765  loss: 14.8530
[2024-03-12 20:24:28,593] [WARNING] [stage3.py:2069:step] 5 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
[2024-03-12 20:24:50,087] torch.distributed.elastic.agent.server.api: [WARNING] Received Signals.SIGHUP death signal, shutting down workers
[2024-03-12 20:24:50,087] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 35791 closing signal SIGHUP
[2024-03-12 20:24:50,087] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 35792 closing signal SIGHUP
[2024-03-12 20:24:50,088] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 35793 closing signal SIGHUP
[2024-03-12 20:24:50,088] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 35794 closing signal SIGHUP
Traceback (most recent call last):
  File "/root/miniconda3/envs/p310xtuner3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    result = agent.run()
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 727, in run
    result = self._invoke_run(role)
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 868, in _invoke_run
    time.sleep(monitor_interval)
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 35682 got signal: 1

Full error log: 312pt.log

How should this problem be solved?

Thank you very much!

LZHgrla commented 3 months ago

@sxk000 Are you using nohup?

Related issue: https://github.com/open-mmlab/mmrotate/issues/210

sxk000 commented 3 months ago

@sxk000 Are you using nohup?

Related issue: open-mmlab/mmrotate#210

Yes:

nohup sh pt.sh > 312pt.log 2>&1 &

pt.sh contains the run command: NPROC_PER_NODE=8 xtuner train pt_config.py --deepspeed deepspeed_zero3
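Not a fix proposed in this thread, but a commonly used variant of the launch above: starting the job in its own session with setsid, so the SIGHUP sent when the terminal closes never reaches the training processes (same pt.sh and log file as in the command above):

# Launch pt.sh in a new session, fully detached from the current terminal
setsid nohup sh pt.sh > 312pt.log 2>&1 < /dev/null &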

I looked at the issue link you posted above: (screenshot)

Should I also follow the extra step shown in the screenshot?

LZHgrla commented 3 months ago

@sxk000 I suggest dropping nohup and trying tmux instead; it works much better! You can install it directly with conda:

conda install tmux

If you want to keep using nohup, you can follow the fixes listed in the issue above and give them a try.
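A minimal tmux workflow for this use case; the session name xtuner_pt is an arbitrary choice:

conda install tmux

# Start a named session and launch training inside it
tmux new -s xtuner_pt
NPROC_PER_NODE=8 xtuner train pt_config.py --deepspeed deepspeed_zero3

# Detach with Ctrl-b d; the job keeps running in the background session.
# Reattach later to check progress:
tmux attach -t xtuner_pt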