Closed: sxk000 closed this issue 3 months ago
Sorry for the inconvenience. Could you downgrade Triton to 2.1.0 (pip install triton==2.1.0) and try again? If the problem persists, please feel free to contact us.
After downgrading, training works normally. Thank you very much!
pip install triton==2.1.0
Installing collected packages: triton
  Attempting uninstall: triton
    Found existing installation: triton 2.2.0
    Uninstalling triton-2.2.0:
      Successfully uninstalled triton-2.2.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torch 2.2.1 requires triton==2.2.0; platform_system == "Linux" and platform_machine == "x86_64" and python_version < "3.12", but you have triton 2.1.0 which is incompatible.
Successfully installed triton-2.1.0
Although pip reported an incompatibility with torch 2.2.1 during the downgrade, the training script runs normally and the acceleration options work; no more errors.
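For reference, a quick way to confirm which versions are actually in use after the downgrade, assuming both packages import cleanly:

python -c "import torch, triton; print(torch.__version__, triton.__version__)"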
Thanks again!
@LZHgrla
Hello!
Our pretraining script is based on: https://github.com/InternLM/xtuner/blob/main/examples/demo_data/pretrain/config.py
We fine-tune with full parameters; the specific changes are:
#######################################################################
#                     PART 2  Model & Tokenizer                       #
#######################################################################
tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    padding_side='right')
# The original QLoRA model definition, commented out in favor of
# full-parameter fine-tuning:
# model = dict(
#     type=SupervisedFinetune,
#     llm=dict(
#         type=AutoModelForCausalLM.from_pretrained,
#         pretrained_model_name_or_path=pretrained_model_name_or_path,
#         trust_remote_code=True,
#         torch_dtype=torch.float16,
#         quantization_config=dict(
#             type=BitsAndBytesConfig,
#             load_in_4bit=True,
#             load_in_8bit=False,
#             llm_int8_threshold=6.0,
#             llm_int8_has_fp16_weight=False,
#             bnb_4bit_compute_dtype=torch.float16,
#             bnb_4bit_use_double_quant=True,
#             bnb_4bit_quant_type='nf4')),
#     lora=dict(
#         type=LoraConfig,
#         r=64,
#         lora_alpha=16,
#         lora_dropout=0.1,
#         bias='none',
#         task_type='CAUSAL_LM'))
# Full-parameter fine-tuning: load the LLM in fp16, with no quantization or LoRA.
model = dict(
    type=SupervisedFinetune,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.float16))
Run command: NPROC_PER_NODE=8 xtuner train pt_config.py --deepspeed deepspeed_zero3
It runs, but terminates abruptly partway through training. Partial log:
03/12 20:23:18 - mmengine - INFO - Iter(train) [ 20/5004] lr: 1.9999e-04 eta: 1 day, 0:45:34 time: 17.9112 data_time: 0.0149 memory: 69765 loss: 14.8530
[2024-03-12 20:24:28,593] [WARNING] [stage3.py:2069:step] 5 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
[2024-03-12 20:24:50,087] torch.distributed.elastic.agent.server.api: [WARNING] Received Signals.SIGHUP death signal, shutting down workers
[2024-03-12 20:24:50,087] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 35791 closing signal SIGHUP
[2024-03-12 20:24:50,087] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 35792 closing signal SIGHUP
[2024-03-12 20:24:50,088] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 35793 closing signal SIGHUP
[2024-03-12 20:24:50,088] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 35794 closing signal SIGHUP
Traceback (most recent call last):
  File "/root/miniconda3/envs/p310xtuner3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    result = agent.run()
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 727, in run
    result = self._invoke_run(role)
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 868, in _invoke_run
    time.sleep(monitor_interval)
  File "/root/miniconda3/envs/p310xtuner3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 35682 got signal: 1
Full error log: 312pt.log
How should this problem be solved?
Thank you very much!
@sxk000 Are you using nohup?
Related issue: open-mmlab/mmrotate#210
Yes:
nohup sh pt.sh > 312pt.log 2>&1 &
pt.sh contains the run command: NPROC_PER_NODE=8 xtuner train pt_config.py --deepspeed deepspeed_zero3
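For completeness, a minimal sketch of what pt.sh presumably contains, based on the description above:

#!/bin/bash
# pt.sh - launch pretraining on all 8 GPUs with DeepSpeed ZeRO-3
NPROC_PER_NODE=8 xtuner train pt_config.py --deepspeed deepspeed_zero3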
I looked at the issue link you posted above:
Should I also follow the extra step shown in the screenshot?
@sxk000 I'd suggest dropping nohup and trying tmux instead; it is much nicer to use. You can install it directly with conda:
conda install tmux
If you prefer to keep using nohup, you can work through the solutions listed in the issue above and see if they help.
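A minimal tmux workflow for the same launch might look like the following; the session name pt is just an illustration:

tmux new -s pt        # start a named session
NPROC_PER_NODE=8 xtuner train pt_config.py --deepspeed deepspeed_zero3
# detach with Ctrl-b d; the job keeps running even if the SSH connection drops
tmux attach -t pt     # reattach later to check progress

Unlike nohup, a detached tmux session is not exposed to the SIGHUP that the elastic agent received above when the controlling terminal went away.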
First of all, thanks to Shanghai AI Laboratory and its members for sharing the InternLM models, the code framework, and their technical experience!
The machine is a single node with 8x A800 GPUs, CentOS 7, CUDA 12.2.2, cuDNN 8.9.7.29.
Operating System: CentOS Linux 7 (Core)
CPE OS Name: cpe:/o:centos:centos:7
Kernel: Linux 3.10.0-1160.el7.x86_64
Architecture: x86-64
CUDA: cuda_12.2.2_535.104.05_linux.run
cuDNN: cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar
The specific steps were as follows (consolidated as a script sketch after this list):
1. pip install xtuner automatically installed xtuner 0.1.14, along with dependencies such as torch 2.2.1 and nvidia-cudnn-cu12.
2. pip install deepspeed automatically installed version 0.13.5.
3. pip install flash-attn failed to build; the error log recommended this prebuilt wheel: https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.6/flash_attn-2.5.6+cu121torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
4. After downloading flash_attn-2.5.6+cu121torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl, the install succeeded.
5. Ran the pretraining command: NPROC_PER_NODE=8 xtuner train internlm2_chat_20b_full_finetune_custom_dataset_e1.py --deepspeed deepspeed_zero3
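Consolidated as a single shell sketch; the wget step is an assumption about how the wheel was downloaded:

pip install xtuner        # pulled in xtuner 0.1.14, torch 2.2.1, nvidia-cudnn-cu12, ...
pip install deepspeed     # installed deepspeed 0.13.5
# pip install flash-attn failed to build from source; install the prebuilt wheel instead:
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.6/flash_attn-2.5.6+cu121torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.5.6+cu121torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
NPROC_PER_NODE=8 xtuner train internlm2_chat_20b_full_finetune_custom_dataset_e1.py --deepspeed deepspeed_zero3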
Installed dependency list: requirements.txt
Full error log: 报错日志.log
Part of the error log is shown below:
How should this problem be solved?
Thank you!