Closed limllzu closed 3 months ago
How large is your dataset?
The dataset contains only one sample, the one provided with the official demo:
[
{
"id": "0",
"image": "path/image/image_0.jpg",
"conversations": [
{
"role": "user",
"content": "<image>\nHow many desserts are on the white plate?"
},
{
"role": "assistant",
"content": "There are three desserts on the white plate."
},
{
"role": "user",
"content": "What type of desserts are they?"
},
{
"role": "assistant",
"content": "The desserts are cakes with bananas and pecans on top. They share similarities with donuts, but the presence of bananas and pecans differentiates them."
},
{
"role": "user",
"content": "What is the setting of the image?"
},
{
"role": "assistant",
"content": "The image is set on a table top with a plate containing the three desserts."
}
]
}
]
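Before suspecting the training loop, a quick structural check of the data file can rule out a format problem. The following is a minimal sketch, not part of the repository: `validate_sample` is a hypothetical helper, and the rules it checks (alternating `user`/`assistant` roles, an `<image>` placeholder in the first user turn) are assumptions inferred from the demo sample above rather than taken from `finetune.py`.

```python
import json


def validate_sample(sample):
    """Return a list of problems found in one training sample (empty if none)."""
    problems = []
    for key in ("id", "image", "conversations"):
        if key not in sample:
            problems.append(f"missing key: {key}")
    turns = sample.get("conversations", [])
    if not turns:
        problems.append("no conversation turns")
    for i, turn in enumerate(turns):
        # Assumed rule: turns alternate, starting with the user.
        expected = "user" if i % 2 == 0 else "assistant"
        if turn.get("role") != expected:
            problems.append(
                f"turn {i}: expected role {expected!r}, got {turn.get('role')!r}"
            )
    # Assumed rule: the first user turn carries the image placeholder.
    if turns and "<image>" not in turns[0].get("content", ""):
        problems.append("first user turn has no <image> placeholder")
    return problems


# A shortened copy of the demo sample, embedded so the sketch is self-contained.
demo = json.loads("""[{"id": "0", "image": "path/image/image_0.jpg", "conversations": [
  {"role": "user", "content": "<image>\\nHow many desserts are on the white plate?"},
  {"role": "assistant", "content": "There are three desserts on the white plate."}]}]""")

for sample in demo:
    print(sample["id"], validate_sample(sample) or "ok")
```

If the sample passes, the hang is more likely to lie in the distributed setup than in the data.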
This is my environment, for your reference: requirements.txt. My Linux kernel version is 5.4.0; I wonder whether the problem is caused by your kernel version being 3.10.0.
Thank you for your answer. It turned out that NCCL was not installed. After installing it, however, I ran into a new problem. Could you take a look? Thanks.
Error message:
Traceback (most recent call last):
  File "/LLM/openbmb/MiniCPM-V/finetune/finetune.py", line 708, in <module>
    train()
  File "/LLM/openbmb/MiniCPM-V/finetune/finetune.py", line 690, in train
    trainer.train()
  File "/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
  File "/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 2045, in _inner_training_loop
    model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
  File "/envs/llm/lib/python3.10/site-packages/accelerate/accelerator.py", line 1284, in prepare
    result = self._prepare_deepspeed(*args)
  File "/envs/llm/lib/python3.10/site-packages/accelerate/accelerator.py", line 1751, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/envs/llm/lib/python3.10/site-packages/deepspeed/__init__.py", line 181, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 262, in __init__
    self._configure_distributed_model(model)
  File "/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1149, in _configure_distributed_model
    self._broadcast_model()
  File "/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1069, in _broadcast_model
    dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
  File "/envs/llm/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
    return func(*args, **kwargs)
  File "/envs/llm/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
    return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/envs/llm/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/envs/llm/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 199, in broadcast
    return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/envs/llm/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/envs/llm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1910, in broadcast
    work = group.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1251, internal error - please report this issue to the NCCL developers, NCCL version 2.18.6
ncclInternalError: Internal check failed.
Last error:
Bootstrap : no socket interface found
[2024-06-14 16:05:19,306] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 30898 closing signal SIGTERM
[2024-06-14 16:05:19,306] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 30899 closing signal SIGTERM
[2024-06-14 16:05:19,307] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 30900 closing signal SIGTERM
[2024-06-14 16:05:20,135] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 30897) of binary: /envs/llm/bin/python
Output:
prepare trainer
trainer ok
[2024-06-14 16:05:12,128] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.3, git-hash=unknown, git-branch=unknown
gpu009:30897:30897 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens1f1
gpu009:30897:30897 [0] NCCL INFO NCCL_SOCKET_IFNAME set to ens1f1
gpu009:30897:30897 [0] bootstrap.cc:45 NCCL WARN Bootstrap : no socket interface found
gpu009:30897:30897 [0] NCCL INFO init.cc:82 -> 3
gpu009:30897:30897 [0] NCCL INFO init.cc:101 -> 3
Judging from the log, the network interface does not seem to be set correctly. Yet ifconfig does list the ens1f1 interface, and I can ping through it. Could you please take a look? Thank you!
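One thing worth checking for "no socket interface found" is which interfaces the training process itself can see, since that may differ from what ifconfig reports in a login shell (for example inside a container or a scheduler job). The sketch below is a simplified approximation of how `NCCL_SOCKET_IFNAME` is documented to match names (comma-separated prefixes, `=` for exact match, `^` for exclusion). `matching_interfaces` is a hypothetical helper, and it does not check whether an interface is up or has a usable address, which NCCL also requires.

```python
import os
import socket


def matching_interfaces(spec, available):
    """Return the names in `available` selected by an NCCL_SOCKET_IFNAME-style spec.

    Simplified approximation: 'eth' matches by prefix, '=eth0' matches exactly,
    '^docker' excludes by prefix. Several patterns may be comma-separated.
    """
    exclude = spec.startswith("^")
    if exclude:
        spec = spec[1:]
    exact = spec.startswith("=")
    if exact:
        spec = spec[1:]
    patterns = [p for p in spec.split(",") if p]

    def hit(name):
        return any((name == p) if exact else name.startswith(p) for p in patterns)

    # Keep a name when it matches (include mode) or does not match (exclude mode).
    return [name for name in available if hit(name) != exclude]


if __name__ == "__main__":
    # socket.if_nameindex() lists the interfaces visible to this process (Unix only).
    names = [name for _, name in socket.if_nameindex()]
    spec = os.environ.get("NCCL_SOCKET_IFNAME", "")
    print("interfaces visible to this process:", names)
    if spec:
        print("selected by NCCL_SOCKET_IFNAME:", matching_interfaces(spec, names))
```

Running it through the same launcher and environment as the training job shows whether ens1f1 is visible there at all. An empty selection would be consistent with the warning above, although it would not by itself prove the cause.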
When I switch the network interface to ib0, the error no longer appears, but according to the NCCL log the job still hangs and training never starts.
Output:
prepare trainer Training dataset length: 1 Validation dataset length: 1 trainer ok [2024-06-14 16:59:42,697] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.3, git-hash=unknown, git-branch=unknown gpu009:47867:47867 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0 gpu009:47867:47867 [0] NCCL INFO NCCL_SOCKET_IFNAME set to ib0 gpu009:47867:47867 [0] NCCL INFO Bootstrap : Using ib0:11.11.8.9<0> gpu009:47867:47867 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory gpu009:47867:47867 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation gpu009:47868:47868 [1] NCCL INFO cudaDriverVersion 12000 gpu009:47869:47869 [2] NCCL INFO cudaDriverVersion 12000 gpu009:47868:47868 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0 gpu009:47868:47868 [1] NCCL INFO NCCL_SOCKET_IFNAME set to ib0 gpu009:47869:47869 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0 gpu009:47869:47869 [2] NCCL INFO NCCL_SOCKET_IFNAME set to ib0 gpu009:47870:47870 [3] NCCL INFO cudaDriverVersion 12000 gpu009:47868:47868 [1] NCCL INFO Bootstrap : Using ib0:11.11.8.9<0> gpu009:47869:47869 [2] NCCL INFO Bootstrap : Using ib0:11.11.8.9<0> gpu009:47869:47869 [2] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory gpu009:47868:47868 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory gpu009:47869:47869 [2] NCCL INFO NET/Plugin : No plugin found, using internal implementation gpu009:47868:47868 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation gpu009:47870:47870 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0 gpu009:47870:47870 [3] NCCL INFO NCCL_SOCKET_IFNAME set to ib0 gpu009:47869:47869
[2] NCCL INFO init.cc:1584 Cuda Host Alloc Size 4 pointer 0x7fc46e800000 gpu009:47868:47868 [1] NCCL INFO init.cc:1584 Cuda Host Alloc Size 4 pointer 0x7fc390800000 gpu009:47870:47870 [3] NCCL INFO Bootstrap : Using ib0:11.11.8.9<0> gpu009:47870:47870 [3] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory gpu009:47870:47870 [3] NCCL INFO NET/Plugin : No plugin found, using internal implementation gpu009:47870:47870 [3] NCCL INFO init.cc:1584 Cuda Host Alloc Size 4 pointer 0x7fbfc0800000 gpu009:47869:48590 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 1. gpu009:47869:48590 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0 gpu009:47869:48590 [2] NCCL INFO NET/Socket : Using [0]ib0:11.11.8.9<0> gpu009:47869:48590 [2] NCCL INFO Using network Socket gpu009:47868:48591 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1. gpu009:47868:48591 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0 gpu009:47867:47867 [0] NCCL INFO cudaDriverVersion 12000 NCCL version 2.18.6+cuda12.1 gpu009:47868:48591 [1] NCCL INFO NET/Socket : Using [0]ib0:11.11.8.9<0> gpu009:47868:48591 [1] NCCL INFO Using network Socket gpu009:47870:48592 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 1. gpu009:47867:47867 [0] NCCL INFO init.cc:1584 Cuda Host Alloc Size 4 pointer 0x7f2adc800000 gpu009:47870:48592 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0 gpu009:47870:48592 [3] NCCL INFO NET/Socket : Using [0]ib0:11.11.8.9<0> gpu009:47870:48592 [3] NCCL INFO Using network Socket gpu009:47867:48593 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1. 
gpu009:47867:48593 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0 gpu009:47867:48593 [0] NCCL INFO NET/Socket : Using [0]ib0:11.11.8.9<0> gpu009:47867:48593 [0] NCCL INFO Using network Socket gpu009:47867:48593 [0] NCCL INFO comm 0x7ef5c880 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 50000 commId 0x361b5540a6088610 - Init START gpu009:47870:48592 [3] NCCL INFO comm 0x68da1c00 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 9c000 commId 0x361b5540a6088610 - Init START gpu009:47869:48590 [2] NCCL INFO comm 0x69374b40 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 57000 commId 0x361b5540a6088610 - Init START gpu009:47868:48591 [1] NCCL INFO comm 0x68b107c0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 53000 commId 0x361b5540a6088610 - Init START gpu009:47870:48592 [3] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'ib0' gpu009:47868:48591 [1] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'ib0' gpu009:47869:48590 [2] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'ib0' gpu009:47867:48593 [0] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'ib0' gpu009:47870:48592 [3] NCCL INFO NCCL_P2P_LEVEL set by environment to NVL gpu009:47870:48592 [3] NCCL INFO === System : maxBw 24.0 totalBw 24.0 === gpu009:47870:48592 [3] NCCL INFO CPU/0 (1/1/2) gpu009:47870:48592 [3] NCCL INFO + PCI[24.0] - PCI/4B000 (1000c01010000000) gpu009:47870:48592 [3] NCCL INFO + PCI[24.0] - GPU/50000 (0) gpu009:47870:48592 [3] NCCL INFO + PCI[24.0] - GPU/53000 (1) gpu009:47870:48592 [3] NCCL INFO + PCI[24.0] - NIC/56000 gpu009:47870:48592 [3] NCCL INFO + PCI[24.0] - GPU/57000 (2) gpu009:47870:48592 [3] NCCL INFO + SYS[10.0] - CPU/1 gpu009:47870:48592 [3] NCCL INFO CPU/1 (1/1/2) gpu009:47870:48592 [3] NCCL INFO + PCI[24.0] - PCI/98000 (1000c01010000000) gpu009:47870:48592 [3] NCCL INFO + PCI[24.0] - GPU/9C000 (3) gpu009:47870:48592 [3] NCCL INFO + SYS[10.0] - CPU/0 gpu009:47870:48592 [3] NCCL INFO ========================================== gpu009:47870:48592 [3] 
NCCL INFO GPU/50000 :GPU/50000 (0/5000.000000/LOC) GPU/53000 (4/24.000000/PHB) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) gpu009:47868:48591 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to NVL gpu009:47870:48592 [3] NCCL INFO GPU/53000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (0/5000.000000/LOC) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) gpu009:47870:48592 [3] NCCL INFO GPU/57000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (4/24.000000/PHB) GPU/57000 (0/5000.000000/LOC) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) gpu009:47870:48592 [3] NCCL INFO GPU/9C000 :GPU/50000 (5/10.000000/SYS) GPU/53000 (5/10.000000/SYS) GPU/57000 (5/10.000000/SYS) GPU/9C000 (0/5000.000000/LOC) CPU/0 (3/10.000000/SYS) CPU/1 (2/24.000000/PHB) gpu009:47870:48592 [3] NCCL INFO Setting affinity for GPU 3 to 3ff00000,0000003f,f0000000 gpu009:47870:48592 [3] NCCL INFO NVLS multicast support is not available on dev 3 gpu009:47869:48590 [2] NCCL INFO NCCL_P2P_LEVEL set by environment to NVL gpu009:47867:48593 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to NVL gpu009:47868:48591 [1] NCCL INFO === System : maxBw 24.0 totalBw 24.0 === gpu009:47868:48591 [1] NCCL INFO CPU/0 (1/1/2) gpu009:47868:48591 [1] NCCL INFO + PCI[24.0] - PCI/4B000 (1000c01010000000) gpu009:47868:48591 [1] NCCL INFO + PCI[24.0] - GPU/50000 (0) gpu009:47868:48591 [1] NCCL INFO + PCI[24.0] - GPU/53000 (1) gpu009:47868:48591 [1] NCCL INFO + PCI[24.0] - NIC/56000 gpu009:47868:48591 [1] NCCL INFO + PCI[24.0] - GPU/57000 (2) gpu009:47868:48591 [1] NCCL INFO + SYS[10.0] - CPU/1 gpu009:47868:48591 [1] NCCL INFO CPU/1 (1/1/2) gpu009:47868:48591 [1] NCCL INFO + PCI[24.0] - PCI/98000 (1000c01010000000) gpu009:47868:48591 [1] NCCL INFO + PCI[24.0] - GPU/9C000 (3) gpu009:47869:48590 [2] NCCL INFO === System : maxBw 24.0 totalBw 24.0 === gpu009:47868:48591 [1] NCCL INFO + SYS[10.0] - CPU/0 
gpu009:47867:48593 [0] NCCL INFO === System : maxBw 24.0 totalBw 24.0 === gpu009:47869:48590 [2] NCCL INFO CPU/0 (1/1/2) gpu009:47868:48591 [1] NCCL INFO ========================================== gpu009:47867:48593 [0] NCCL INFO CPU/0 (1/1/2) gpu009:47869:48590 [2] NCCL INFO + PCI[24.0] - PCI/4B000 (1000c01010000000) gpu009:47868:48591 [1] NCCL INFO GPU/50000 :GPU/50000 (0/5000.000000/LOC) GPU/53000 (4/24.000000/PHB) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) gpu009:47867:48593 [0] NCCL INFO + PCI[24.0] - PCI/4B000 (1000c01010000000) gpu009:47869:48590 [2] NCCL INFO + PCI[24.0] - GPU/50000 (0) gpu009:47868:48591 [1] NCCL INFO GPU/53000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (0/5000.000000/LOC) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) gpu009:47867:48593 [0] NCCL INFO + PCI[24.0] - GPU/50000 (0) gpu009:47869:48590 [2] NCCL INFO + PCI[24.0] - GPU/53000 (1) gpu009:47868:48591 [1] NCCL INFO GPU/57000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (4/24.000000/PHB) GPU/57000 (0/5000.000000/LOC) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) gpu009:47867:48593 [0] NCCL INFO + PCI[24.0] - GPU/53000 (1) gpu009:47869:48590 [2] NCCL INFO + PCI[24.0] - NIC/56000 gpu009:47868:48591 [1] NCCL INFO GPU/9C000 :GPU/50000 (5/10.000000/SYS) GPU/53000 (5/10.000000/SYS) GPU/57000 (5/10.000000/SYS) GPU/9C000 (0/5000.000000/LOC) CPU/0 (3/10.000000/SYS) CPU/1 (2/24.000000/PHB) gpu009:47867:48593 [0] NCCL INFO + PCI[24.0] - NIC/56000 gpu009:47869:48590 [2] NCCL INFO + PCI[24.0] - GPU/57000 (2) gpu009:47870:48592 [3] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 7.000000/7.000000, type SYS/PIX, sameChannels 1 gpu009:47868:48591 [1] NCCL INFO Setting affinity for GPU 1 to 03ff,00000000,0003ff00 gpu009:47867:48593 [0] NCCL INFO + PCI[24.0] - GPU/57000 (2) gpu009:47869:48590 [2] NCCL INFO + SYS[10.0] - CPU/1 gpu009:47870:48592 [3] NCCL INFO 0 : 
GPU/0 GPU/1 GPU/2 GPU/3 gpu009:47868:48591 [1] NCCL INFO NVLS multicast support is not available on dev 1 gpu009:47867:48593 [0] NCCL INFO + SYS[10.0] - CPU/1 gpu009:47869:48590 [2] NCCL INFO CPU/1 (1/1/2) gpu009:47867:48593 [0] NCCL INFO CPU/1 (1/1/2) gpu009:47869:48590 [2] NCCL INFO + PCI[24.0] - PCI/98000 (1000c01010000000) gpu009:47867:48593 [0] NCCL INFO + PCI[24.0] - PCI/98000 (1000c01010000000) gpu009:47870:48592 [3] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 10.000000/10.000000, type SYS/PIX, sameChannels 1 gpu009:47869:48590 [2] NCCL INFO + PCI[24.0] - GPU/9C000 (3) gpu009:47867:48593 [0] NCCL INFO + PCI[24.0] - GPU/9C000 (3) gpu009:47870:48592 [3] NCCL INFO 0 : GPU/0 GPU/1 GPU/3 GPU/2 gpu009:47869:48590 [2] NCCL INFO + SYS[10.0] - CPU/0 gpu009:47867:48593 [0] NCCL INFO + SYS[10.0] - CPU/0 gpu009:47869:48590 [2] NCCL INFO ========================================== gpu009:47867:48593 [0] NCCL INFO ========================================== gpu009:47869:48590 [2] NCCL INFO GPU/50000 :GPU/50000 (0/5000.000000/LOC) GPU/53000 (4/24.000000/PHB) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) gpu009:47867:48593 [0] NCCL INFO GPU/50000 :GPU/50000 (0/5000.000000/LOC) GPU/53000 (4/24.000000/PHB) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) gpu009:47869:48590 [2] NCCL INFO GPU/53000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (0/5000.000000/LOC) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) gpu009:47867:48593 [0] NCCL INFO GPU/53000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (0/5000.000000/LOC) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) gpu009:47869:48590 [2] NCCL INFO GPU/57000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (4/24.000000/PHB) GPU/57000 (0/5000.000000/LOC) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) 
gpu009:47867:48593 [0] NCCL INFO GPU/57000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (4/24.000000/PHB) GPU/57000 (0/5000.000000/LOC) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) gpu009:47869:48590 [2] NCCL INFO GPU/9C000 :GPU/50000 (5/10.000000/SYS) GPU/53000 (5/10.000000/SYS) GPU/57000 (5/10.000000/SYS) GPU/9C000 (0/5000.000000/LOC) CPU/0 (3/10.000000/SYS) CPU/1 (2/24.000000/PHB) gpu009:47867:48593 [0] NCCL INFO GPU/9C000 :GPU/50000 (5/10.000000/SYS) GPU/53000 (5/10.000000/SYS) GPU/57000 (5/10.000000/SYS) GPU/9C000 (0/5000.000000/LOC) CPU/0 (3/10.000000/SYS) CPU/1 (2/24.000000/PHB) gpu009:47869:48590 [2] NCCL INFO Setting affinity for GPU 2 to 03ff,00000000,0003ff00 gpu009:47867:48593 [0] NCCL INFO Setting affinity for GPU 0 to 03ff,00000000,0003ff00 gpu009:47869:48590 [2] NCCL INFO NVLS multicast support is not available on dev 2 gpu009:47867:48593 [0] NCCL INFO NVLS multicast support is not available on dev 0 gpu009:47868:48591 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 7.000000/7.000000, type SYS/PIX, sameChannels 1 gpu009:47868:48591 [1] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3 gpu009:47868:48591 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 10.000000/10.000000, type SYS/PIX, sameChannels 1 gpu009:47868:48591 [1] NCCL INFO 0 : GPU/0 GPU/1 GPU/3 GPU/2 gpu009:47869:48590 [2] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 7.000000/7.000000, type SYS/PIX, sameChannels 1 gpu009:47869:48590 [2] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3 gpu009:47867:48593 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 7.000000/7.000000, type SYS/PIX, sameChannels 1 gpu009:47867:48593 [0] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3 gpu009:47869:48590 [2] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 10.000000/10.000000, type SYS/PIX, sameChannels 1 gpu009:47869:48590 [2] NCCL INFO 0 : GPU/0 GPU/1 GPU/3 GPU/2 gpu009:47867:48593 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 10.000000/10.000000, type SYS/PIX, sameChannels 1 
gpu009:47867:48593 [0] NCCL INFO 0 : GPU/0 GPU/1 GPU/3 GPU/2 gpu009:47870:48592 [3] NCCL INFO Ring 00 : 2 -> 3 -> 0 gpu009:47870:48592 [3] NCCL INFO Ring 01 : 2 -> 3 -> 0 gpu009:47868:48591 [1] NCCL INFO Tree 0 : 0 -> 1 -> 3/-1/-1 gpu009:47870:48592 [3] NCCL INFO Trees [0] 2/-1/-1->3->1 [1] 2/-1/-1->3->1 gpu009:47867:48593 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 gpu009:47868:48591 [1] NCCL INFO Tree 1 : 0 -> 1 -> 3/-1/-1 gpu009:47869:48590 [2] NCCL INFO Ring 00 : 1 -> 2 -> 3 gpu009:47870:48592 [3] NCCL INFO P2P Chunksize set to 131072 gpu009:47867:48593 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 gpu009:47868:48591 [1] NCCL INFO Ring 00 : 0 -> 1 -> 2 gpu009:47869:48590 [2] NCCL INFO Ring 01 : 1 -> 2 -> 3 gpu009:47868:48591 [1] NCCL INFO Ring 01 : 0 -> 1 -> 2 gpu009:47869:48590 [2] NCCL INFO Trees [0] -1/-1/-1->2->3 [1] -1/-1/-1->2->3 gpu009:47867:48593 [0] NCCL INFO Channel 00/02 : 0 1 2 3 gpu009:47870:48592 [3] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536) gpu009:47868:48591 [1] NCCL INFO Trees [0] 3/-1/-1->1->0 [1] 3/-1/-1->1->0 gpu009:47869:48590 [2] NCCL INFO P2P Chunksize set to 131072 gpu009:47867:48593 [0] NCCL INFO Channel 01/02 : 0 1 2 3 gpu009:47868:48591 [1] NCCL INFO P2P Chunksize set to 131072 gpu009:47867:48593 [0] NCCL INFO Ring 00 : 3 -> 0 -> 1 gpu009:47867:48593 [0] NCCL INFO Ring 01 : 3 -> 0 -> 1 gpu009:47867:48593 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 gpu009:47867:48593 [0] NCCL INFO P2P Chunksize set to 131072 gpu009:47867:48593 [0] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536) gpu009:47868:48591 [1] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536) gpu009:47869:48590 [2] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536) gpu009:47870:48592 [3] NCCL INFO channel.cc:40 Cuda Alloc Size 1536 pointer 0x7fbfc1a00000 gpu009:47870:48592 [3] NCCL INFO channel.cc:43 Cuda Alloc Size 72 pointer 0x7fbfc1a00600 gpu009:47870:48592 [3] NCCL INFO channel.cc:54 Cuda Alloc Size 16 pointer 
0x7fbfc1a00800 gpu009:47870:48592 [3] NCCL INFO channel.cc:40 Cuda Alloc Size 1536 pointer 0x7fbfc1a00a00 gpu009:47870:48592 [3] NCCL INFO channel.cc:43 Cuda Alloc Size 72 pointer 0x7fbfc1a01000 gpu009:47870:48592 [3] NCCL INFO channel.cc:54 Cuda Alloc Size 16 pointer 0x7fbfc1a01200 gpu009:47870:48592 [3] NCCL INFO Allocated 9637892 bytes of shared memory in /dev/shm/nccl-AP8lNO gpu009:47867:48593 [0] NCCL INFO channel.cc:40 Cuda Alloc Size 1536 pointer 0x7f2adda00000 gpu009:47868:48591 [1] NCCL INFO channel.cc:40 Cuda Alloc Size 1536 pointer 0x7fc391a00000
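In the log above, all four ranks print an "Init START" line, and the excerpt ends during channel allocation. A small parser can make it easier to see which ranks never finish communicator initialization in a long log. This is a sketch under one assumption: that NCCL reports completion with a line of the same shape ending in "Init COMPLETE", which does not appear in the excerpt. `stalled_ranks` is a hypothetical helper name.

```python
import re

# Matches e.g. "... rank 0 nranks 4 cudaDev 0 ... commId 0x... - Init START".
# The "Init COMPLETE" variant is assumed to have the same layout.
INIT_RE = re.compile(r"rank (\d+) nranks (\d+) [^\n]*? - Init (START|COMPLETE)")


def stalled_ranks(log_text):
    """Return the sorted ranks that logged Init START but no Init COMPLETE."""
    started, completed = set(), set()
    for match in INIT_RE.finditer(log_text):
        rank = int(match.group(1))
        if match.group(3) == "START":
            started.add(rank)
        else:
            completed.add(rank)
    return sorted(started - completed)


if __name__ == "__main__":
    # Two lines copied from the log above, joined the way the paste flattened them.
    excerpt = (
        "gpu009:47867:48593 [0] NCCL INFO comm 0x7ef5c880 rank 0 nranks 4 cudaDev 0 "
        "nvmlDev 0 busId 50000 commId 0x361b5540a6088610 - Init START "
        "gpu009:47870:48592 [3] NCCL INFO comm 0x68da1c00 rank 3 nranks 4 cudaDev 3 "
        "nvmlDev 3 busId 9c000 commId 0x361b5540a6088610 - Init START"
    )
    print("ranks still initializing:", stalled_ranks(excerpt))
```

The pattern works on both line-per-entry logs and flattened pastes such as the one above, because it stops at the first "Init" marker after each rank field.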
Which fine-tuning framework are you using?
Full-parameter fine-tuning, launched with finetune_ds.sh.
Is there an existing issue / discussion for this?
Is there an existing answer for this in FAQ?
Current Behavior
The dataset loads without any problem, but the run hangs indefinitely at trainer.train() in finetune.py.
Expected Behavior
No response
Steps To Reproduce
Data:
[
{
"id": "0",
"image": "path/image/001.jpg",
"conversations": [
{
"role": "user",
"content": "<image>\nHow many desserts are on the white plate?"
},
{
"role": "assistant",
"content": "There are three desserts on the white plate."
},
{
"role": "user",
"content": "What type of desserts are they?"
},
{
"role": "assistant",
"content": "The desserts are cakes with bananas and pecans on top. They share similarities with donuts, but the presence of bananas and pecans differentiates them."
},
{
"role": "user",
"content": "What is the setting of the image?"
},
{
"role": "assistant",
"content": "The image is set on a table top with a plate containing the three desserts."
}
]
}
]
Environment
Anything else?
Output:
prepare trainer
Training dataset length: 1
Validation dataset length: 1
<class 'trainer.CPMTrainer'>
trainer ok
Error message:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
Using /public/home/lzu2/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /public/home/lzu2/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /public/home/lzu2/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /public/home/lzu2/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
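The kernel warning above compares the running kernel with a recommended minimum of 5.5.0. The same comparison can be reproduced on any node with the standard library. This is a minimal sketch: `parse_kernel` and `kernel_ok` are hypothetical helper names, and the sample release string with a distribution suffix is only an illustration.

```python
import platform
import re

MIN_KERNEL = (5, 5, 0)  # the minimum quoted in the warning above


def parse_kernel(release):
    """Extract (major, minor, patch) from a release string such as '3.10.0-1160.el7.x86_64'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def kernel_ok(release, minimum=MIN_KERNEL):
    """Return True when the release is at least the recommended minimum."""
    return parse_kernel(release) >= minimum


if __name__ == "__main__":
    release = platform.release()
    print(release, "meets minimum" if kernel_ok(release) else "is below minimum")
```

By this comparison both kernels mentioned in the thread, 3.10.0 and 5.4.0, fall below 5.5.0, so the kernel version alone may not explain why one setup hangs and the other does not.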
Relevant code: