OpenBMB / MiniCPM-V

MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone
Apache License 2.0

[BUG] Model gets stuck at trainer.train() and never starts training #265

Closed limllzu closed 3 months ago

limllzu commented 3 months ago

是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?

该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?

当前行为 | Current Behavior

The dataset loads without any problem, but the model stays stuck at trainer.train() in finetune.py.

期望行为 | Expected Behavior

No response

复现方法 | Steps To Reproduce

Data:

[
    {
        "id": "0",
        "image": "path/image/001.jpg",
        "conversations": [
            {"role": "user", "content": "<image>\nHow many desserts are on the white plate?"},
            {"role": "assistant", "content": "There are three desserts on the white plate."},
            {"role": "user", "content": "What type of desserts are they?"},
            {"role": "assistant", "content": "The desserts are cakes with bananas and pecans on top. They share similarities with donuts, but the presence of bananas and pecans differentiates them."},
            {"role": "user", "content": "What is the setting of the image?"},
            {"role": "assistant", "content": "The image is set on a table top with a plate containing the three desserts."}
        ]
    }
]

运行环境 | Environment

Package environment:
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
absl-py                   2.1.0                    pypi_0    pypi
accelerate                0.30.1                   pypi_0    pypi
addict                    2.4.0                    pypi_0    pypi
aiofiles                  23.2.1                   pypi_0    pypi
altair                    5.3.0                    pypi_0    pypi
annotated-types           0.7.0                    pypi_0    pypi
anyio                     4.4.0                    pypi_0    pypi
attrs                     23.2.0                   pypi_0    pypi
binutils_impl_linux-64    2.36.1               h193b22a_2    conda-forge
binutils_linux-64         2.36                hf3e587d_10    conda-forge
bitsandbytes-cuda114      0.26.0.post2             pypi_0    pypi
blessed                   1.20.0                   pypi_0    pypi
blinker                   1.8.2                    pypi_0    pypi
blis                      0.7.11                   pypi_0    pypi
bzip2                     1.0.8                h5eee18b_6    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
ca-certificates           2024.6.2             hbcca054_0    conda-forge
cachetools                5.3.3                    pypi_0    pypi
catalogue                 2.0.10                   pypi_0    pypi
certifi                   2024.2.2                 pypi_0    pypi
charset-normalizer        3.3.2                    pypi_0    pypi
click                     8.1.7                    pypi_0    pypi
cloudpathlib              0.16.0                   pypi_0    pypi
cmake                     3.25.0                   pypi_0    pypi
colorama                  0.4.6                    pypi_0    pypi
confection                0.1.5                    pypi_0    pypi
contourpy                 1.2.1                    pypi_0    pypi
cycler                    0.12.1                   pypi_0    pypi
cymem                     2.0.8                    pypi_0    pypi
deepspeed                 0.14.3                   pypi_0    pypi
editdistance              0.6.2                    pypi_0    pypi
einops                    0.7.0                    pypi_0    pypi
et-xmlfile                1.1.0                    pypi_0    pypi
exceptiongroup            1.2.1                    pypi_0    pypi
fairscale                 0.4.0                    pypi_0    pypi
fastapi                   0.110.3                  pypi_0    pypi
ffmpy                     0.3.2                    pypi_0    pypi
filelock                  3.14.0                   pypi_0    pypi
flask                     3.0.3                    pypi_0    pypi
fonttools                 4.53.0                   pypi_0    pypi
fsspec                    2024.5.0                 pypi_0    pypi
gcc_impl_linux-64         11.2.0              h82a94d6_16    conda-forge
gcc_linux-64              11.2.0              h39a9532_10    conda-forge
gpustat                   1.1.1                    pypi_0    pypi
gradio                    4.26.0                   pypi_0    pypi
gradio-client             0.15.1                   pypi_0    pypi
grpcio                    1.64.1                   pypi_0    pypi
gxx_impl_linux-64         11.2.0              h82a94d6_16    conda-forge
gxx_linux-64              11.2.0              hacbe6df_10    conda-forge
h11                       0.14.0                   pypi_0    pypi
hjson                     3.1.0                    pypi_0    pypi
httpcore                  1.0.5                    pypi_0    pypi
httpx                     0.27.0                   pypi_0    pypi
huggingface-hub           0.23.2                   pypi_0    pypi
idna                      3.7                      pypi_0    pypi
importlib-resources       6.4.0                    pypi_0    pypi
install                   1.3.5                    pypi_0    pypi
itsdangerous              2.2.0                    pypi_0    pypi
jinja2                    3.1.4                    pypi_0    pypi
joblib                    1.4.2                    pypi_0    pypi
jsonlines                 4.0.0                    pypi_0    pypi
jsonschema                4.22.0                   pypi_0    pypi
jsonschema-specifications 2023.12.1                pypi_0    pypi
kernel-headers_linux-64   2.6.32              he073ed8_17    conda-forge
kiwisolver                1.4.5                    pypi_0    pypi
langcodes                 3.4.0                    pypi_0    pypi
language-data             1.2.0                    pypi_0    pypi
ld_impl_linux-64          2.36.1               hea4e1c9_2    conda-forge
libaio                    0.9.3                    pypi_0    pypi
libffi                    3.4.4                h6a678d5_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libgcc-devel_linux-64     11.2.0              h0952999_16    conda-forge
libgcc-ng                 13.2.0               h77fa898_7    conda-forge
libgomp                   13.2.0               h77fa898_7    conda-forge
libsanitizer              11.2.0              he4da1e4_16    conda-forge
libstdcxx-devel_linux-64  11.2.0              h0952999_16    conda-forge
libstdcxx-ng              13.2.0               hc0a3c3a_7    conda-forge
libuuid                   1.41.5               h5eee18b_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
lit                       15.0.7                   pypi_0    pypi
lxml                      5.2.2                    pypi_0    pypi
marisa-trie               1.1.1                    pypi_0    pypi
markdown                  3.6                      pypi_0    pypi
markdown-it-py            3.0.0                    pypi_0    pypi
markdown2                 2.4.10                   pypi_0    pypi
markupsafe                2.1.5                    pypi_0    pypi
matplotlib                3.7.4                    pypi_0    pypi
mdurl                     0.1.2                    pypi_0    pypi
more-itertools            10.1.0                   pypi_0    pypi
mpmath                    1.3.0                    pypi_0    pypi
murmurhash                1.0.10                   pypi_0    pypi
ncurses                   6.4                  h6a678d5_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
networkx                  3.3                      pypi_0    pypi
ninja                     1.10.0                   pypi_0    pypi
ninja-base                1.10.2               hd09550d_5    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
nltk                      3.8.1                    pypi_0    pypi
numpy                     1.24.4                   pypi_0    pypi
nvidia-cublas-cu12        12.1.3.1                 pypi_0    pypi
nvidia-cuda-cupti-cu12    12.1.105                 pypi_0    pypi
nvidia-cuda-nvrtc-cu12    12.1.105                 pypi_0    pypi
nvidia-cuda-runtime-cu12  12.1.105                 pypi_0    pypi
nvidia-cudnn-cu12         8.9.2.26                 pypi_0    pypi
nvidia-cufft-cu12         11.0.2.54                pypi_0    pypi
nvidia-curand-cu12        10.3.2.106               pypi_0    pypi
nvidia-cusolver-cu12      11.4.5.107               pypi_0    pypi
nvidia-cusparse-cu12      12.1.0.106               pypi_0    pypi
nvidia-ml-py              12.535.161               pypi_0    pypi
nvidia-nccl-cu12          2.18.1                   pypi_0    pypi
nvidia-nvjitlink-cu12     12.5.40                  pypi_0    pypi
nvidia-nvtx-cu12          12.1.105                 pypi_0    pypi
nvitop                    1.3.2                    pypi_0    pypi
opencv-python-headless    4.5.5.64                 pypi_0    pypi
openpyxl                  3.1.2                    pypi_0    pypi
openssl                   3.3.1                h4ab18f5_0    conda-forge
orjson                    3.10.3                   pypi_0    pypi
packaging                 23.2                     pypi_0    pypi
pandas                    2.2.2                    pypi_0    pypi
peft                      0.11.1                   pypi_0    pypi
pillow                    10.1.0                   pypi_0    pypi
pip                       24.0            py310h06a4308_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
portalocker               2.8.2                    pypi_0    pypi
preshed                   3.0.9                    pypi_0    pypi
protobuf                  4.25.0                   pypi_0    pypi
psutil                    5.9.8                    pypi_0    pypi
py-cpuinfo                9.0.0                    pypi_0    pypi
pydantic                  2.7.2                    pypi_0    pypi
pydantic-core             2.18.3                   pypi_0    pypi
pydub                     0.25.1                   pypi_0    pypi
pygments                  2.18.0                   pypi_0    pypi
pynvml                    11.5.0                   pypi_0    pypi
pyparsing                 3.1.2                    pypi_0    pypi
pyproject                 1.3.1                    pypi_0    pypi
python                    3.10.14              h955ad1f_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
python-dateutil           2.9.0.post0              pypi_0    pypi
python-multipart          0.0.9                    pypi_0    pypi
pytz                      2024.1                   pypi_0    pypi
pyyaml                    6.0.1                    pypi_0    pypi
readline                  8.2                  h5eee18b_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
referencing               0.35.1                   pypi_0    pypi
regex                     2024.5.15                pypi_0    pypi
requests                  2.32.3                   pypi_0    pypi
rich                      13.7.1                   pypi_0    pypi
rpds-py                   0.18.1                   pypi_0    pypi
ruff                      0.4.7                    pypi_0    pypi
sacrebleu                 2.3.2                    pypi_0    pypi
safetensors               0.4.3                    pypi_0    pypi
seaborn                   0.13.0                   pypi_0    pypi
semantic-version          2.10.0                   pypi_0    pypi
sentencepiece             0.1.99                   pypi_0    pypi
setuptools                69.5.1          py310h06a4308_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
shellingham               1.5.4                    pypi_0    pypi
shortuuid                 1.0.11                   pypi_0    pypi
six                       1.16.0                   pypi_0    pypi
smart-open                6.4.0                    pypi_0    pypi
sniffio                   1.3.1                    pypi_0    pypi
socksio                   1.0.0                    pypi_0    pypi
spacy                     3.7.2                    pypi_0    pypi
spacy-legacy              3.0.12                   pypi_0    pypi
spacy-loggers             1.0.5                    pypi_0    pypi
sqlite                    3.45.3               h5eee18b_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
srsly                     2.4.8                    pypi_0    pypi
starlette                 0.37.2                   pypi_0    pypi
sympy                     1.12.1                   pypi_0    pypi
sysroot_linux-64          2.12                he073ed8_17    conda-forge
tabulate                  0.9.0                    pypi_0    pypi
tensorboard               2.16.2                   pypi_0    pypi
tensorboard-data-server   0.7.2                    pypi_0    pypi
tensorboardx              1.8                      pypi_0    pypi
termcolor                 2.4.0                    pypi_0    pypi
thinc                     8.2.3                    pypi_0    pypi
timm                      0.9.10                   pypi_0    pypi
tk                        8.6.14               h39e8969_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
tokenizers                0.19.1                   pypi_0    pypi
tomlkit                   0.12.0                   pypi_0    pypi
toolz                     0.12.1                   pypi_0    pypi
torch                     2.1.2+cu118              pypi_0    pypi
torchaudio                2.1.2+cu118              pypi_0    pypi
torchvision               0.16.2+cu118             pypi_0    pypi
tqdm                      4.66.1                   pypi_0    pypi
transformers              4.41.2                   pypi_0    pypi
triton                    2.1.0                    pypi_0    pypi
typer                     0.9.4                    pypi_0    pypi
typing-extensions         4.8.0                    pypi_0    pypi
tzdata                    2024.1                   pypi_0    pypi
urllib3                   2.2.1                    pypi_0    pypi
uvicorn                   0.24.0.post1             pypi_0    pypi
wasabi                    1.1.3                    pypi_0    pypi
wcwidth                   0.2.13                   pypi_0    pypi
weasel                    0.3.4                    pypi_0    pypi
websockets                11.0.3                   pypi_0    pypi
werkzeug                  3.0.3                    pypi_0    pypi
wheel                     0.43.0          py310h06a4308_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
xz                        5.4.6                h5eee18b_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
zlib                      1.2.13               h5eee18b_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main

备注 | Anything else?

Output:

prepare trainer
Training dataset length: 1
Validation dataset length: 1
<class 'trainer.CPMTrainer'>
trainer ok

Warning messages:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
Using /public/home/lzu2/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /public/home/lzu2/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /public/home/lzu2/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /public/home/lzu2/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
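Since the log stops before any training step and the kernel warning explicitly mentions possible hangs, it can help to check the NCCL setup independently of finetune.py. A minimal smoke test, as a sketch (assuming the same 4 local GPUs; save it as, say, nccl_check.py and run it with torchrun --nproc_per_node=4 nccl_check.py):

import os
import torch
import torch.distributed as dist

# torchrun sets LOCAL_RANK, RANK and WORLD_SIZE for every spawned process.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

# A single all_reduce: if this hangs or fails, the problem is in the NCCL/network
# configuration rather than in the trainer or the dataset.
x = torch.ones(1, device=f"cuda:{local_rank}")
dist.all_reduce(x)
print(f"rank {dist.get_rank()}: all_reduce ok, value = {x.item()}")

dist.destroy_process_group()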

Relevant code:

# Check the dataset lengths
print(f"Training dataset length: {len(data_module['train_dataset'])}")
print(f"Validation dataset length: {len(data_module['eval_dataset'])}")

rank0_print("prepare trainer")

trainer = CPMTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    **data_module,
)

rank0_print(type(trainer))

rank0_print("trainer ok")

trainer.train()

trainer.save_state()

rank0_print("trainer sucess")
qyc-98 commented 3 months ago

How large is your dataset?

limllzu commented 3 months ago

How large is your dataset?

The dataset contains only one sample, the one provided in the official demo, as follows:

[
    {
        "id": "0",
        "image": "path/image/image_0.jpg",
        "conversations": [
            {
              "role": "user", 
              "content": "<image>\nHow many desserts are on the white plate?"
            }, 
            {
                "role": "assistant", 
                "content": "There are three desserts on the white plate."
            },   
            {
                "role": "user", 
                "content": "What type of desserts are they?"
            },
            {
                "role": "assistant", 
                "content": "The desserts are cakes with bananas and pecans on top. They share similarities with donuts, but the presence of bananas and pecans differentiates them."
            }, 
            {
                "role": "user", 
                "content": "What is the setting of the image?"
            }, 
            {
                "role": "assistant", 
                "content": "The image is set on a table top with a plate containing the three desserts."
            }
        ]
    }
]
qyc-98 commented 3 months ago

Here is my environment for reference: requirements.txt. My Linux kernel version is 5.4.0; I'm not sure whether your issue is caused by your kernel being 3.10.0.

limllzu commented 3 months ago

Here is my environment for reference: requirements.txt. My Linux kernel version is 5.4.0; I'm not sure whether your issue is caused by your kernel being 3.10.0.

Thank you for your reply. It seems the problem was that NCCL had not been installed. After installing it, however, a new problem appeared. Could you help me take a look? Thanks.

Error message:

Traceback (most recent call last):
  File "/LLM/openbmb/MiniCPM-V/finetune/finetune.py", line 708, in <module>
    train()
  File "/LLM/openbmb/MiniCPM-V/finetune/finetune.py", line 690, in train
    trainer.train()
  File "/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
  File "/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 2045, in _inner_training_loop
    model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
  File "/envs/llm/lib/python3.10/site-packages/accelerate/accelerator.py", line 1284, in prepare
    result = self._prepare_deepspeed(*args)
  File "/envs/llm/lib/python3.10/site-packages/accelerate/accelerator.py", line 1751, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/envs/llm/lib/python3.10/site-packages/deepspeed/__init__.py", line 181, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 262, in __init__
    self._configure_distributed_model(model)
  File "/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1149, in _configure_distributed_model
    self._broadcast_model()
  File "/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1069, in _broadcast_model
    dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
  File "/envs/llm/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
    return func(*args, **kwargs)
  File "/envs/llm/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
    return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/envs/llm/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/envs/llm/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 199, in broadcast
    return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/envs/llm/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/envs/llm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1910, in broadcast
    work = group.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1251, internal error - please report this issue to the NCCL developers, NCCL version 2.18.6
ncclInternalError: Internal check failed.
Last error: Bootstrap : no socket interface found
[2024-06-14 16:05:19,306] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 30898 closing signal SIGTERM
[2024-06-14 16:05:19,306] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 30899 closing signal SIGTERM
[2024-06-14 16:05:19,307] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 30900 closing signal SIGTERM
[2024-06-14 16:05:20,135] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 30897) of binary: /envs/llm/bin/python

Output:

prepare trainer
trainer ok
[2024-06-14 16:05:12,128] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.3, git-hash=unknown, git-branch=unknown
gpu009:30897:30897 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens1f1
gpu009:30897:30897 [0] NCCL INFO NCCL_SOCKET_IFNAME set to ens1f1
gpu009:30897:30897 [0] bootstrap.cc:45 NCCL WARN Bootstrap : no socket interface found
gpu009:30897:30897 [0] NCCL INFO init.cc:82 -> 3
gpu009:30897:30897 [0] NCCL INFO init.cc:101 -> 3

From the logs it looks like the network interface is not configured correctly, but when I check with ifconfig the ens1f1 interface does exist and can be pinged. Could you please take a look? Thank you!
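Not an official fix, just a debugging sketch: "Bootstrap : no socket interface found" means NCCL could not bind to the interface named in NCCL_SOCKET_IFNAME. Setting the NCCL environment variables explicitly before torch.distributed / DeepSpeed initializes (for example at the top of finetune.py or in the launch script) at least makes NCCL log which interface it actually picks. The interface names below are the ones mentioned in this thread:

import os

# Must be set before init_process_group / DeepSpeed initialization.
os.environ.setdefault("NCCL_DEBUG", "INFO")            # verbose NCCL logging
os.environ.setdefault("NCCL_SOCKET_IFNAME", "ens1f1")  # interface from ifconfig (ens1f1 or ib0 here)
os.environ.setdefault("NCCL_IB_DISABLE", "1")          # fall back to TCP sockets if InfiniBand is misconfigured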

limllzu commented 3 months ago

When I switch the network interface to ib0 it no longer throws an error, but judging by the NCCL logs it is still hanging and never starts training.

Output:

prepare trainer Training dataset length: 1 Validation dataset length: 1 trainer ok [2024-06-14 16:59:42,697] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.3, git-hash=unknown, git-branch=unknown gpu009:47867:47867 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0 gpu009:47867:47867 [0] NCCL INFO NCCL_SOCKET_IFNAME set to ib0 gpu009:47867:47867 [0] NCCL INFO Bootstrap : Using ib0:11.11.8.9<0> gpu009:47867:47867 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory gpu009:47867:47867 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation gpu009:47868:47868 [1] NCCL INFO cudaDriverVersion 12000 gpu009:47869:47869 [2] NCCL INFO cudaDriverVersion 12000 gpu009:47868:47868 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0 gpu009:47868:47868 [1] NCCL INFO NCCL_SOCKET_IFNAME set to ib0 gpu009:47869:47869 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0 gpu009:47869:47869 [2] NCCL INFO NCCL_SOCKET_IFNAME set to ib0 gpu009:47870:47870 [3] NCCL INFO cudaDriverVersion 12000 gpu009:47868:47868 [1] NCCL INFO Bootstrap : Using ib0:11.11.8.9<0> gpu009:47869:47869 [2] NCCL INFO Bootstrap : Using ib0:11.11.8.9<0> gpu009:47869:47869 [2] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory gpu009:47868:47868 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory gpu009:47869:47869 [2] NCCL INFO NET/Plugin : No plugin found, using internal implementation gpu009:47868:47868 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation gpu009:47870:47870 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0 gpu009:47870:47870 [3] NCCL INFO NCCL_SOCKET_IFNAME set to ib0 gpu009:47869:47869 [2] NCCL INFO init.cc:1584 Cuda Host Alloc Size 4 pointer 0x7fc46e800000 gpu009:47868:47868 [1] NCCL INFO init.cc:1584 Cuda Host Alloc Size 4 pointer 0x7fc390800000 gpu009:47870:47870 [3] NCCL INFO Bootstrap : Using ib0:11.11.8.9<0> gpu009:47870:47870 [3] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory gpu009:47870:47870 [3] NCCL INFO NET/Plugin : No plugin found, using internal implementation gpu009:47870:47870 [3] NCCL INFO init.cc:1584 Cuda Host Alloc Size 4 pointer 0x7fbfc0800000 gpu009:47869:48590 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 1. gpu009:47869:48590 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0 gpu009:47869:48590 [2] NCCL INFO NET/Socket : Using [0]ib0:11.11.8.9<0> gpu009:47869:48590 [2] NCCL INFO Using network Socket gpu009:47868:48591 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1. gpu009:47868:48591 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0 gpu009:47867:47867 [0] NCCL INFO cudaDriverVersion 12000 NCCL version 2.18.6+cuda12.1 gpu009:47868:48591 [1] NCCL INFO NET/Socket : Using [0]ib0:11.11.8.9<0> gpu009:47868:48591 [1] NCCL INFO Using network Socket gpu009:47870:48592 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
gpu009:47867:47867 [0] NCCL INFO init.cc:1584 Cuda Host Alloc Size 4 pointer 0x7f2adc800000 gpu009:47870:48592 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0 gpu009:47870:48592 [3] NCCL INFO NET/Socket : Using [0]ib0:11.11.8.9<0> gpu009:47870:48592 [3] NCCL INFO Using network Socket gpu009:47867:48593 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1. gpu009:47867:48593 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ib0 gpu009:47867:48593 [0] NCCL INFO NET/Socket : Using [0]ib0:11.11.8.9<0> gpu009:47867:48593 [0] NCCL INFO Using network Socket gpu009:47867:48593 [0] NCCL INFO comm 0x7ef5c880 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 50000 commId 0x361b5540a6088610 - Init START gpu009:47870:48592 [3] NCCL INFO comm 0x68da1c00 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 9c000 commId 0x361b5540a6088610 - Init START gpu009:47869:48590 [2] NCCL INFO comm 0x69374b40 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 57000 commId 0x361b5540a6088610 - Init START gpu009:47868:48591 [1] NCCL INFO comm 0x68b107c0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 53000 commId 0x361b5540a6088610 - Init START gpu009:47870:48592 [3] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'ib0' gpu009:47868:48591 [1] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'ib0' gpu009:47869:48590 [2] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'ib0' gpu009:47867:48593 [0] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'ib0' gpu009:47870:48592 [3] NCCL INFO NCCL_P2P_LEVEL set by environment to NVL gpu009:47870:48592 [3] NCCL INFO === System : maxBw 24.0 totalBw 24.0 === gpu009:47870:48592 [3] NCCL INFO CPU/0 (1/1/2) gpu009:47870:48592 [3] NCCL INFO + PCI[24.0] - PCI/4B000 (1000c01010000000) gpu009:47870:48592 [3] NCCL INFO + PCI[24.0] - GPU/50000 (0) gpu009:47870:48592 [3] NCCL INFO + PCI[24.0] - GPU/53000 (1) gpu009:47870:48592 [3] NCCL INFO + PCI[24.0] - NIC/56000 gpu009:47870:48592 [3] NCCL INFO + PCI[24.0] - GPU/57000 (2) gpu009:47870:48592 [3] NCCL INFO + SYS[10.0] - CPU/1 gpu009:47870:48592 [3] NCCL INFO CPU/1 (1/1/2) gpu009:47870:48592 [3] NCCL INFO + PCI[24.0] - PCI/98000 (1000c01010000000) gpu009:47870:48592 [3] NCCL INFO + PCI[24.0] - GPU/9C000 (3) gpu009:47870:48592 [3] NCCL INFO + SYS[10.0] - CPU/0 gpu009:47870:48592 [3] NCCL INFO ========================================== gpu009:47870:48592 [3] NCCL INFO GPU/50000 :GPU/50000 (0/5000.000000/LOC) GPU/53000 (4/24.000000/PHB) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) gpu009:47868:48591 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to NVL gpu009:47870:48592 [3] NCCL INFO GPU/53000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (0/5000.000000/LOC) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) gpu009:47870:48592 [3] NCCL INFO GPU/57000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (4/24.000000/PHB) GPU/57000 (0/5000.000000/LOC) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) gpu009:47870:48592 [3] NCCL INFO GPU/9C000 :GPU/50000 (5/10.000000/SYS) GPU/53000 (5/10.000000/SYS) GPU/57000 (5/10.000000/SYS) GPU/9C000 (0/5000.000000/LOC) CPU/0 (3/10.000000/SYS) CPU/1 (2/24.000000/PHB) gpu009:47870:48592 [3] NCCL INFO Setting affinity for GPU 3 to 3ff00000,0000003f,f0000000 gpu009:47870:48592 [3] NCCL INFO NVLS multicast support is not available on dev 3 gpu009:47869:48590 [2] NCCL INFO NCCL_P2P_LEVEL set by environment to NVL gpu009:47867:48593 [0] NCCL INFO NCCL_P2P_LEVEL set by environment 
to NVL gpu009:47868:48591 [1] NCCL INFO === System : maxBw 24.0 totalBw 24.0 === gpu009:47868:48591 [1] NCCL INFO CPU/0 (1/1/2) gpu009:47868:48591 [1] NCCL INFO + PCI[24.0] - PCI/4B000 (1000c01010000000) gpu009:47868:48591 [1] NCCL INFO + PCI[24.0] - GPU/50000 (0) gpu009:47868:48591 [1] NCCL INFO + PCI[24.0] - GPU/53000 (1) gpu009:47868:48591 [1] NCCL INFO + PCI[24.0] - NIC/56000 gpu009:47868:48591 [1] NCCL INFO + PCI[24.0] - GPU/57000 (2) gpu009:47868:48591 [1] NCCL INFO + SYS[10.0] - CPU/1 gpu009:47868:48591 [1] NCCL INFO CPU/1 (1/1/2) gpu009:47868:48591 [1] NCCL INFO + PCI[24.0] - PCI/98000 (1000c01010000000) gpu009:47868:48591 [1] NCCL INFO + PCI[24.0] - GPU/9C000 (3) gpu009:47869:48590 [2] NCCL INFO === System : maxBw 24.0 totalBw 24.0 === gpu009:47868:48591 [1] NCCL INFO + SYS[10.0] - CPU/0 gpu009:47867:48593 [0] NCCL INFO === System : maxBw 24.0 totalBw 24.0 === gpu009:47869:48590 [2] NCCL INFO CPU/0 (1/1/2) gpu009:47868:48591 [1] NCCL INFO ========================================== gpu009:47867:48593 [0] NCCL INFO CPU/0 (1/1/2) gpu009:47869:48590 [2] NCCL INFO + PCI[24.0] - PCI/4B000 (1000c01010000000) gpu009:47868:48591 [1] NCCL INFO GPU/50000 :GPU/50000 (0/5000.000000/LOC) GPU/53000 (4/24.000000/PHB) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) gpu009:47867:48593 [0] NCCL INFO + PCI[24.0] - PCI/4B000 (1000c01010000000) gpu009:47869:48590 [2] NCCL INFO + PCI[24.0] - GPU/50000 (0) gpu009:47868:48591 [1] NCCL INFO GPU/53000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (0/5000.000000/LOC) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) gpu009:47867:48593 [0] NCCL INFO + PCI[24.0] - GPU/50000 (0) gpu009:47869:48590 [2] NCCL INFO + PCI[24.0] - GPU/53000 (1) gpu009:47868:48591 [1] NCCL INFO GPU/57000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (4/24.000000/PHB) GPU/57000 (0/5000.000000/LOC) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) gpu009:47867:48593 [0] NCCL INFO + PCI[24.0] - GPU/53000 (1) gpu009:47869:48590 [2] NCCL INFO + PCI[24.0] - NIC/56000 gpu009:47868:48591 [1] NCCL INFO GPU/9C000 :GPU/50000 (5/10.000000/SYS) GPU/53000 (5/10.000000/SYS) GPU/57000 (5/10.000000/SYS) GPU/9C000 (0/5000.000000/LOC) CPU/0 (3/10.000000/SYS) CPU/1 (2/24.000000/PHB) gpu009:47867:48593 [0] NCCL INFO + PCI[24.0] - NIC/56000 gpu009:47869:48590 [2] NCCL INFO + PCI[24.0] - GPU/57000 (2) gpu009:47870:48592 [3] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 7.000000/7.000000, type SYS/PIX, sameChannels 1 gpu009:47868:48591 [1] NCCL INFO Setting affinity for GPU 1 to 03ff,00000000,0003ff00 gpu009:47867:48593 [0] NCCL INFO + PCI[24.0] - GPU/57000 (2) gpu009:47869:48590 [2] NCCL INFO + SYS[10.0] - CPU/1 gpu009:47870:48592 [3] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3 gpu009:47868:48591 [1] NCCL INFO NVLS multicast support is not available on dev 1 gpu009:47867:48593 [0] NCCL INFO + SYS[10.0] - CPU/1 gpu009:47869:48590 [2] NCCL INFO CPU/1 (1/1/2) gpu009:47867:48593 [0] NCCL INFO CPU/1 (1/1/2) gpu009:47869:48590 [2] NCCL INFO + PCI[24.0] - PCI/98000 (1000c01010000000) gpu009:47867:48593 [0] NCCL INFO + PCI[24.0] - PCI/98000 (1000c01010000000) gpu009:47870:48592 [3] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 10.000000/10.000000, type SYS/PIX, sameChannels 1 gpu009:47869:48590 [2] NCCL INFO + PCI[24.0] - GPU/9C000 (3) gpu009:47867:48593 [0] NCCL INFO + PCI[24.0] - GPU/9C000 (3) gpu009:47870:48592 [3] NCCL INFO 0 : GPU/0 GPU/1 GPU/3 GPU/2 gpu009:47869:48590 [2] NCCL INFO + 
SYS[10.0] - CPU/0 gpu009:47867:48593 [0] NCCL INFO + SYS[10.0] - CPU/0 gpu009:47869:48590 [2] NCCL INFO ========================================== gpu009:47867:48593 [0] NCCL INFO ========================================== gpu009:47869:48590 [2] NCCL INFO GPU/50000 :GPU/50000 (0/5000.000000/LOC) GPU/53000 (4/24.000000/PHB) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) gpu009:47867:48593 [0] NCCL INFO GPU/50000 :GPU/50000 (0/5000.000000/LOC) GPU/53000 (4/24.000000/PHB) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) gpu009:47869:48590 [2] NCCL INFO GPU/53000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (0/5000.000000/LOC) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) gpu009:47867:48593 [0] NCCL INFO GPU/53000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (0/5000.000000/LOC) GPU/57000 (4/24.000000/PHB) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) gpu009:47869:48590 [2] NCCL INFO GPU/57000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (4/24.000000/PHB) GPU/57000 (0/5000.000000/LOC) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) gpu009:47867:48593 [0] NCCL INFO GPU/57000 :GPU/50000 (4/24.000000/PHB) GPU/53000 (4/24.000000/PHB) GPU/57000 (0/5000.000000/LOC) GPU/9C000 (5/10.000000/SYS) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) gpu009:47869:48590 [2] NCCL INFO GPU/9C000 :GPU/50000 (5/10.000000/SYS) GPU/53000 (5/10.000000/SYS) GPU/57000 (5/10.000000/SYS) GPU/9C000 (0/5000.000000/LOC) CPU/0 (3/10.000000/SYS) CPU/1 (2/24.000000/PHB) gpu009:47867:48593 [0] NCCL INFO GPU/9C000 :GPU/50000 (5/10.000000/SYS) GPU/53000 (5/10.000000/SYS) GPU/57000 (5/10.000000/SYS) GPU/9C000 (0/5000.000000/LOC) CPU/0 (3/10.000000/SYS) CPU/1 (2/24.000000/PHB) gpu009:47869:48590 [2] NCCL INFO Setting affinity for GPU 2 to 03ff,00000000,0003ff00 gpu009:47867:48593 [0] NCCL INFO Setting affinity for GPU 0 to 03ff,00000000,0003ff00 gpu009:47869:48590 [2] NCCL INFO NVLS multicast support is not available on dev 2 gpu009:47867:48593 [0] NCCL INFO NVLS multicast support is not available on dev 0 gpu009:47868:48591 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 7.000000/7.000000, type SYS/PIX, sameChannels 1 gpu009:47868:48591 [1] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3 gpu009:47868:48591 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 10.000000/10.000000, type SYS/PIX, sameChannels 1 gpu009:47868:48591 [1] NCCL INFO 0 : GPU/0 GPU/1 GPU/3 GPU/2 gpu009:47869:48590 [2] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 7.000000/7.000000, type SYS/PIX, sameChannels 1 gpu009:47869:48590 [2] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3 gpu009:47867:48593 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 7.000000/7.000000, type SYS/PIX, sameChannels 1 gpu009:47867:48593 [0] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3 gpu009:47869:48590 [2] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 10.000000/10.000000, type SYS/PIX, sameChannels 1 gpu009:47869:48590 [2] NCCL INFO 0 : GPU/0 GPU/1 GPU/3 GPU/2 gpu009:47867:48593 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 10.000000/10.000000, type SYS/PIX, sameChannels 1 gpu009:47867:48593 [0] NCCL INFO 0 : GPU/0 GPU/1 GPU/3 GPU/2 gpu009:47870:48592 [3] NCCL INFO Ring 00 : 2 -> 3 -> 0 gpu009:47870:48592 [3] NCCL INFO Ring 01 : 2 -> 3 -> 0 gpu009:47868:48591 [1] NCCL INFO Tree 0 : 0 -> 1 -> 3/-1/-1 gpu009:47870:48592 [3] NCCL INFO Trees [0] 2/-1/-1->3->1 [1] 2/-1/-1->3->1 
gpu009:47867:48593 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 gpu009:47868:48591 [1] NCCL INFO Tree 1 : 0 -> 1 -> 3/-1/-1 gpu009:47869:48590 [2] NCCL INFO Ring 00 : 1 -> 2 -> 3 gpu009:47870:48592 [3] NCCL INFO P2P Chunksize set to 131072 gpu009:47867:48593 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 gpu009:47868:48591 [1] NCCL INFO Ring 00 : 0 -> 1 -> 2 gpu009:47869:48590 [2] NCCL INFO Ring 01 : 1 -> 2 -> 3 gpu009:47868:48591 [1] NCCL INFO Ring 01 : 0 -> 1 -> 2 gpu009:47869:48590 [2] NCCL INFO Trees [0] -1/-1/-1->2->3 [1] -1/-1/-1->2->3 gpu009:47867:48593 [0] NCCL INFO Channel 00/02 : 0 1 2 3 gpu009:47870:48592 [3] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536) gpu009:47868:48591 [1] NCCL INFO Trees [0] 3/-1/-1->1->0 [1] 3/-1/-1->1->0 gpu009:47869:48590 [2] NCCL INFO P2P Chunksize set to 131072 gpu009:47867:48593 [0] NCCL INFO Channel 01/02 : 0 1 2 3 gpu009:47868:48591 [1] NCCL INFO P2P Chunksize set to 131072 gpu009:47867:48593 [0] NCCL INFO Ring 00 : 3 -> 0 -> 1 gpu009:47867:48593 [0] NCCL INFO Ring 01 : 3 -> 0 -> 1 gpu009:47867:48593 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 gpu009:47867:48593 [0] NCCL INFO P2P Chunksize set to 131072 gpu009:47867:48593 [0] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536) gpu009:47868:48591 [1] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536) gpu009:47869:48590 [2] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536) gpu009:47870:48592 [3] NCCL INFO channel.cc:40 Cuda Alloc Size 1536 pointer 0x7fbfc1a00000 gpu009:47870:48592 [3] NCCL INFO channel.cc:43 Cuda Alloc Size 72 pointer 0x7fbfc1a00600 gpu009:47870:48592 [3] NCCL INFO channel.cc:54 Cuda Alloc Size 16 pointer 0x7fbfc1a00800 gpu009:47870:48592 [3] NCCL INFO channel.cc:40 Cuda Alloc Size 1536 pointer 0x7fbfc1a00a00 gpu009:47870:48592 [3] NCCL INFO channel.cc:43 Cuda Alloc Size 72 pointer 0x7fbfc1a01000 gpu009:47870:48592 [3] NCCL INFO channel.cc:54 Cuda Alloc Size 16 pointer 0x7fbfc1a01200 gpu009:47870:48592 [3] NCCL INFO Allocated 9637892 bytes of shared memory in /dev/shm/nccl-AP8lNO gpu009:47867:48593 [0] NCCL INFO channel.cc:40 Cuda Alloc Size 1536 pointer 0x7f2adda00000 gpu009:47868:48591 [1] NCCL INFO channel.cc:40 Cuda Alloc Size 1536 pointer 0x7fc391a00000
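One more hedged suggestion: giving the process group an explicit timeout turns a silent hang into an error that names the blocked collective. This is only a sketch; with finetune_ds.sh the process group is created by torchrun/DeepSpeed, so it applies only if you initialize distributed yourself:

from datetime import timedelta
import torch.distributed as dist

# With a 10-minute timeout, a stuck broadcast/all_reduce raises instead of hanging forever.
# Note: for the NCCL backend the timeout is enforced only when async error handling or
# blocking wait is enabled (NCCL_ASYNC_ERROR_HANDLING / NCCL_BLOCKING_WAIT).
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=10))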

yihp commented 3 months ago

Which fine-tuning framework are you using?

limllzu commented 3 months ago

Which fine-tuning framework are you using?

We use the full-parameter fine-tuning setup, i.e. finetune_ds.sh.