yyq commented 1 year ago

I'm trying the demo code with CUDA 12.1. Here is my environment information:

Running !python -c "import torch;print(torch.cuda.nccl.version())" returns (2, 14, 3).

Below is the full import error traceback:

ImportError                               Traceback (most recent call last)
Cell In[10], line 1
----> 1 from cpm_live.generation.bee import CPMBeeBeamSearch
      2 from cpm_live.models import CPMBeeTorch, CPMBeeConfig
      3 from cpm_live.tokenizers import CPMBeeTokenizer

File /workspace/cpm_live/generation/__init__.py:1
----> 1 from .ant import CPMAntBeamSearch, CPMAntRandomSampling, CPMAntGeneration

File /workspace/cpm_live/generation/ant.py:4
      2 import torch.nn.functional as F
      3 from .generation_utils import BeamHypotheses, apply_repetition_penalty, top_k_top_p_filtering
----> 4 from ..utils import pad
      7 class CPMAntGeneration:
      8     def __init__(self, model, tokenizer, prompt_length=32):

File /workspace/cpm_live/utils/__init__.py:1
----> 1 from .config import Config
      2 from .data_utils import pad
      3 from .object import allgather_objects

File /workspace/cpm_live/utils/config.py:20
     18 import copy
     19 from typing import Any, Dict, Union
---> 20 from .log import logger
     23 def load_dataset_config(dataset_path: str):
     24     cfg = json.load(open(dataset_path, "r", encoding="utf-8"))

File /workspace/cpm_live/utils/log.py:7
      5 import json
      6 import logging
----> 7 import bmtrain as bmt
     10 # Set up the common logger
     11 def _get_logger():

File /usr/local/lib/python3.10/dist-packages/bmtrain/__init__.py:2
      1 from .global_var import config, world_size, rank
----> 2 from .init import init_distributed
      4 from .parameter import DistributedParameter, ParameterInitializer
      5 from .layer import DistributedModule

File /usr/local/lib/python3.10/dist-packages/bmtrain/init.py:8
      6 from .utils import print_dict
      7 from .global_var import config
----> 8 from . import nccl
      9 from .synchronize import synchronize
     10 def init_distributed(
     11     init_method : str = "env://",
     12     seed : int = 0,
   (...)
     15     num_micro_batches: int = None,
     16 ):

File /usr/local/lib/python3.10/dist-packages/bmtrain/nccl/__init__.py:4
      2 from typing_extensions import Literal
      3 import torch
----> 4 from . import _C as C
      5 from .enums import *
      7 class NCCLCommunicator:

ImportError: /usr/local/lib/python3.10/dist-packages/bmtrain/nccl/_C.cpython-310-x86_64-linux-gnu.so: undefined symbol: ncclBroadcast
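
The final line of the traceback means the dynamic loader could not resolve ncclBroadcast while loading bmtrain's compiled extension. One way to narrow this down is to check whether a given NCCL shared library actually exports that symbol. The following is a minimal sketch using only the standard library; the helper name exports_symbol is hypothetical, and which library path to probe depends on your own system.

```python
import ctypes


def exports_symbol(library_path, symbol):
    """Return True if `symbol` can be resolved through the shared library.

    Passing library_path=None inspects the symbols already visible in the
    current process (POSIX only). Returns False when the library cannot be
    loaded at all, so a False result does not by itself prove the symbol
    is absent.
    """
    try:
        lib = ctypes.CDLL(library_path)
    except OSError:
        return False  # missing file, wrong architecture, or unresolved deps
    # ctypes raises AttributeError when the lookup fails, which hasattr
    # converts into False.
    return hasattr(lib, symbol)


if __name__ == "__main__":
    # Sanity check against a symbol that every C runtime provides.
    print(exports_symbol(None, "printf"))
```

Point the helper at a candidate libnccl file rather than at the bmtrain extension itself: the extension fails to load for the very reason being investigated, so probing it directly only returns False.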

MayDomine commented 1 year ago

To check whether the CUDA version used to compile your Torch C++ plugin matches the runtime version of your current CUDA Toolkit, you can use the following Python command:

import torch
print(torch.version.cuda)

This command will print the CUDA version that was used to compile PyTorch. Please ensure that this version matches the version of your installed CUDA Toolkit.

In addition, please note that PyTorch versions 2.0.0 and above are not yet supported. Make sure your installed PyTorch version is below 2.0.0. You can check it with the following Python command:

import torch
print(torch.__version__)

If your PyTorch version is not compatible, please downgrade PyTorch to a compatible version using pip or conda, depending on how you initially installed PyTorch.
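
The two checks above can be combined into one small script. This is a sketch, not part of bmtrain: the helper names release_tuple and torch_is_supported are hypothetical, and the (2, 0, 0) upper bound simply encodes the "less than 2.0.0" requirement stated in this comment.

```python
import re


def release_tuple(version):
    """Turn a string such as '1.13.1+cu117' into (1, 13, 1).

    Local suffixes like '+cu117' and pre-release tags are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def torch_is_supported(version, upper_bound=(2, 0, 0)):
    """True when the release is strictly below the unsupported bound."""
    return release_tuple(version) < upper_bound


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        print("torch version:   ", torch.__version__)
        print("compiled for CUDA:", torch.version.cuda)
        print("below 2.0.0:     ", torch_is_supported(torch.__version__))
```

Parsing the version by hand keeps the sketch free of extra dependencies; if the packaging library is available, its Version class is a more robust way to compare releases.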

yyq commented 1 year ago

I tried downgrading: torch.version.cuda is now 11.7 and torch.__version__ is 1.13.1+cu117, but I still get the same error.

MayDomine commented 1 year ago

torch.version.cuda=11.7 and torch.__version__=1.13.1+cu117 only tell you that the CUDA version used to compile torch is 11.7. You also need to make sure that the installed CUDA Toolkit version matches it. Use nvcc --version to check the CUDA Toolkit version. Note that nvidia-smi reports the highest CUDA version the driver supports, which can differ from the installed toolkit.
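
The comparison described here can be scripted. The sketch below parses the "release X.Y" field from nvcc --version output and compares it with the string that torch.version.cuda returns. The helper names are hypothetical, and the sample text in the usage section is only an illustration of the usual nvcc output format.

```python
import re
import shutil
import subprocess


def nvcc_release(nvcc_output):
    """Extract 'X.Y' from the 'release X.Y' field of nvcc output, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def toolkit_matches_torch(torch_cuda, nvcc_output):
    """Compare torch.version.cuda (for example '11.7') with the toolkit release.

    torch.version.cuda is None on CPU-only builds, which never match.
    """
    return torch_cuda is not None and nvcc_release(nvcc_output) == torch_cuda


def read_nvcc_version():
    """Run `nvcc --version` if nvcc is on PATH; return its stdout or None."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    sample = "Cuda compilation tools, release 11.7, V11.7.64"
    print(nvcc_release(sample))
    print(toolkit_matches_torch("11.7", sample))
```

In a real environment, pass read_nvcc_version() and torch.version.cuda to toolkit_matches_torch. A None from read_nvcc_version means nvcc is not on PATH, which is itself worth investigating.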

MathamPollard commented 1 year ago

cuda version: 11.3
torch version: 1.12.1
print(torch.version.cuda): 11.3
print(torch.cuda.is_available()): True
!python -c "import torch;print(torch.cuda.nccl.version())" returns (2, 10, 3)

Still the same error. #26

MayDomine commented 1 year ago

Please make sure you have tried pip install bmtrain --no-cache-dir.


diaojunxian commented 1 year ago

@MayDomine Hi, I get the same error in my server environment:

torch == 1.13.1+cu117

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0

LLMChild commented 1 year ago

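I have tested this environment and it does not produce the error. Please check the CUDA runtime path, whether pip installed from its cache, whether the local NCCL version conflicts, and so on.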

diaojunxian commented 1 year ago

python -c "import torch;print(torch.cuda.nccl.version())"
Output: (2, 14, 3)

locate nccl| grep "libnccl.so" | tail -n1 | sed -r 's/^.*\.so\.//'
Output: 2

When I train with transformers, I see:

CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.7/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/.conda/envs/3.9/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...

@Fword4u Hi, this is what I found when checking my environment. I really can't see where the configuration conflicts.
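
The locate pipeline above prints only the text after the last ".so.", so an output of 2 identifies a libnccl.so.2 entry rather than a full version number. A rough sketch of a more informative check follows: it resolves symlinks and parses the full version from the file name, so it can be compared with the tuple from torch.cuda.nccl.version(). The helper names and the default search patterns are assumptions, so adjust them to wherever NCCL lives on your machine.

```python
import glob
import os
import re


def nccl_version_from_path(path):
    """Parse '/x/libnccl.so.2.14.3' into (2, 14, 3).

    Returns None when the file name carries no version suffix.
    """
    match = re.search(r"libnccl\.so\.(\d+(?:\.\d+)*)$", path)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def find_system_nccl(patterns=("/usr/lib/**/libnccl.so*",
                               "/usr/local/cuda*/**/libnccl.so*")):
    """Map each real libnccl file found to its parsed version (or None)."""
    found = {}
    for pattern in patterns:
        for path in glob.glob(pattern, recursive=True):
            # Follow symlinks such as libnccl.so.2 to the versioned file.
            real = os.path.realpath(path)
            found[real] = nccl_version_from_path(real)
    return found


if __name__ == "__main__":
    for path, version in find_system_nccl().items():
        print(path, version)
```

If more than one distinct version shows up, or the system version differs from what torch reports, that is the kind of local NCCL conflict mentioned earlier in the thread.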

diaojunxian commented 1 year ago

After running pip install bmtrain --no-cache-dir the error is gone. I would like to know why.
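
One plausible explanation, which is an assumption rather than something confirmed in this thread: pip keeps wheels it has built locally in its cache, so a bmtrain extension compiled earlier against a different torch, CUDA, or NCCL combination may be reused after the environment changes. The --no-cache-dir option makes pip skip that cache and build the package again. A minimal sketch of scripting such a reinstall is shown below; the helper name pip_reinstall_command is hypothetical, and adding --force-reinstall is an extra precaution that was not suggested in the thread.

```python
import subprocess
import sys


def pip_reinstall_command(package, python=sys.executable):
    """Build a pip command that ignores the cache and reinstalls `package`.

    Invoking pip through the given interpreter ensures the package lands in
    the same environment that will later import it.
    """
    return [
        python, "-m", "pip", "install",
        "--no-cache-dir",     # do not reuse previously built wheels
        "--force-reinstall",  # replace the copy that is already installed
        package,
    ]


def reinstall(package):
    """Run the reinstall and raise CalledProcessError if pip fails."""
    subprocess.run(pip_reinstall_command(package), check=True)


if __name__ == "__main__":
    # Printed rather than executed, so the sketch has no side effects.
    print(" ".join(pip_reinstall_command("bmtrain")))
```

Calling reinstall("bmtrain") from the notebook kernel that raised the ImportError, then restarting the kernel, keeps the rebuilt extension and the running torch in the same environment.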

OpenBMB / CPM-Bee

import error, undefined symbol: ncclBroadcast #18
