OpenBMB / BMCook

Model Compression for Big Models
Apache License 2.0

ImportError: /home/miniconda3/envs/BMCook/lib/python3.10/site-packages/bmtrain/nccl/_C.cpython-310-x86_64-linux-gnu.so: undefined symbol: ncclBroadcast ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 17198) of binary: /home/miniconda3/envs/BMCook/bin/python #23

Open wln20 opened 1 year ago

wln20 commented 1 year ago

Hi, I encountered the error described in this issue's title while trying to run the GPT-2 example. Here is my command:

export CUDA_VISIBLE_DEVICES=7
torchrun --nnodes=1 --nproc_per_node=1 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost ./gpt2_test.py \
    --model gpt2-base \
    --save-dir results/gpt2-prune \
    --data-path ... \
    --cook-config configs/gpt2-prune.json \

It seems that this is an error within the package bmtrain, so could you help figure out what happened or how to avoid it? Thanks a lot!

gongbaitao commented 1 year ago

Sorry for the delay! This is probably a CUDA version mismatch, so please check that the CUDA toolkit on your machine matches the version your PyTorch and bmtrain builds expect. Generally, CUDA 11 works normally.
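
For reference, a minimal way to compare the versions involved (these commands are only a diagnostic sketch, not an official BMCook/bmtrain procedure):

# CUDA toolkit visible on the machine (used when bmtrain compiles its extensions)
nvcc --version
# CUDA version PyTorch was built against
python -c "import torch; print(torch.version.cuda)"
# NCCL version bundled with PyTorch
python -c "import torch; print(torch.cuda.nccl.version())"

If the toolkit and PyTorch versions disagree, reinstalling bmtrain so its _C extension is rebuilt against the local toolkit (for example with pip's --force-reinstall and --no-binary options, assuming a source build is possible in your environment) is one plausible way to clear undefined NCCL symbols such as ncclBroadcast.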

sjcfr commented 1 year ago

My CUDA version is 11.7 and I'm still hitting this issue. Why insist on using this annoying package bmtrain?

diaojunxian commented 1 year ago

> My CUDA version is 11.7 and I'm still hitting this issue. Why insist on using this annoying package bmtrain?

I also encountered this problem (see https://github.com/OpenBMB/CPM-Bee/issues/18), and it could not be resolved.