WangRongsheng / XrayGLM

🩺 The first Chinese multimodal medical large model that can read and summarize chest X-rays.

Local setup problem #14

Closed xiaoli26 closed 1 year ago

xiaoli26 commented 1 year ago

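I set up XrayGLM successfully on Google Colab by following the tutorial on the repository home page. I then tried to set it up locally on WSL2 (Ubuntu 22.04), but running it fails.

First error:

lsj@DESKTOP-H1KB736:/mnt/c/Users/38561/xrayglm$ python cli_demo.py --from_pretrained checkpoints/checkpoints-XrayGLM-300 --prompt_zh '详细描述这张胸部X光片的诊断结果'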

===================================BUG REPORT=================================== Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so
/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /home/lsj/.conda/envs/xrayglm did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/cuda-11.6/lib64}')}
  warn(msg)
/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /usr/local/cuda-11.6/lib64} did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future: If you get CUDA error: invalid device function errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
CUDA SETUP: WARNING! libcuda.so not found! Do you have a CUDA driver installed? If you are on a cluster, make sure you are on a CUDA machine!
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: No GPU detected! Check your CUDA paths. Proceeding to load CPU-only library...
  warn(msg)
CUDA SETUP: Loading binary /home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
[2023-05-31 16:25:18,755] [WARNING] Failed to load bitsandbytes:No module named 'scipy'
[2023-05-31 16:25:18,763] [INFO] building FineTuneVisualGLMModel model ...
40901
[2023-05-31 16:25:18,845] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-05-31 16:25:18,846] [INFO] [RANK 0] You are using model-only mode. For torch.distributed users or loading model parallel models, set environment variables RANK, WORLD_SIZE and LOCAL_RANK.
Traceback (most recent call last):
  File "/mnt/c/Users/38561/xrayglm/cli_demo.py", line 104, in <module>
    main()
  File "/mnt/c/Users/38561/xrayglm/cli_demo.py", line 30, in main
    model, model_args = AutoModel.from_pretrained(
  File "/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/sat/model/base_model.py", line 282, in from_pretrained
    model = get_model(args, model_cls, **kwargs)
  File "/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/sat/model/base_model.py", line 305, in get_model
    model = model_cls(args, params_dtype=params_dtype, **kwargs)
  File "/mnt/c/Users/38561/xrayglm/finetune_XrayGLM.py", line 13, in __init__
    super().__init__(args, transformer=transformer, parallel_output=parallel_output, **kw_args)
  File "/mnt/c/Users/38561/xrayglm/model/visualglm.py", line 29, in __init__
    super().__init__(args, transformer=transformer, **kwargs)
  File "/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/sat/model/official/chatglm_model.py", line 170, in __init__
    super(ChatGLMModel, self).__init__(args, transformer=transformer, activation_func=gelu, **kwargs)
  File "/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/sat/model/base_model.py", line 88, in __init__
    self.transformer = BaseTransformer(
  File "/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/sat/model/transformer.py", line 427, in __init__
    [get_layer(layer_id) for layer_id in range(num_layers)])
  File "/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/sat/model/transformer.py", line 427, in <listcomp>
    [get_layer(layer_id) for layer_id in range(num_layers)])
  File "/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/sat/model/transformer.py", line 402, in get_layer
    return BaseTransformerLayer(
  File "/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/sat/model/transformer.py", line 313, in __init__
    self.mlp = MLP(
  File "/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/sat/model/transformer.py", line 189, in __init__
    self.dense_h_to_4h = ColumnParallelLinear(
  File "/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/sat/mpu/layers.py", line 219, in __init__
    self.weight = Parameter(torch.empty(self.output_size_per_partition,
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 12.00 GiB total capacity; 11.25 GiB already allocated; 0 bytes free; 11.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

It looks like the GPU ran out of memory, so I switched the model to --quant 4. Second error:

(xrayglm) lsj@DESKTOP-H1KB736:/mnt/c/Users/38561/xrayglm$ python cli_demo.py --quant 4 --from_pretrained checkpoints/checkpoints-XrayGLM-300 --prompt_zh '详细描述这张胸部X光片的诊断结果'

===================================BUG REPORT=================================== Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so
/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /home/lsj/.conda/envs/xrayglm did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/cuda-11.6/lib64}')}
  warn(msg)
/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /usr/local/cuda-11.6/lib64} did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future: If you get CUDA error: invalid device function errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
CUDA SETUP: WARNING! libcuda.so not found! Do you have a CUDA driver installed? If you are on a cluster, make sure you are on a CUDA machine!
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: No GPU detected! Check your CUDA paths. Proceeding to load CPU-only library...
  warn(msg)
CUDA SETUP: Loading binary /home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
[2023-05-31 16:32:29,588] [WARNING] Failed to load bitsandbytes:No module named 'scipy'
[2023-05-31 16:32:29,593] [INFO] building FineTuneVisualGLMModel model ...
42795
[2023-05-31 16:32:29,645] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-05-31 16:32:29,647] [INFO] [RANK 0] You are using model-only mode. For torch.distributed users or loading model parallel models, set environment variables RANK, WORLD_SIZE and LOCAL_RANK.
/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
replacing layer 0 with lora
replacing layer 14 with lora
[2023-05-31 16:32:55,759] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 7811237376
[2023-05-31 16:32:59,754] [INFO] [RANK 0] global rank 0 is loading checkpoint checkpoints/checkpoints-XrayGLM-300/300/mp_rank_00_model_states.pt
Killed

It looked like scipy was not installed:

pip install scipy
Collecting scipy
Downloading scipy-1.10.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34.5/34.5 MB 6.1 MB/s eta 0:00:00
Requirement already satisfied: numpy<1.27.0,>=1.19.5 in /home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages (from scipy) (1.24.3)
Installing collected packages: scipy
Successfully installed scipy-1.10.1
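The "No module named 'scipy'" warning only shows up once the demo is already running. A minimal sketch of a pre-flight check is given below; the `REQUIRED` list is an illustrative assumption, not something the repository defines (`sat` is the import name seen in the traceback paths for SwissArmyTransformer).

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical checklist for illustration; adjust it to your own setup.
REQUIRED = ["scipy", "bitsandbytes", "torch", "sat"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("missing modules:", missing or "none")
```

`importlib.util.find_spec` only locates a module; it does not import it, so this does not confirm that a package such as bitsandbytes can actually load its CUDA binary.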

Third error:

(xrayglm) lsj@DESKTOP-H1KB736:/mnt/c/Users/38561/xrayglm$ python cli_demo.py --quant 4 --from_pretrained checkpoints/checkpoints-XrayGLM-300 --prompt_zh '详细描述这张胸部X光片的诊断结果'

===================================BUG REPORT=================================== Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so
/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /home/lsj/.conda/envs/xrayglm did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/cuda-11.6/lib64}')}
  warn(msg)
/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /usr/local/cuda-11.6/lib64} did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future: If you get CUDA error: invalid device function errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
CUDA SETUP: WARNING! libcuda.so not found! Do you have a CUDA driver installed? If you are on a cluster, make sure you are on a CUDA machine!
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: No GPU detected! Check your CUDA paths. Proceeding to load CPU-only library...
  warn(msg)
CUDA SETUP: Loading binary /home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
[2023-05-31 16:36:46,280] [INFO] building FineTuneVisualGLMModel model ...
60615
[2023-05-31 16:36:46,285] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-05-31 16:36:46,287] [INFO] [RANK 0] You are using model-only mode. For torch.distributed users or loading model parallel models, set environment variables RANK, WORLD_SIZE and LOCAL_RANK.
/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
replacing layer 0 with lora
replacing layer 14 with lora
[2023-05-31 16:36:53,258] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 7811237376
[2023-05-31 16:36:53,886] [INFO] [RANK 0] global rank 0 is loading checkpoint checkpoints/checkpoints-XrayGLM-300/300/mp_rank_00_model_states.pt
Killed
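The parameter count in the log gives a quick sanity check on the memory numbers in this thread. The sketch below is back-of-the-envelope arithmetic only. It assumes the weights are stored at a uniform bit width and ignores activations, the CUDA context, and any parts of the model that `--quant 4` leaves unquantized, so real usage will be higher.

```python
GIB = 1024 ** 3


def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate memory needed just to hold the weights, in GiB."""
    return num_params * bits_per_param / 8 / GIB


# Copied from the "number of parameters on model parallel rank 0" log line.
NUM_PARAMS = 7_811_237_376
# "12.00 GiB total capacity" from the first run's out-of-memory message.
GPU_CAPACITY_GIB = 12.0

if __name__ == "__main__":
    for bits in (16, 8, 4):
        need = weight_memory_gib(NUM_PARAMS, bits)
        verdict = "fits" if need < GPU_CAPACITY_GIB else "does not fit"
        print(f"{bits:>2}-bit weights: {need:5.2f} GiB -> {verdict} in {GPU_CAPACITY_GIB} GiB")
```

At 16 bits per parameter the weights alone come to roughly 14.5 GiB, which is consistent with the first run hitting an out-of-memory error on a 12 GiB card.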

nvcc -V does show CUDA, so this looked like a bitsandbytes problem. I patched it following https://blog.csdn.net/anycall201/article/details/129930919, but the process was still killed in the end:

(xrayglm) lsj@DESKTOP-H1KB736:/mnt/c/Users/38561/XrayGLM$ python cli_demo.py --quant 4 --from_pretrained checkpoints/checkpoints-XrayGLM-300 --prompt_zh '详细描述这张胸部X光片的诊断结果'

===================================BUG REPORT=================================== Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda116.so
CUDA SETUP: Loading binary /home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda116.so...
[2023-05-31 16:48:33,857] [INFO] building FineTuneVisualGLMModel model ...
60827
[2023-05-31 16:48:33,862] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-05-31 16:48:33,864] [INFO] [RANK 0] You are using model-only mode. For torch.distributed users or loading model parallel models, set environment variables RANK, WORLD_SIZE and LOCAL_RANK.
/home/lsj/.conda/envs/xrayglm/lib/python3.9/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
replacing layer 0 with lora
replacing layer 14 with lora
[2023-05-31 16:48:40,797] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 7811237376
[2023-05-31 16:48:42,470] [INFO] [RANK 0] global rank 0 is loading checkpoint checkpoints/checkpoints-XrayGLM-300/300/mp_rank_00_model_states.pt
Killed
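A bare "Killed" with no Python traceback usually means the process received SIGKILL from outside rather than raising an exception. One way to confirm this is to launch the command from a small wrapper and inspect the return code. The sketch below is POSIX-only, and `run_and_explain` is a hypothetical helper name. The demo child simply kills itself to stand in for the real command.

```python
import signal
import subprocess
import sys


def run_and_explain(cmd):
    """Run cmd and return (returncode, short explanation of how it ended)."""
    code = subprocess.run(cmd).returncode
    if code < 0:
        # On POSIX a negative return code means "terminated by signal -code".
        sig = signal.Signals(-code)
        note = f"terminated by {sig.name}"
        if sig == signal.SIGKILL:
            note += " (on Linux often the kernel OOM killer; check `dmesg`)"
        return code, note
    if code == 0:
        return code, "exited normally"
    return code, f"exited with status {code}"


if __name__ == "__main__":
    # Stand-in child process that sends SIGKILL to itself.
    demo = [sys.executable, "-c",
            "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
    print(run_and_explain(demo))
```

SIGKILL can come from sources other than the OOM killer, so the kernel log remains the place to verify the cause.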

Any help would be appreciated, thanks!

WangRongsheng commented 1 year ago

Could you share your GPU and environment details? It works without problems in my tests.
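For reports like this, a short script can gather the basics in one go. The following is a minimal sketch, not a tool from this repository. `collect_env_info` is a hypothetical helper, and it reports PyTorch details only if `torch` is importable.

```python
import platform
import sys


def collect_env_info():
    """Gather interpreter, OS and (if installed) PyTorch/CUDA details."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    try:
        import torch
    except ImportError:
        info["torch"] = "not installed"
        return info
    info["torch"] = torch.__version__
    info["torch_cuda_build"] = torch.version.cuda  # None for CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        props = torch.cuda.get_device_properties(0)
        info["gpu"] = props.name
        info["gpu_total_gib"] = round(props.total_memory / 1024 ** 3, 2)
    return info


if __name__ == "__main__":
    for key, value in collect_env_info().items():
        print(f"{key}: {value}")
```

The output complements `nvidia-smi`, `nvcc -V` and `pip list`, since it shows what the Python process itself can see.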

xiaoli26 commented 1 year ago

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.43.02              Driver Version: 535.98       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3080 Ti     On  | 00000000:02:00.0  On |                  N/A |
| 30%   35C    P8              32W / 350W |    850MiB / 12288MiB |     26%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

Package Version
------- -------
aiofiles 23.1.0
aiohttp 3.8.4
aiosignal 1.3.1
altair 5.0.1
anyio 3.7.0
async-timeout 4.0.2
attrs 23.1.0
bitsandbytes 0.39.0
certifi 2023.5.7
charset-normalizer 3.1.0
click 8.1.3
cmake 3.26.3
contourpy 1.0.7
cpm-kernels 1.0.11
cycler 0.11.0
datasets 2.12.0
deepspeed 0.9.2
dill 0.3.6
einops 0.6.1
exceptiongroup 1.1.1
fastapi 0.95.2
ffmpy 0.3.0
filelock 3.12.0
fonttools 4.39.4
frozenlist 1.3.3
fsspec 2023.5.0
gradio 3.33.0
gradio_client 0.2.5
h11 0.14.0
hjson 3.1.0
httpcore 0.17.2
httpx 0.24.1
huggingface-hub 0.15.1
idna 3.4
Jinja2 3.1.2
jsonschema 4.17.3
kiwisolver 1.4.4
latex2mathml 3.76.0
linkify-it-py 2.0.2
lit 16.0.5
Markdown 3.4.3
markdown-it-py 2.2.0
MarkupSafe 2.1.2
matplotlib 3.7.1
mdit-py-plugins 0.3.3
mdtex2html 1.2.0
mdurl 0.1.2
mpmath 1.3.0
multidict 6.0.4
multiprocess 0.70.14
networkx 3.1
ninja 1.11.1
numpy 1.24.3
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.2.10.91
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusparse-cu11 11.7.4.91
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
orjson 3.9.0
packaging 23.1
pandas 2.0.2
Pillow 9.5.0
pip 23.0.1
protobuf 3.20.3
psutil 5.9.5
py-cpuinfo 9.0.0
pyarrow 12.0.0
pydantic 1.10.8
pydub 0.25.1
Pygments 2.15.1
pyparsing 3.0.9
PyQt5 5.15.9
PyQt5-Qt5 5.15.2
PyQtWebEngine 5.15.6
PyQtWebEngine-Qt5 5.15.2
pyrsistent 0.19.3
python-dateutil 2.8.2
python-multipart 0.0.6
pytz 2023.3
PyYAML 6.0
regex 2023.5.5
requests 2.31.0
responses 0.18.0
scipy 1.10.1
semantic-version 2.10.0
sentencepiece 0.1.99
setuptools 67.8.0
six 1.16.0
sniffio 1.3.0
starlette 0.27.0
SwissArmyTransformer 0.3.7
sympy 1.12
tensorboardX 2.6
tokenizers 0.13.3
toolz 0.12.0
torch 2.0.1+cu118
torchaudio 2.0.2+cu118
torchvision 0.15.2+cu118
tqdm 4.65.0
transformers 4.29.2
triton 2.0.0
typing_extensions 4.6.2
tzdata 2023.3
uc-micro-py 1.0.2
urllib3 2.0.2
uvicorn 0.22.0
websockets 11.0.3
wheel 0.38.4
xxhash 3.2.0
yarl 1.9.2

(xrayglm) lsj@DESKTOP-H1KB736:/mnt/c/Users/38561/xrayglm$ python cli_demo.py --quant 4 --from_pretrained checkpoints/checkpoints-XrayGLM-300 --prompt_zh '详细描述这张胸部X光片的诊断结果'

===================================BUG REPORT=================================== Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /home/lsj/.conda/envs/xrayglm/lib/python3.10/site-packages/bitsandbytes-0.39.0-py3.10.egg/bitsandbytes/libbitsandbytes_cuda118.so
/home/lsj/.conda/envs/xrayglm/lib/python3.10/site-packages/bitsandbytes-0.39.0-py3.10.egg/bitsandbytes/cuda_setup/main.py:149: UserWarning: /home/lsj/.conda/envs/xrayglm did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.8/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/lsj/.conda/envs/xrayglm/lib/python3.10/site-packages/bitsandbytes-0.39.0-py3.10.egg/bitsandbytes/libbitsandbytes_cuda118.so...
[2023-06-02 21:00:20,419] [INFO] building FineTuneVisualGLMModel model ...
[2023-06-02 21:00:20,420] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-06-02 21:00:20,421] [INFO] [RANK 0] You are using model-only mode. For torch.distributed users or loading model parallel models, set environment variables RANK, WORLD_SIZE and LOCAL_RANK.
/home/lsj/.conda/envs/xrayglm/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
replacing layer 0 attention with lora
replacing layer 14 attention with lora
[2023-06-02 21:00:30,567] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 7811237376
[2023-06-02 21:00:37,716] [INFO] [RANK 0] global rank 0 is loading checkpoint checkpoints/checkpoints-XrayGLM-300/300/mp_rank_00_model_states.pt
Killed

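This time I added export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/wsl/lib/ and there are no longer any errors, but the process is still killed while loading the model. I don't know whether that is because WSL2 is not native Linux. On Windows, installing according to the tutorial still reports that deepspeed is missing. So far I have only managed to set it up on Colab.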
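To see whether a change to `LD_LIBRARY_PATH` actually exposes the driver library, the directories on it can be scanned for `libcuda.so*`. The sketch below is a rough check under one assumption: it looks only at `LD_LIBRARY_PATH`, whereas the dynamic linker also consults other locations such as the ldconfig cache. `find_shared_lib` is a hypothetical helper name.

```python
import os
from pathlib import Path


def find_shared_lib(prefix, search_path):
    """Return files whose names start with `prefix` in a PATH-style directory list."""
    hits = []
    for entry in search_path.split(os.pathsep):
        directory = Path(entry)
        # Skip empty entries and directories that do not exist.
        if entry and directory.is_dir():
            hits.extend(sorted(str(p) for p in directory.glob(prefix + "*")))
    return hits


if __name__ == "__main__":
    search_path = os.environ.get("LD_LIBRARY_PATH", "")
    for lib in ("libcuda.so", "libcudart.so"):
        found = find_shared_lib(lib, search_path)
        print(lib, "->", found or "not found on LD_LIBRARY_PATH")
```

A non-existent entry, like the `/usr/local/cuda-11.6/lib64}` path with a stray brace in the earlier warnings, is simply skipped here.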

xiaoli26 commented 1 year ago

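After some searching, it seems the process was killed because Linux ran out of virtual memory. This machine only has 16 GB of RAM, so I'll try again on a server in a few days. T T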
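That explanation can be checked from inside the Linux environment by comparing RAM plus swap against a rough size for the checkpoint. The sketch below reads `/proc/meminfo`. It assumes the weights are first materialised in host memory at 16 bits per parameter, which this thread does not confirm. Note also that a WSL2 VM typically sees only part of the host's RAM by default.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value} (sizes are in kB)."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def ram_plus_swap_gib(meminfo):
    """Total RAM plus total swap, converted from kB to GiB."""
    total_kib = meminfo.get("MemTotal", 0) + meminfo.get("SwapTotal", 0)
    return total_kib / 1024 ** 2


if __name__ == "__main__":
    with open("/proc/meminfo") as fh:
        info = parse_meminfo(fh.read())
    # Assumed fp16 size of the 7,811,237,376 parameters reported in the logs.
    needed_gib = 7_811_237_376 * 2 / 1024 ** 3
    print(f"RAM + swap visible to Linux: {ram_plus_swap_gib(info):.1f} GiB")
    print(f"Rough fp16 weight size:      {needed_gib:.1f} GiB")
```

If the visible total is close to or below the estimate, an out-of-memory kill during checkpoint loading is plausible.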