SJTU-IPADS / PowerInfer

High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
MIT License
7.9k stars, 406 forks

CUDA cannot be found on an A100-80G #182

Open bulaikexiansheng opened 5 months ago

bulaikexiansheng commented 5 months ago

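Hello, I am trying to reproduce PowerInfer on an A100-80G machine, but I ran into the error below. It looks as if the GPU on the machine is not being detected?

CUDA version on the machine: 12.4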

(base) turbo@sma100-02:/home/turbo/projects/PowerInfer$ ./build/bin/main -m /home/turbo/models/ReluLLaMA-70B/llama-70b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time" --ignore-eos
Log start
main: build = 1578 (906830b)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1713950721
llama_model_loader: loaded meta data with 23 key-value pairs and 883 tensors from /home/turbo/models/ReluLLaMA-70B/llama-70b-relu.q4.powerinfer.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor 0: token_embd.weight q4_0 [ 8192, 32000, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_q.weight q4_0 [ 8192, 8192, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.attn_k.weight q4_0 [ 8192, 1024, 1, 1 ]
...
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
...
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q4_0: 722 tensors
llama_model_load: PowerInfer model loaded. Sparse inference will be used.
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = mostly Q4_0
llm_load_print_meta: model params = 74.98 B
llm_load_print_meta: model size = 39.28 GiB (4.50 BPW)
llm_load_print_meta: general.name = nvme
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: sparse_pred_threshold = 0.00
error loading model: CUDA is not loaded
llama_load_model_from_file_with_context: failed to load model
llama_init_from_gpt_params: error: failed to load model '/home/turbo/models/ReluLLaMA-70B/llama-70b-relu.q4.powerinfer.gguf'
main: error: unable to load model

The output from my build configuration step was:

(base) turbo@sma100-02:/home/turbo/projects/PowerInfer$ cmake -S . -B build -DLLAMA_CUBLAS=ON
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Found CUDAToolkit: /usr/local/cuda/include (found version "12.4.99")
-- cuBLAS found
-- The CUDA compiler identification is NVIDIA 11.5.119
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Using CUDA architectures: 52;61;70
GNU ld (GNU Binutils for Ubuntu) 2.38
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- Configuring done
-- Generating done
-- Build files have been written to: /home/turbo/projects/PowerInfer/build


bulaikexiansheng commented 5 months ago

Sorry, the logs I posted look messy. The version below may be clearer:

(base) turbo@sma100-02:/home/turbo/projects/PowerInfer$ ./build/bin/main -m /home/turbo/models/ReluLLaMA-70B/llama-70b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time" --ignore-eos

Log start
main: build = 1578 (906830b)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1713951199
llama_model_loader: loaded meta data with 23 key-value pairs and 883 tensors from /home/turbo/models/ReluLLaMA-70B/llama-70b-relu.q4.powerinfer.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor 0: token_embd.weight q4_0 [ 8192, 32000, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_q.weight q4_0 [ 8192, 8192, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.attn_k.weight q4_0 [ 8192, 1024, 1, 1 ]
llama_model_loader: - tensor 3: blk.0.attn_v.weight q4_0 [ 8192, 1024, 1, 1 ]
llama_model_loader: - tensor 4: blk.0.attn_output.weight q4_0 [ 8192, 8192, 1, 1 ]
llama_model_loader: - tensor 5: blk.0.ffn_gate.weight q4_0 [ 8192, 28672, 1, 1 ]
llama_model_loader: - tensor 6: blk.0.ffn_up.weight q4_0 [ 8192, 28672, 1, 1 ]
llama_model_loader: - tensor 7: blk.0.ffn_down_t.weight q4_0 [ 8192, 28672, 1, 1 ]
llama_model_loader: - tensor 8: blk.0.attn_norm.weight f32 [ 8192, 1, 1, 1 ]
llama_model_loader: - tensor 9: blk.0.ffn_norm.weight f32 [ 8192, 1, 1, 1 ]
......
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: llama.context_length u32
llama_model_loader: - kv 3: llama.embedding_length u32
llama_model_loader: - kv 4: llama.block_count u32
llama_model_loader: - kv 5: llama.feed_forward_length u32
llama_model_loader: - kv 6: llama.rope.dimension_count u32
llama_model_loader: - kv 7: llama.attention.head_count u32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv 10: llama.rope.freq_base f32
llama_model_loader: - kv 11: general.file_type u32
llama_model_loader: - kv 12: tokenizer.ggml.model str
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr
llama_model_loader: - kv 14: tokenizer.ggml.scores arr
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool
llama_model_loader: - kv 22: general.quantization_version u32
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q4_0: 722 tensors
llama_model_load: PowerInfer model loaded. Sparse inference will be used.
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = mostly Q4_0
llm_load_print_meta: model params = 74.98 B
llm_load_print_meta: model size = 39.28 GiB (4.50 BPW)
llm_load_print_meta: general.name = nvme
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: sparse_pred_threshold = 0.00
error loading model: CUDA is not loaded
llama_load_model_from_file_with_context: failed to load model
llama_init_from_gpt_params: error: failed to load model '/home/turbo/models/ReluLLaMA-70B/llama-70b-relu.q4.powerinfer.gguf'
main: error: unable to load model

(base) turbo@sma100-02:/home/turbo/projects/PowerInfer$ cmake -S . -B build -DLLAMA_CUBLAS=ON

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Found CUDAToolkit: /usr/local/cuda/include (found version "12.4.99")
-- cuBLAS found
-- The CUDA compiler identification is NVIDIA 11.5.119
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Using CUDA architectures: 52;61;70
GNU ld (GNU Binutils for Ubuntu) 2.38
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- Configuring done
-- Generating done
-- Build files have been written to: /home/turbo/projects/PowerInfer/build
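
For readers skimming the configure output above, the CUDA-related lines can be pulled out with a short script. This is only an illustrative sketch: the function name `summarize_cuda_config` and the regular expressions are assumptions made for this example, not part of PowerInfer or CMake. The excerpt is copied from the log above, where the toolkit is reported as 12.4.99 while the CUDA compiler is reported as 11.5.119. Whether that difference has anything to do with the runtime error is not established in this thread.

```python
import re

# Excerpt of the CMake configure output quoted above.
CMAKE_LOG = """\
-- Found CUDAToolkit: /usr/local/cuda/include (found version "12.4.99")
-- cuBLAS found
-- The CUDA compiler identification is NVIDIA 11.5.119
-- Check for working CUDA compiler: /usr/bin/nvcc - skipped
-- Using CUDA architectures: 52;61;70
"""


def summarize_cuda_config(log: str) -> dict:
    """Extract CUDA-related facts from CMake configure output (hypothetical helper)."""
    patterns = {
        "toolkit_version": r'Found CUDAToolkit: .*\(found version "([^"]+)"\)',
        "compiler_version": r"The CUDA compiler identification is NVIDIA (\S+)",
        "compiler_path": r"Check for working CUDA compiler: (\S+)",
        "architectures": r"Using CUDA architectures: (\S+)",
    }
    summary = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, log)
        # Missing lines are recorded as None rather than raising.
        summary[key] = match.group(1) if match else None
    return summary


summary = summarize_cuda_config(CMAKE_LOG)
for key, value in summary.items():
    print(f"{key}: {value}")

# Compare major versions only; this flags a difference, it does not diagnose it.
toolkit_major = summary["toolkit_version"].split(".")[0]
compiler_major = summary["compiler_version"].split(".")[0]
if toolkit_major != compiler_major:
    print("note: CUDA toolkit and nvcc report different major versions")
```

The script only reports what CMake printed; it does not query the driver or the GPU.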

hodlen commented 5 months ago

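The CMake output shows that your environment has a CUDA build toolchain, but at run time the program fails with "CUDA is not loaded". This can happen when the driver has not loaded properly. Try running nvidia-smi in the same environment to see whether the GPU is detected. If that also reports an error, you can be sure the problem is the driver.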
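
The nvidia-smi check described above can also be scripted, for example to run it inside the same container or conda environment as the failing binary. The following is a minimal sketch, not part of PowerInfer: the helper name `check_gpu_visible` is made up for this example, and it assumes `nvidia-smi` is on the PATH and that its `-L` option (list GPUs) is available.

```python
import shutil
import subprocess


def check_gpu_visible(command: str = "nvidia-smi") -> tuple[bool, str]:
    """Return (ok, detail): whether the driver tool can list GPUs (hypothetical helper)."""
    path = shutil.which(command)
    if path is None:
        return False, f"{command} not found on PATH"
    try:
        # `nvidia-smi -L` lists the GPUs the driver can see, one per line.
        result = subprocess.run(
            [path, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, f"could not run {command}: {exc}"
    if result.returncode != 0:
        # A non-zero exit code usually comes with an error message from the tool.
        return False, (result.stderr or result.stdout).strip()
    return True, result.stdout.strip()


if __name__ == "__main__":
    ok, detail = check_gpu_visible()
    print("GPU visible to driver" if ok else "GPU NOT visible to driver")
    print(detail)
```

If the script reports that the GPU is not visible, the same failure should show up when running nvidia-smi by hand, which points at the driver rather than at PowerInfer.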

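If your GPU hardware is working and the driver is installed correctly, a transient problem like this can usually be resolved by restarting the machine or the container.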