Isotr0py / SakuraLLM-Notebooks

Notebooks to run SakuraLLM on colab/kaggle

Error when using a P100 on Kaggle #8

Open · wsndshx opened this issue 3 months ago

wsndshx commented 3 months ago

I had some spare time and wanted to try out the P100's inference speed.

An error occurs while loading the model:

(…)kura-14b-qwen2beta-v0.9-iq4_xs_ver2.gguf: 100%
7.85G/7.85G [00:39<00:00, 227MB/s]
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]
2024-04-25 03:46:58 81b669f4250b __main__[143] WARNING Auth is disabled!
2024-04-25 03:46:58 81b669f4250b __main__[143] INFO Current server config: Server(listen: 127.0.0.1:5000, auth: None:None)
2024-04-25 03:46:58 81b669f4250b __main__[143] INFO Current model config: SakuraModelConfig(model_name_or_path='./models/sakura-14b-qwen2beta-v0.9-iq4_xs_ver2.gguf', use_gptq_model=False, use_awq_model=False, trust_remote_code=True, text_length=512, llama=False, llama_cpp=True, use_gpu=True, n_gpu_layers=0, vllm=False, enforce_eager=False, tensor_parallel_size=1, gpu_memory_utilization=0.9, ollama=False, model_name=None, model_quant=None, model_version='0.9')
2024-04-25 03:47:01.878511: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-25 03:47:01.878632: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-25 03:47:02.012167: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-25 03:47:11 81b669f4250b numexpr.utils[143] INFO NumExpr defaulting to 4 threads.
2024-04-25 03:47:11 81b669f4250b utils.model[143] INFO loading model ...
llama_model_loader: loaded meta data with 20 key-value pairs and 483 tensors from ./models/sakura-14b-qwen2beta-v0.9-iq4_xs_ver2.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.name str              = 0406_qwen2beta_14b_base_1024_pted_new_v1
llama_model_loader: - kv   2:                          qwen2.block_count u32              = 40
llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 5120
llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 13696
llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 40
llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 40
llama_model_loader: - kv   8:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv   9:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 30
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  13:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  14:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  15:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  16:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  18:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  201 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq4_nl:   40 tensors
llama_model_loader: - type iq4_xs:  241 tensors
llm_load_vocab: special tokens definition check successful ( 421/152064 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 5120
llm_load_print_meta: n_embd_v_gqa     = 5120
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 13696
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = IQ4_XS - 4.25 bpw
llm_load_print_meta: model params     = 14.17 B
llm_load_print_meta: model size       = 7.30 GiB (4.43 BPW) 
llm_load_print_meta: general.name     = 0406_qwen2beta_14b_base_1024_pted_new_v1
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
llm_load_tensors: ggml ctx size =    0.46 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors:        CPU buffer size =   394.45 MiB
llm_load_tensors:      CUDA0 buffer size =  7084.89 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1600.00 MiB
llama_new_context_with_model: KV self size  = 1600.00 MiB, K (f16):  800.00 MiB, V (f16):  800.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   307.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    14.01 MiB
llama_new_context_with_model: graph nodes  = 1406
llama_new_context_with_model: graph splits = 2
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
Model metadata: {'tokenizer.ggml.bos_token_id': '151643', 'general.architecture': 'qwen2', 'general.name': '0406_qwen2beta_14b_base_1024_pted_new_v1', 'qwen2.block_count': '40', 'qwen2.context_length': '32768', 'tokenizer.chat_template': "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}", 'qwen2.attention.head_count_kv': '40', 'tokenizer.ggml.padding_token_id': '151643', 'qwen2.embedding_length': '5120', 'qwen2.attention.layer_norm_rms_epsilon': '0.000001', 'qwen2.attention.head_count': '40', 'tokenizer.ggml.eos_token_id': '151645', 'qwen2.rope.freq_base': '1000000.000000', 'general.file_type': '30', 'general.quantization_version': '2', 'qwen2.feed_forward_length': '13696', 'tokenizer.ggml.model': 'gpt2'}
Using gguf chat template: {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
You are a helpful assistant<|im_end|>
' }}{% endif %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}
Using chat eos_token: <|im_end|>
Using chat bos_token: <|endoftext|>
GGML_ASSERT: /home/runner/work/llama-cpp-python/llama-cpp-python/vendor/llama.cpp/ggml-cuda/dmmv.cu:804: false

The code used:

# Leave ngrokToken empty to use localtunnel (or pinggy, if use_pinggy is True) for tunneling
ngrokToken = ""
use_pinggy = True
MODEL = "sakura-14b-qwen2beta-v0.9-iq4_xs_ver2"

from huggingface_hub import hf_hub_download
from pathlib import Path

if ngrokToken:
    from pyngrok import conf, ngrok
    conf.get_default().auth_token = ngrokToken
    conf.get_default().monitor_thread = False
    ssh_tunnels = ngrok.get_tunnels(conf.get_default())
    if len(ssh_tunnels) == 0:
        ssh_tunnel = ngrok.connect(5000)
        print('address:'+ssh_tunnel.public_url)
    else:
        print('address:'+ssh_tunnels[0].public_url)
elif use_pinggy:
    import subprocess
    import threading
    def start_pinggy(port):
        cmd = f"ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -p 443 -R0:localhost:{port} a.pinggy.io"
        p = subprocess.Popen(cmd.split(" "), stdout=subprocess.PIPE)
        for line in p.stdout:
            print(line.decode(), end='')
    threading.Thread(target=start_pinggy, daemon=True, args=(5000,)).start()
else:
    import subprocess
    import threading
    def start_localtunnel(port):
        p = subprocess.Popen(["lt", "--port", f"{port}"], stdout=subprocess.PIPE)
        for line in p.stdout:
            print(line.decode(), end='')
    threading.Thread(target=start_localtunnel, daemon=True, args=(5000,)).start()

MODEL_PATH = f"./models/{MODEL}.gguf"
if not Path(MODEL_PATH).exists():
    hf_hub_download(repo_id="SakuraLLM/Sakura-14B-Qwen2beta-v0.9-GGUF", filename=f"{MODEL}.gguf", local_dir="models/")

!python server.py \
    --model_name_or_path $MODEL_PATH \
    --llama_cpp \
    --use_gpu \
    --model_version 0.9 \
    --trust_remote_code \
    --no-auth

Looks like I'll have to run it on 2× T4 instead.
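
For reference, a minimal sketch of loading the same GGUF across two T4s directly with llama-cpp-python (the tensor_split ratio and n_ctx below are assumptions, not values taken from server.py):

# Hypothetical sketch: load the model with llama-cpp-python on a 2x T4 instance,
# splitting the offloaded layers roughly evenly across both GPUs.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/sakura-14b-qwen2beta-v0.9-iq4_xs_ver2.gguf",
    n_gpu_layers=-1,          # offload all layers to GPU
    tensor_split=[0.5, 0.5],  # assumed even split across GPU 0 and GPU 1
    n_ctx=2048,               # matches the n_ctx reported in the log above
)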

Isotr0py commented 3 months ago

This may be a llama.cpp bug: imatrix-quantized models fail to run on the P100, but at least the k-quants (Q6_K) do work there. (Possibly the P100's compute capability is too low?)
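
If that is the case, one possible workaround is to download a k-quant build from the same repo instead of the IQ4_XS file. A minimal sketch, assuming a q6_k filename that should be verified against the repo's file listing:

from huggingface_hub import hf_hub_download

# Hypothetical workaround: fetch a k-quant (e.g. Q6_K) GGUF instead of IQ4_XS.
# The exact filename is an assumption -- check the repo's file list first.
MODEL = "sakura-14b-qwen2beta-v0.9-q6_k"
hf_hub_download(
    repo_id="SakuraLLM/Sakura-14B-Qwen2beta-v0.9-GGUF",
    filename=f"{MODEL}.gguf",
    local_dir="models/",
)

server.py can then be pointed at the downloaded file with the same flags as in the original cell.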

Isotr0py commented 3 months ago

At the moment it seems only the [mmvq kernel]() supports imatrix quants; the dmmv kernel that the P100 uses doesn't support them yet (a quick device-capability check is sketched after the snippet below):

    switch (src0->type) {
        case GGML_TYPE_Q4_0:
            dequantize_mul_mat_vec_q4_0_cuda(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_Q4_1:
            dequantize_mul_mat_vec_q4_1_cuda(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_Q5_0:
            dequantize_mul_mat_vec_q5_0_cuda(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_Q5_1:
            dequantize_mul_mat_vec_q5_1_cuda(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_Q8_0:
            dequantize_mul_mat_vec_q8_0_cuda(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_Q2_K:
            dequantize_mul_mat_vec_q2_K_cuda(src0_dd_i, src1_ddf_i, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_Q3_K:
            dequantize_mul_mat_vec_q3_K_cuda(src0_dd_i, src1_ddf_i, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_Q4_K:
            dequantize_mul_mat_vec_q4_K_cuda(src0_dd_i, src1_ddf_i, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_Q5_K:
            dequantize_mul_mat_vec_q5_K_cuda(src0_dd_i, src1_ddf_i, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_Q6_K:
            dequantize_mul_mat_vec_q6_K_cuda(src0_dd_i, src1_ddf_i, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_F16:
            convert_mul_mat_vec_f16_cuda(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
            break;
        default:
            GGML_ASSERT(false);
            break;
    }
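
A quick way to see which side of that split a GPU falls on is to print its compute capability: if the mmvq path does require DP4A (compute capability 6.1+), the P100 (6.0) would fall back to the dmmv kernel above, while the P40 (6.1) would not. A small sketch, assuming torch is available in the notebook:

import torch

# Print the compute capability of every visible GPU. The P100 reports (6, 0);
# assuming the mmvq path needs DP4A (>= 6.1), that would explain the fallback
# to dmmv and the GGML_ASSERT above.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"{torch.cuda.get_device_name(i)}: compute capability {major}.{minor}")
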
Isotr0py commented 3 months ago

👀 Same issue in llama.cpp; it may be a P100-specific problem, since the P40 works: