abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

[regression] embeddings working on 0.2.55 but not on 0.2.56 #1269

Closed thiswillbeyourgithub closed 5 months ago

thiswillbeyourgithub commented 6 months ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

Expected Behavior

As seen on 0.2.55:

Running the same command shown below returns an embedding.

Current Behavior

As seen on 0.2.56:

from pathlib import Path ; import llama_cpp ; llm = llama_cpp.Llama(model_path=Path("gemma-2b-q4_K_M.gguf").absolute().__str__(), embedding=True) ; llm.create_embedding("Hi")
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1, VMM: yes
llama_model_loader: loaded meta data with 21 key-value pairs and 164 tensors from /home/REDACTED/Desktop/REDACTED/gemma_gguf/gemma-2b-q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma
llama_model_loader: - kv   1:                               general.name str              = gemma-2b
llama_model_loader: - kv   2:                       gemma.context_length u32              = 8192
llama_model_loader: - kv   3:                          gemma.block_count u32              = 18
llama_model_loader: - kv   4:                     gemma.embedding_length u32              = 2048
llama_model_loader: - kv   5:                  gemma.feed_forward_length u32              = 16384
llama_model_loader: - kv   6:                 gemma.attention.head_count u32              = 8
llama_model_loader: - kv   7:              gemma.attention.head_count_kv u32              = 1
llama_model_loader: - kv   8:                 gemma.attention.key_length u32              = 256
llama_model_loader: - kv   9:               gemma.attention.value_length u32              = 256
llama_model_loader: - kv  10:     gemma.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  13:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  14:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  15:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,256128]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,256128]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,256128]  = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - kv  20:                          general.file_type u32              = 15
llama_model_loader: - type  f32:   37 tensors
llama_model_loader: - type q4_K:  108 tensors
llama_model_loader: - type q6_K:   19 tensors
llm_load_vocab: mismatch in special tokens definition ( 544/256128 vs 388/256128 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256128
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_head           = 8
llm_load_print_meta: n_head_kv        = 1
llm_load_print_meta: n_layer          = 18
llm_load_print_meta: n_rot            = 256
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 16384
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 2B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 2.51 B
llm_load_print_meta: model size       = 1.51 GiB (5.18 BPW)
llm_load_print_meta: general.name     = gemma-2b
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.06 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/19 layers to GPU
llm_load_tensors:        CPU buffer size =  1549.19 MiB
.........................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =     9.00 MiB
llama_new_context_with_model: KV self size  =    9.00 MiB, K (f16):    4.50 MiB, V (f16):    4.50 MiB
llama_new_context_with_model:  CUDA_Host input buffer size   =     6.01 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   504.25 MiB
llama_new_context_with_model: graph splits (measure): 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
Model metadata: {'general.file_type': '15', 'tokenizer.ggml.unknown_token_id': '3', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.eos_token_id': '1', 'general.architecture': 'gemma', 'gemma.feed_forward_length': '16384', 'gemma.attention.head_count': '8', 'general.name': 'gemma-2b', 'gemma.context_length': '8192', 'gemma.block_count': '18', 'gemma.embedding_length': '2048', 'gemma.attention.head_count_kv': '1', 'gemma.attention.key_length': '256', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'gemma.attention.value_length': '256', 'gemma.attention.layer_norm_rms_epsilon': '0.000001', 'tokenizer.ggml.bos_token_id': '2'}
Using fallback chat format: None
[1]    400884 segmentation fault (core dumped)  python3 -c "import IPython, sys; sys.exit(IPython.start_ipython())"

Environment and Context

The command I use to switch versions: CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python==0.2.55 --no-cache-dir

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         36 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
    CPU family:          6
    Model:               58
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            9
    CPU max MHz:         6300,0000
    CPU min MHz:         1600,0000
    BogoMIPS:            6984.38
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts vnmi md_clear flush_l1d
Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    1 MiB (4 instances)
  L3:                    8 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-7
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
  Mds:                   Mitigation; Clear CPU buffers; SMT vulnerable
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Unknown: No mitigations
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Vulnerable: No microcode
  Tsx async abort:       Not affected

Linux REDACTED-MS-7758 6.5.0-25-generic #25~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Feb 20 16:09:15 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

$ python3 --version : 3.10.12
$ make --version : 4.3
$ g++ --version : 11.4.0

Failure Information (for bugs)

The failure happens no matter what arguments I pass when loading the model, and it also happens with the non-quantized model. It does not happen when loading the model for text generation, but it does happen for embeddings.

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. Get gemma 2b gguf
  2. Make sure you have version 0.2.56
  3. ipython : from pathlib import Path ; import llama_cpp ; llm = llama_cpp.Llama(model_path=Path("gemma-2b-q4_K_M.gguf").absolute().__str__(), embedding=True) ; llm.create_embedding("Hi")
  4. Notice the dump

I confirm this crash does not happen when running ./embedding from llama.cpp itself at its latest commit, b3d978600f07f22e94f2e797f18a8b5f6df23c89.

I just need to use gemma for langchain embeddings, so sticking with 0.2.55 is fine for me; this is just a heads up for the devs :)

abetlen commented 6 months ago

@thiswillbeyourgithub thanks for reporting, it's related to #1263

The issue is that a null pointer is returned from the new get_embeddings_seq if the pooling type is not set; I've now set it to mean. @iamlemec does that sound correct?

iamlemec commented 6 months ago

I think the issue here is that gemma-2b is not an embedding model. This should work if you use something like bge-base-en-v1.5 (GGUF here). I think setting the default pooling to unspecified as in the recent commit is the right route. Ultimately, for the error message, you may want to say that the model doesn't support sequence embeddings, which will be the case when hparams.pooling_type is LLAMA_POOLING_TYPE_UNSPECIFIED.
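For reference, a minimal sketch of what this looks like with a dedicated embedding model such as bge-base-en-v1.5; the GGUF filename below is hypothetical and just stands in for a local conversion of that model:

```python
# Minimal sketch, assuming a local GGUF conversion of bge-base-en-v1.5
# (the filename is hypothetical).
import llama_cpp

llm = llama_cpp.Llama(
    model_path="bge-base-en-v1.5-f16.gguf",  # hypothetical path
    embedding=True,
)

result = llm.create_embedding("Hi")
vector = result["data"][0]["embedding"]  # one pooled vector per input string
print(len(vector))  # embedding dimension (768 for bge-base)
```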

Part of the problem is that the pooling layer is actually considered part of the model, not something that can be applied to arbitrary models ex post, though this could obviously change. So right now if you want to get embeddings from generative LLMs, you need to set LLAMA_POOLING_TYPE_NONE, use llama_get_embeddings_ith, and manually pool the token level embeddings however you'd like. We actually made an example that does this with GritLM, which is a dual use model that does both generation and embeddings (see examples/gritlm in llama.cpp).

abetlen commented 6 months ago

@iamlemec thanks, that makes sense. Would it then make sense to use pooling type unspecified by default, then check whether the result of get_embeddings_seq is null, and if it is, fall back to get_embeddings_ith?

iamlemec commented 6 months ago

Having unspecified as default makes sense. The issue with falling back to get_embeddings_ith is that it'll give you the embedding of the i-th token, not the sequence. So in that case, I think you either need to just say "this model doesn't do embeddings" or implement pooling on the Python side (basically first-token or mean pooling). Another option would be to give the user a way to just get the token-level embeddings and let them figure it out (which would be useful for ColBERT-style approaches).
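For anyone wanting to pool on the Python side, a rough sketch of what those two pooling strategies could look like, assuming you already have a list of per-token vectors (how to obtain them from llama-cpp-python is discussed later in this thread):

```python
# Rough sketch of Python-side pooling over token-level embeddings.
# `token_embeddings` is assumed to be a list of per-token vectors.
import numpy as np

def mean_pool(token_embeddings: list[list[float]]) -> list[float]:
    """Average the token-level vectors into one sequence embedding."""
    return np.mean(np.asarray(token_embeddings), axis=0).tolist()

def first_token_pool(token_embeddings: list[list[float]]) -> list[float]:
    """CLS-style pooling: take the first token's vector as the sequence embedding."""
    return list(token_embeddings[0])
```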

thiswillbeyourgithub commented 6 months ago

Thank you all. Can anyone tell me how I should proceed to get the embeddings of each token of a sentence? I could then at least do the pooling myself.

iamlemec commented 5 months ago

@thiswillbeyourgithub new update! With the latest code on main you can pass pooling_type=LLAMA_POOLING_TYPE_NONE to the constructor and it will then give you token level embeddings.
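A sketch of that usage, for anyone landing here later; this assumes the LLAMA_POOLING_TYPE_* constants are importable from the llama_cpp package and pools the token vectors with a simple mean, the model path being the gemma GGUF from this issue:

```python
# Sketch of the token-level embedding path described above.
import numpy as np
import llama_cpp

llm = llama_cpp.Llama(
    model_path="gemma-2b-q4_K_M.gguf",
    embedding=True,
    pooling_type=llama_cpp.LLAMA_POOLING_TYPE_NONE,  # no pooling: one vector per token
)

token_vectors = llm.embed("Hi")                    # list of per-token embeddings
sentence_vector = np.mean(token_vectors, axis=0)   # pool them yourself, e.g. mean pooling
print(len(token_vectors), sentence_vector.shape)
```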

thiswillbeyourgithub commented 5 months ago

Thank you very much!

zabiullahss commented 3 months ago

@thiswillbeyourgithub new update! With the latest code on main you can pass pooling_type=LLAMA_POOLING_TYPE_NONE to the constructor and it will then give you token level embeddings.

Can this be used with LlamaCppEmbeddings?