ggerganov / llama.cpp

LLM inference in C/C++

Question: How to access feature vector of the intermediate layer of network? #2047

Closed. sohta94 closed this issue 5 months ago.

sohta94 commented 1 year ago

Expected Behavior

I am interested in the difference between the feature vectors of the intermediate layers of the llama.cpp and PyTorch versions of the LLaMA model. For this purpose, I would like to know how I can get the feature vectors of an intermediate layer, similar to what torchvision.models.feature_extraction.create_feature_extractor or the register_forward_hook method provides in PyTorch.

Current Behavior

I browsed the C++ source code but could not figure out how to get the feature vectors.

SlyEcho commented 1 year ago

You can see an example of how to extract the vector in the embedding example, but it only extracts the output after the last layer, and only the vector for the last token.
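
Roughly, what the embedding example does boils down to something like this. This is only a sketch based on the llama.h API from around this time, not the exact example code, and the context has to be created with params.embedding = true for llama_get_embeddings() to return anything:

// sketch only: extract the final-layer embedding of the last evaluated token
#include "llama.h"
#include <cstdio>
#include <vector>

int main() {
    llama_context_params params = llama_context_default_params();
    params.embedding = true; // required for llama_get_embeddings()

    llama_context * ctx = llama_init_from_file("models/7B/ggml-model-q4_0.bin", params);
    if (ctx == NULL) return 1;

    std::vector<llama_token> tokens(64);
    const int n_tokens = llama_tokenize(ctx, "I hate you because ", tokens.data(), (int) tokens.size(), true);
    tokens.resize(n_tokens);

    llama_eval(ctx, tokens.data(), (int) tokens.size(), 0, /*n_threads=*/4);

    const int     n_embd = llama_n_embd(ctx);         // 4096 for 7B
    const float * embd   = llama_get_embeddings(ctx); // n_embd floats, last token only
    for (int i = 0; i < 8; i++) {
        printf("%f ", embd[i]);
    }
    printf("\n");

    llama_free(ctx);
    return 0;
}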

There is also my experiment #1472, where I extract the input to an arbitrary layer. It's a little more complex because it extracts the vector by multiplying it with a coefficient (like +1.0 or -1.0) and accumulating, and it also supports adding the vector back later during inference.

sohta94 commented 1 year ago

Thank you for your very informative comments. With reference to #1472, I cloned the steering branch and tried the steering options as follows.

$ ./main -m ./models/7B/ggml-model-q4_0.bin --seed 123 -n 64   --steering-add "Love"   --steering-sub "Hate"   --steering-source 4   --steering-layer 4   --steering-mul 5   --prompt "I hate you because "
main: build = 0 (unknown)
main: seed  = 123
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =   0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | 
main: steering: ('Love' - 'Hate') * 5.000000
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 64, n_keep = 0

 I hate you because 1) You are the best thing that has ever happened to me and 2) I'm going to be in your life for a long time.
Love this one so much!
I wish you had said "You are the only woman who has ever been in my life" instead of "You are
llama_print_timings:        load time = 60662.62 ms
llama_print_timings:      sample time =   134.56 ms /    64 runs   (    2.10 ms per token)
llama_print_timings: prompt eval time =  4729.86 ms /    12 tokens (  394.15 ms per token)
llama_print_timings:        eval time = 74006.17 ms /    63 runs   ( 1174.70 ms per token)
llama_print_timings:       total time = 139246.77 ms

I have additional questions about the steering branch.

Best regards.

SlyEcho commented 1 year ago

The .bin file is just a dump of the floating-point numbers; I think the code that writes it is in there somewhere but commented out. It can be read easily with NumPy.

The steering branch processes the same model as normal, and the layers are processed in a loop. The ggml model files only contain the weights; the model definition exists only in code, in llama.cpp, in the function llama_eval_internal(). As far as I understand, the weight names don't match the PyTorch models exactly; you can see in convert.py how they are mapped.

If you want to add some numbers at a specific layer, you have to put a condition inside the loop that checks the layer number, then add your numbers to inpL, inpSA, cur, or whatever tensor you need, using ggml operations. That means you also have to read your data into a ggml tensor at the beginning.
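
Very roughly, something like this inside the layer loop (my_layer and my_data are just placeholder names, not real llama.cpp variables; my_data would be a ggml tensor of shape n_embd x N that you filled with your numbers earlier):

for (int il = 0; il < n_layer; ++il) {
    if (il == my_layer) {
        // add your numbers to the layer input; inpSA or cur work the same way
        inpL = ggml_add(ctx0, inpL, my_data);
    }
    // ... rest of the layer as usual ...
}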

Anyway, this is a rough explanation.

sohta94 commented 1 year ago

Thank you for your explanation.

As you suggested, I found the lines that save the .bin file, uncommented them, and ran the program. Now I can get steering.bin and load it in Python with np.fromfile('steering.bin', dtype=np.float32).

>>> vec = np.fromfile('steering.bin', dtype=np.float32)
>>> vec
array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)
>>> vec.shape
(2097152,)
>>> vec.shape[0] / 512
4096.0

First question. Since the model is running with a context of 512 tokens (n_ctx = 512), does this mean that an array for 512 tokens is allocated in advance, and only the entries up to the current token length are used, with the rest filled with zeros, in the C++ implementation?
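
(For reference, this is a rough sketch of the layout I have in mind, assuming 512 rows of 4096 floats each:)

// rough check of my assumed layout: n_ctx (512) rows of n_embd (4096) floats,
// with unused rows left at zero
#include <cstdio>
#include <vector>

int main() {
    const size_t n_ctx = 512, n_embd = 4096;
    std::vector<float> buf(n_ctx * n_embd, 0.0f);

    FILE * f = std::fopen("steering.bin", "rb");
    if (f == NULL) { std::perror("steering.bin"); return 1; }
    const size_t n_read = std::fread(buf.data(), sizeof(float), buf.size(), f);
    std::fclose(f);
    std::printf("read %zu floats\n", n_read);

    for (size_t t = 0; t < n_ctx; ++t) {
        for (size_t i = 0; i < n_embd; ++i) {
            if (buf[t * n_embd + i] != 0.0f) {
                std::printf("token row %zu contains data\n", t);
                break;
            }
        }
    }
    return 0;
}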

Second question. Is it correct that --steering-layer 4 refers to the feature vector of layers.4 among the layers displayed during quantization, as shown below?

ubuntu@ubuntu:~/dlbr/llama.cpp-steering$ ./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0
main: build = 0 (unknown)
main: quantizing './models/7B/ggml-model-f16.bin' to './models/7B/ggml-model-q4_0.bin' as q4_0
llama.cpp: loading model from ./models/7B/ggml-model-f16.bin
llama.cpp: saving model to ./models/7B/ggml-model-q4_0.bin
[   1/ 291]                tok_embeddings.weight -     4096 x 32000, type =    f16, quantizing .. size =   250.00 MB ->    70.31 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[   2/ 291]                          norm.weight -             4096, type =    f32, size =    0.016 MB
[   3/ 291]                        output.weight -     4096 x 32000, type =    f16, quantizing .. size =   250.00 MB ->    70.31 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.113 0.119 0.112 0.097 0.076 0.056 0.038 0.025 0.020 
[   4/ 291]         layers.0.attention.wq.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.035 0.012 0.019 0.030 0.047 0.069 0.097 0.129 0.152 0.129 0.098 0.070 0.047 0.031 0.019 0.016 
[   5/ 291]         layers.0.attention.wk.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.035 0.012 0.020 0.032 0.049 0.072 0.098 0.125 0.139 0.125 0.099 0.072 0.050 0.033 0.021 0.017 
[   6/ 291]         layers.0.attention.wv.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.015 0.024 0.037 0.055 0.075 0.096 0.114 0.124 0.114 0.096 0.075 0.055 0.038 0.024 0.020 
[   7/ 291]         layers.0.attention.wo.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.013 0.021 0.033 0.051 0.073 0.099 0.123 0.133 0.123 0.099 0.073 0.051 0.033 0.021 0.018 
[   8/ 291]       layers.0.attention_norm.weight -             4096, type =    f32, size =    0.016 MB
[   9/ 291]      layers.0.feed_forward.w1.weight -     4096 x 11008, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  10/ 291]      layers.0.feed_forward.w2.weight -    11008 x  4096, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.036 0.015 0.025 0.039 0.056 0.077 0.097 0.112 0.117 0.112 0.097 0.077 0.056 0.039 0.025 0.021 
[  11/ 291]      layers.0.feed_forward.w3.weight -     4096 x 11008, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.096 0.112 0.118 0.112 0.097 0.077 0.056 0.039 0.025 0.021 
[  12/ 291]             layers.0.ffn_norm.weight -             4096, type =    f32, size =    0.016 MB
[  13/ 291]         layers.1.attention.wq.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.015 0.024 0.038 0.055 0.076 0.097 0.113 0.120 0.114 0.097 0.076 0.055 0.038 0.024 0.020 
[  14/ 291]         layers.1.attention.wk.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.015 0.024 0.038 0.055 0.076 0.097 0.114 0.121 0.114 0.097 0.076 0.055 0.037 0.024 0.020 
[  15/ 291]         layers.1.attention.wv.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.015 0.024 0.037 0.055 0.076 0.097 0.114 0.122 0.115 0.097 0.076 0.055 0.037 0.024 0.020 
[  16/ 291]         layers.1.attention.wo.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.013 0.021 0.033 0.050 0.072 0.098 0.124 0.136 0.124 0.098 0.072 0.050 0.033 0.021 0.018 
[  17/ 291]       layers.1.attention_norm.weight -             4096, type =    f32, size =    0.016 MB
[  18/ 291]      layers.1.feed_forward.w1.weight -     4096 x 11008, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  19/ 291]      layers.1.feed_forward.w2.weight -    11008 x  4096, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.037 0.016 0.025 0.039 0.056 0.077 0.096 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  20/ 291]      layers.1.feed_forward.w3.weight -     4096 x 11008, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  21/ 291]             layers.1.ffn_norm.weight -             4096, type =    f32, size =    0.016 MB
[  22/ 291]         layers.2.attention.wq.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.015 0.025 0.039 0.056 0.077 0.097 0.112 0.118 0.112 0.097 0.077 0.056 0.038 0.025 0.020 
[  23/ 291]         layers.2.attention.wk.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.015 0.024 0.038 0.055 0.076 0.097 0.114 0.121 0.114 0.097 0.076 0.055 0.038 0.024 0.020 
[  24/ 291]         layers.2.attention.wv.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.016 0.025 0.039 0.056 0.076 0.096 0.112 0.119 0.112 0.096 0.076 0.056 0.039 0.025 0.021 
[  25/ 291]         layers.2.attention.wo.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.077 0.097 0.113 0.118 0.112 0.097 0.077 0.056 0.038 0.025 0.020 
[  26/ 291]       layers.2.attention_norm.weight -             4096, type =    f32, size =    0.016 MB
[  27/ 291]      layers.2.feed_forward.w1.weight -     4096 x 11008, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  28/ 291]      layers.2.feed_forward.w2.weight -    11008 x  4096, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  29/ 291]      layers.2.feed_forward.w3.weight -     4096 x 11008, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  30/ 291]             layers.2.ffn_norm.weight -             4096, type =    f32, size =    0.016 MB
[  31/ 291]         layers.3.attention.wq.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.112 0.119 0.112 0.096 0.076 0.056 0.038 0.025 0.021 
[  32/ 291]         layers.3.attention.wk.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.113 0.119 0.112 0.097 0.076 0.056 0.038 0.025 0.020 
[  33/ 291]         layers.3.attention.wv.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.096 0.112 0.118 0.112 0.096 0.076 0.056 0.039 0.025 0.021 
[  34/ 291]         layers.3.attention.wo.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.097 0.111 0.116 0.111 0.097 0.077 0.057 0.039 0.025 0.021 
[  35/ 291]       layers.3.attention_norm.weight -             4096, type =    f32, size =    0.016 MB
[  36/ 291]      layers.3.feed_forward.w1.weight -     4096 x 11008, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  37/ 291]      layers.3.feed_forward.w2.weight -    11008 x  4096, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.097 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  38/ 291]      layers.3.feed_forward.w3.weight -     4096 x 11008, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  39/ 291]             layers.3.ffn_norm.weight -             4096, type =    f32, size =    0.016 MB
[  40/ 291]         layers.4.attention.wq.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.015 0.025 0.039 0.056 0.076 0.097 0.112 0.118 0.112 0.096 0.077 0.056 0.039 0.025 0.021 
[  41/ 291]         layers.4.attention.wk.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.112 0.119 0.112 0.097 0.076 0.056 0.038 0.025 0.020 
[  42/ 291]         layers.4.attention.wv.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.096 0.112 0.118 0.112 0.096 0.076 0.056 0.039 0.025 0.021 
[  43/ 291]         layers.4.attention.wo.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.015 0.025 0.039 0.056 0.077 0.097 0.112 0.117 0.112 0.097 0.077 0.056 0.039 0.025 0.021 
[  44/ 291]       layers.4.attention_norm.weight -             4096, type =    f32, size =    0.016 MB
[  45/ 291]      layers.4.feed_forward.w1.weight -     4096 x 11008, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  46/ 291]      layers.4.feed_forward.w2.weight -    11008 x  4096, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  47/ 291]      layers.4.feed_forward.w3.weight -     4096 x 11008, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  48/ 291]             layers.4.ffn_norm.weight -             4096, type =    f32, size =    0.016 MB
.
.
.

Third question. I also have a question about timing. I thought that while the model was generating text, steering.bin would be written every time a token was generated, i.e., on every inference step, but in fact it was only written once at the beginning. Also, the values in steering.bin were all 0 after the first 3 entries. What is wrong with my understanding?

I have not yet tried adding the vector back at an intermediate layer during inference, and will ask about that in the future.

Best regards.

SlyEcho commented 1 year ago

Well, you should read up on what the steering experiment was about after all; it is not a normal operation for llama.cpp. But I brought it up because of the way it works: it first reads from some arbitrary layer's output (or input, depending on the perspective), and it also writes it back later.

The steering works in multiple passes. First, the positive and negative steering strings are processed; at this time the interceptor reads the embedding vectors from the layer input (the layer number is configurable from the command line). They are accumulated into the same output vector (multiplied by +1.0 when processing the positive string, and by -1.0 for the negative one). This gives one vector, the "steering vector", which is written into the .bin file.

Then the program runs normally as before, except that now the steering vector is read back in and added, multiplied by a user-definable coefficient (in theory: more positive means a stronger steering effect, negative means the opposite effect, zero means no effect). We also experimented with injecting the steering vector into a different layer than the one it was extracted from, which may give a different effect (we didn't really have time to study this).
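
Conceptually it is just this (not the actual branch code, only the idea, with placeholder names):

#include <cstddef>

// pass 1: while processing the --steering-add / --steering-sub prompts,
// accumulate the layer input into the steering vector; sign is +1.0f for the
// positive string and -1.0f for the negative one
static void steering_accumulate(float * steering, const float * layer_input,
                                size_t n, float sign) {
    for (size_t i = 0; i < n; ++i) {
        steering[i] += sign * layer_input[i];
    }
}

// pass 2: during normal generation, add the steering vector back to the layer
// input, scaled by the --steering-mul coefficient
static void steering_inject(float * layer_input, const float * steering,
                            size_t n, float mul) {
    for (size_t i = 0; i < n; ++i) {
        layer_input[i] += mul * steering[i];
    }
}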

But the main idea of how to mess with the vectors:

  1. Create a big enough storage object in the llama context object (like a C++ vector of floats with size n_ctx * n_embd).
  2. Create a ggml tensor at the beginning of the evaluation for using the data in the llama code. It doesn't have to cover the whole context, only the N elements of the batch.
  3. Copy the existing data from the storage into that tensor using memcpy(), from the appropriate place: the context offset is n_past and the size is N.
  4. In the llama evaluation code you can do operations with tensors using the ggml functions. ggml does not immediately calculate the numbers; it creates a graph first, which is computed later (this allows doing training and optimization like the big boy libraries TensorFlow and PyTorch do).
  5. The result has to be copied to the tensor created in step 2, then ggml_build_forward_expand() called on it.
  6. After the graph is computed with ggml_graph_compute(), copy the data into the storage created in step 1 at the appropriate location.

If you don't need to read and write at the same time, some steps are optional. For example, embedding.cpp only reads data, so it is simpler.
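
To make those steps a bit more concrete, here is a very rough, untested sketch in the style of llama_eval_internal(). The ggml calls exist, but exact signatures can differ between versions, and names like my_storage and my_tensor are placeholders:

// (1) storage created once in the llama context,
//     e.g. std::vector<float> my_storage of size n_ctx * n_embd

// (2) a ggml tensor covering only the current batch of N tokens
struct ggml_tensor * my_tensor = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_embd, N);

// (3) copy the relevant slice of the storage into the tensor
memcpy(my_tensor->data,
       my_storage.data() + (size_t) n_past * n_embd,
       (size_t) N * n_embd * sizeof(float));

// (4) build ggml operations on it together with the layer tensors
//     (nothing is computed here yet, only the graph is built), e.g.:
inpL = ggml_add(ctx0, inpL, my_tensor);

// (5) to read something back, copy the result into the batch tensor and make
//     sure it is part of the graph
struct ggml_tensor * my_result = ggml_cpy(ctx0, inpL, my_tensor);
ggml_build_forward_expand(&gf, my_result);

// (6) after ggml_graph_compute() has run, copy the computed data back out
memcpy(my_storage.data() + (size_t) n_past * n_embd,
       my_tensor->data,
       (size_t) N * n_embd * sizeof(float));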

github-actions[bot] commented 5 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.