abertsch72 / unlimiformer

Public repo for the NeurIPS 2023 paper "Unlimiformer: Long-Range Transformers with Unlimited Length Input"
MIT License
1.05k stars 77 forks

Errors on running llama with `test_datastore` #41

Closed wywyWang closed 11 months ago

wywyWang commented 11 months ago

Dear Authors,

I am running llama2 with Unlimiformer and want to investigate the datastore, which is enabled with --test_datastore, but I ran into some errors:

  1. The function process_key_value in UnlimiformerLLaMa has an error (L1056): the capturers are injected by the function activation_to_capture instead of get_kv_projections. I am not sure whether the solution is to add another capturer that records the outputs of get_kv_projections (see the hook sketch after this list).
  2. I tried using another capturer and it ran successfully on my data; however, it preserved less than 1% of the attention mass rather than 99% (code). Have you tested llama2 on this, or could you point out what I might be doing incorrectly?
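For reference, this is roughly the forward-hook "capturer" pattern I mean; the class and variable names below are hypothetical, not the repo's actual internals:

```python
import torch

# Hypothetical sketch of a "capturer": a forward hook that records a
# module's output so it can be inspected later. Unlimiformer's real
# capturer classes differ; this only illustrates the pattern.
class OutputCapturer:
    def __init__(self, module: torch.nn.Module):
        self.captured = None
        self.handle = module.register_forward_hook(self._hook)

    def _hook(self, module, inputs, output):
        # e.g., the k/v projection outputs one would want to record
        self.captured = output

    def remove(self):
        self.handle.remove()
```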

Thank you!

urialon commented 11 months ago

Hi @wywyWang , Thank you for your interest in our work!

I'm not sure how you ran into problems 1 and 2. What exactly is the command line that you ran?

We have tested llama2 on this, as described in the README. What happens if you run our exact command line?

Best, Uri

wywyWang commented 11 months ago

Hi @urialon ,

Thanks for your response! I got the same issues when running the example command with the GPU disabled, as below. It runs successfully without --test_datastore. The first issue occurs because llama2 does not enter L119, so neither the k nor the v projection matrices are available.

python src/run_generation.py --model_type llama --model_name_or_path src/llama-2-7b-chat-hf --prefix "<<SYS>>\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n<</SYS>>\n\n [INST] Summarize the following book: " --prompt example_inputs/harry_potter.txt --suffix " [/INST]" --test_unlimiformer --length 200 --layer_begin 16 --index_devices 0 --datastore_device 0 --no_cuda --test_datastore

Additional note: I set the default values of gpu_datastore and gpu_index to False, and the preserved attention mass is 0.2% in this case.

urialon commented 11 months ago

Hi @wywyWang ,

  1. What exactly is the command line that you ran?

  2. What happens if you run our exact command line from the README?

  3. What happens if you run our exact command line from the README but with --gpu_datastore=False --gpu_index=False?

Best, Uri

wywyWang commented 11 months ago
  1. The exact command is in my previous reply and is the same as the README's. Slight modifications such as the model path (loading from a local directory rather than Hugging Face) don't affect the result.
  2. It produces the summarized results and everything is fine.
  3. Same. The problem only occurs when adding --test_datastore. Don't you hit the error when running the exact command line from the README with --test_datastore?

Best, Wei-Yao

urialon commented 11 months ago

Hi @wywyWang ,

Thanks for the clarification. --test_datastore is kind of an internal assertion that we used during debugging and can be removed. It is OK if it doesn't work anymore.

Thanks for highlighting this, we need to clean the code and remove it.

Best, Uri

wywyWang commented 11 months ago

Hi @urialon ,

Thanks for the information! I was wondering for which datasets and models 99% of the attention mass is preserved. Is this property not guaranteed? Thank you.

urialon commented 11 months ago

In our initial experiments, it happened with the BART models and summarization datasets we used in the paper.

But since you're saying that only 1% is preserved in your case, I think that we're not talking about the same thing.

What we meant in the paper is the following:

Suppose that you have the attention scores over the full sequence of 20,000 keys, after softmax, for a given query (one cross-attention head in a certain decoder layer at a certain decoding timestep). This is a vector of 20,000 numbers, all of them between 0 and 1, and they sum to 1.

Now, suppose that you mask (set to zero) all these scores, except for the highest 1000 of them. What is the sum of the remaining 1000 scores?

If they were initially all equal to 1/20000, you would preserve 1/20 = 5% of the probability mass by keeping only 1000 of them. So it's already higher than 1%.

In practice, the attention score distribution is so concentrated on a subset of the keys that keeping only the top-1000 and summing them results in about 0.99 (and in practice these kept scores are re-normalized so that they sum to exactly 1).
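In PyTorch terms, the computation is roughly the following (a minimal sketch; the scores here come from a random vector, not real attention):

```python
import torch

seq_len, k = 20_000, 1_000

# Post-softmax attention scores for one head at one decoding step:
# seq_len values in [0, 1] that sum to 1.
scores = torch.softmax(torch.randn(seq_len), dim=-1)

# Keep only the k highest scores; their sum is the preserved attention mass.
topk_scores, _ = scores.topk(k)
preserved = topk_scores.sum().item()
print(f"top-{k} of {seq_len} keys preserves {preserved:.1%} of the mass")

# In practice the kept scores are re-normalized to sum to exactly 1.
renormalized = topk_scores / topk_scores.sum()
```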

Does that make sense?

wywyWang commented 11 months ago

Thank you for the detailed explanation. I was misled by the assertion, which produced a very low value. Since you said it was for internal use, this makes sense to me now.

Thank you!