FasterDecoding / REST

REST: Retrieval-Based Speculative Decoding, NAACL 2024

Incompatibility issues with llama-7b model #3

Closed · preminstrel closed this 9 months ago

preminstrel commented 9 months ago

Hello, very interesting research! I ran into some small problems while reproducing it.

First, I reproduced vicuna-7b-v1.5 on MT-Bench (greedy decoding), and the speedup was 1.991, which is good and consistent with the paper.

However, when I switch from vicuna-7b to llama-7b, a lot of ERROR question IDs are generated and the speedup is not as good (1.99 -> 1.81).

Is this method less effective for some models, or does it have special requirements for the choice of datastore?

Thanks for your help and time.

zhenyuhe00 commented 9 months ago

Hi, thanks for your interest in REST.

As for the ERROR question IDs: since the output of REST is identical to that of standard autoregressive decoding, I suppose the errors come from the original llama-7b itself on these question IDs.

As for the speedup not being very good for llama-7b on MT-Bench, I recommend using LLaMA-Chat rather than LLaMA on this conversational dataset, since it aligns better with a datastore built from ShareGPT or UltraChat. If you still want to use the (not instruction-tuned) LLaMA on MT-Bench, you may consider choosing a datastore that aligns better with LLaMA (e.g., The Pile or RedPajama-Data).

The selection of the datastore indeed plays a crucial role in the speedup of REST: the better the datastore is aligned with the model, the larger the speedup. In your case, llama-7b is not well aligned with a datastore built from ShareGPT or UltraChat, which may explain the weaker result.
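To make the alignment point concrete, here is a minimal sketch (not the repo's actual build script; the model id and corpus file name are placeholders) of how one might collect a flat token-level datastore from a plain-text corpus closer to base LLaMA's pretraining data:

```python
# Minimal sketch, NOT the repo's actual datastore builder: tokenize a plain-text
# corpus (e.g., a Pile/RedPajama-style dump) into one flat token-id sequence that
# a suffix array can later index for retrieval.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # placeholder model id

datastore_tokens = []  # flat list of token ids
with open("corpus.txt") as f:  # placeholder corpus file
    for line in f:
        line = line.strip()
        if line:
            datastore_tokens.extend(tokenizer.encode(line, add_special_tokens=False))

# The real implementation builds a suffix array over a sequence like this so that
# the longest suffix of the current context can be matched efficiently.
```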

If you have any further questions, please feel free to contact me.

preminstrel commented 9 months ago

Hello, thank you so much for your reply!

It helps me further understand your method!

Moreover, I noticed that your method basically retrieves draft tokens based on the most recent 16 tokens? That should work well for normal language modeling tasks, but I think it may not be able to handle tasks that depend on information far away, like retrieving a certain password from a 4k context?

If I am wrong, please correct me.

Thanks for your inspiring work!

zhenyuhe00 commented 9 months ago

Hi, according to Algorithm 1 in Section 3.2 of our arXiv paper, we aim to find contexts in D that match the longest suffix of s. So ideally $n_{max}$ is set to the length of the current context and gradually decreases by one until there is a match in the datastore. The search operations are conducted efficiently with the help of a suffix array (as for the code implementation, please refer to this for loop). In practice, we find that setting $n_{max}$ to 16 is enough at our datastore scale.
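As an illustration of the shrinking-suffix lookup described above (a brute-force sketch for clarity, not the repository's implementation, which uses a suffix array), the idea looks roughly like this:

```python
# Illustrative-only sketch of the longest-suffix lookup: try the longest suffix of
# the context first (capped at n_max), shorten it by one token until something in
# the datastore matches, then return the continuations found there as drafts.
from typing import List

def retrieve_drafts(context: List[int], datastore: List[int],
                    n_max: int = 16, draft_len: int = 8) -> List[List[int]]:
    for n in range(min(n_max, len(context)), 0, -1):
        suffix = context[-n:]
        drafts = []
        for i in range(len(datastore) - n + 1):
            if datastore[i:i + n] == suffix:
                continuation = datastore[i + n:i + n + draft_len]
                if continuation:
                    drafts.append(continuation)
        if drafts:
            return drafts  # matches found for the longest available suffix
    return []  # no match: fall back to plain autoregressive decoding
```

With a suffix array, each lookup costs roughly O(n log N) over a datastore of N tokens instead of the linear scan above, which is what makes large datastores practical.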

As for "retrieving a certain password in 4k context", I wonder if you could provide me more details about this task. Do you mean the passkey retrieval task proposed by this paper?

preminstrel commented 9 months ago

The task is like this: you have many key-password pairs. Given one key, you need to find the correct password among the many pairs.

I think the advantage of the method is that it pre-finds common patterns in language modeling. But when it comes to a task like this, you cannot find a specific pattern in the datastore.

preminstrel commented 9 months ago

Like

KEY PASSWORD
123 fdjsakfhska
783 djskalhdfqq
...

You are given KEY=391, what is the PASSWORD?

I think there will be no such pair in the pre-defined datastore?

zhenyuhe00 commented 9 months ago

If there are no substrings of the KEY-PASSWORD pairs in the pre-defined datastore, then yes, the generation speed cannot be accelerated. In cases where the provided context is important, you could consider using the context itself as a retrieval source or adding it to the datastore.
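To make that suggestion concrete, here is a hedged sketch (reusing the toy flat token list and the `retrieve_drafts` helper from the sketches above, not the repository's actual API) of adding the prompt itself to the retrieval pool, using a variant of the example where the queried key does appear in the context:

```python
# Hedged sketch, not the repo's real API: append the prompt's own tokens to the
# retrieval pool before decoding, so suffixes from the in-context KEY/PASSWORD
# table can be matched and their continuations proposed as drafts.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # placeholder model id

prompt = ("KEY PASSWORD\n"
          "123 fdjsakfhska\n"
          "783 djskalhdfqq\n"
          "You are given KEY=123, what is the PASSWORD?")
prompt_tokens = tokenizer.encode(prompt, add_special_tokens=False)

datastore_tokens = []              # pre-built corpus datastore (empty here for brevity)
datastore_tokens += prompt_tokens  # in-context pairs are now retrievable: short suffixes
                                   # such as the key digits can match inside the prompt
                                   # and draft the password tokens that follow them
```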

If you have any further questions, please feel free to contact me.