dilab-zju / self-speculative-decoding

Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding**
Apache License 2.0

Skipped layers and LLaMA-2 70B chat #8

Closed: jaemin-han closed this issue 6 months ago

jaemin-han commented 7 months ago

I am exploring how to make the LLaMA-2 70B model work faster and found the 'skip_layers.json' file on your GitHub. Are the layers listed in this file the same as those used in Table 1 of your paper for experiments?

Also, were these layers used in both the Xsum and CNN/DM datasets?

One more thing, I saw experiments with the LLaMA-2 13B model for chat, but not with the LLaMA-2 70B model. Can you explain why the LLaMA-2 70B model was not used for chat?

Thank you for your help.

junzhang-zj commented 7 months ago
  1. The layers listed in that file are the updated configuration; with this update it reaches about 1.99x acceleration.
  2. Yes, we build the draft model at the model level, so XSum and CNN/DM share the same skipped layers (a toy sketch of the layer-skipping idea is at the end of this comment).
  3. LLaMA-2 70B chat can be used as well. The adaptation is simple, but we do not have the bandwidth to adapt many more models right now; the conclusion would of course be the same. LLaMA-2-Chat and CodeLLaMA were adopted to show the advantage that we do not need to prepare a separate draft model for an already fine-tuned model.
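To make the "model level" point concrete: the draft model is the original model itself with some attention/MLP sublayers skipped, and skip_layers.json records which ones the search selected. Below is a minimal, self-contained toy sketch of that idea, for illustration only; the class and argument names are hypothetical and this is not the repository's actual implementation.

```python
# Illustrative only: a toy decoder that can skip attention/MLP sublayers by index,
# mimicking the self-drafting idea (not the repository's actual implementation).
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x, skip_attn: bool, skip_mlp: bool):
        if not skip_attn:
            x = x + self.attn(x, x, x, need_weights=False)[0]
        if not skip_mlp:
            x = x + self.mlp(x)
        return x

class ToyDecoder(nn.Module):
    def __init__(self, d: int = 64, n_layers: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(ToyBlock(d) for _ in range(n_layers))

    def forward(self, x, skip_attn_ids=(), skip_mlp_ids=()):
        # Full pass (verification): no skips. Draft pass: skip the searched indices.
        for i, blk in enumerate(self.blocks):
            x = blk(x, i in set(skip_attn_ids), i in set(skip_mlp_ids))
        return x

x = torch.randn(1, 16, 64)
model = ToyDecoder()
verify_out = model(x)                                          # all sublayers
draft_out = model(x, skip_attn_ids=[2, 5], skip_mlp_ids=[3])   # skipped sublayers
```

The verification pass simply runs the same weights with nothing skipped, which is why no separate draft model has to be stored.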
je1lee commented 7 months ago

Thanks a lot for the kind reply! Do you mind if I ask a bit more about it?

  1. Does the 1.99x acceleration come from the evaluation code in evaluate_sum? I tried to reproduce the result, but we only reached about 1.54x (paper: 1.722) and about 1.34x (paper: 1.435) on XSum with LLaMA-2 70B using your evaluate_sum.ipynb code. We use 2 A100 80GB GPUs, the same setting as yours.

  2. Does "model-level draft model" mean that the same draft model can be used for different task(task agnostic)?? Can same skipped layer index could be applicable not only for summarization but for QA, Translation or instruction following task?

Also, is there a specific or empirical reason why you did not use speculative "sampling" (the rejection-sampling-based method) in your paper and evaluation?

junzhang-zj commented 7 months ago
  1. Yes; unless something has changed, the code should be the same. What alpha are you using for adaptive early exiting, and what acceptance rate do you see in the inference results?
  2. No. Our search is "model-level" because we randomly sampled 4 examples each from XSum and CNN/DM during the Bayesian optimization (BO) process. If you apply it to other tasks, we recommend re-running the search; using 4~8 examples would be better.
  3. I did not fully understand the last question; our sampling strategy covers both greedy decoding and nucleus sampling, following previous work (a generic sketch of the verification step follows below).
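For readers of this thread, here is a generic sketch of the two verification modes mentioned in point 3: exact greedy matching and the standard rejection-sampling rule from the speculative-decoding literature. It is illustrative only and not taken from this repository's code; the tensors are assumed to hold per-position draft tokens and model distributions.

```python
# Generic speculative-decoding verification sketch (illustrative, not this repo's code).
import torch

def verify_greedy(draft_tokens: torch.Tensor, target_logits: torch.Tensor) -> int:
    """Count accepted draft tokens under greedy decoding.

    draft_tokens: (k,) token ids proposed by the draft (layer-skipped) model.
    target_logits: (k, vocab) full-model logits at the same positions.
    """
    matches = (draft_tokens == target_logits.argmax(dim=-1)).long()
    return int(matches.cumprod(dim=0).sum())  # length of the longest matching prefix

def verify_sampling(draft_tokens, q_probs, p_probs):
    """Rejection-sampling acceptance, lossless w.r.t. the target distribution p.

    q_probs: (k, vocab) draft probabilities; p_probs: (k, vocab) target probabilities.
    Returns (accepted_token_ids, correction_token_or_None).
    """
    accepted = []
    for t, tok in enumerate(draft_tokens.tolist()):
        p, q = p_probs[t, tok].item(), q_probs[t, tok].item()
        if torch.rand(1).item() < min(1.0, p / q):
            accepted.append(tok)            # keep the draft token
        else:
            # Resample from the residual distribution max(p - q, 0), renormalized.
            residual = torch.clamp(p_probs[t] - q_probs[t], min=0.0)
            correction = int(torch.multinomial(residual / residual.sum(), 1))
            return accepted, correction
    return accepted, None
```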
je1lee commented 7 months ago
  1. Mean matchness was 0.924240 on CNN/DM and 0.868554 on XSum; auto_parameters=[1,0.50,0.90,1e-2,0.90] was our setting.
  2. Thanks for replying!
  3. What I meant was: why not additionally adopt self-speculative sampling ("sss" or "self_speculative_sample" in your code)? Also, I did not clearly understand why the stop-draft threshold is adapted automatically; can you give me a more precise intuition for why it is used? Sometimes a static threshold of 0.8 is nearly even with the auto threshold, and sometimes the static one is faster.

Thanks for the fast and kind reply!

junzhang-zj commented 7 months ago
  1. I suggest you try an alpha of 0.8 for CNN/DM and 0.85 for XSum; these are the updated parameters for our new version.
  2. SSS tends to be used in scenarios that require sampling, such as code generation or solving math problems, where randomness is needed to generate different answers; otherwise greedy decoding is the better choice for most tasks. The adaptive exit is there to avoid case-by-case threshold tuning, since different models and tasks need different thresholds. As can be seen in Figure 5 of our paper, this matters more on new tasks and models, especially when the difficulty of the sentences varies greatly (a hypothetical sketch of such an adaptive update follows below).
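To give some intuition on the adaptive exit discussed above: the stop-draft threshold is the confidence level at which the draft model stops proposing tokens, and the auto variant adjusts it from the observed acceptance rate instead of fixing it (e.g. at 0.8). The exact rule and the meaning of each entry in auto_parameters are defined by the repository and the paper; the function below is only a hypothetical illustration of the general idea, with made-up parameter names (target acceptance, step size, smoothing factor).

```python
# Hypothetical adaptive stop-draft threshold update (for intuition only; the
# repository's actual rule and the meaning of auto_parameters may differ).

def update_threshold(threshold: float,
                     observed_acceptance: float,
                     target_acceptance: float = 0.9,
                     step: float = 1e-2,
                     smoothing: float = 0.9) -> float:
    """Nudge the stop-draft confidence threshold toward a target acceptance rate.

    If too many draft tokens get rejected, raise the threshold so drafting stops
    earlier; if almost everything is accepted, lower it so the draft runs longer.
    """
    if observed_acceptance < target_acceptance:
        proposal = threshold + step   # more conservative: exit drafting sooner
    else:
        proposal = threshold - step   # more aggressive: draft more tokens
    proposal = min(max(proposal, 0.0), 1.0)
    # Smooth the update so one unlucky batch does not swing the threshold.
    return smoothing * threshold + (1.0 - smoothing) * proposal

# Example: acceptance dipped to 0.85 while targeting 0.90
print(round(update_threshold(0.58, observed_acceptance=0.85), 4))  # 0.581
```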
je1lee commented 7 months ago

Thanks! Is there a large difference in mean matchness compared with the 1.99x result you mentioned? Also, does the "new version" refer to this repository, or is there a still-unreleased version to come?

junzhang-zj commented 7 months ago

We only maintain this repository. We wonder whether your version of transformers has FlashAttention enabled, which would reduce the proportion of attention overhead. The logs of our two datasets on 70B are as follows:

CNN/DM: data 999,{'mean rouge-2 base': '0.1308', 'mean rouge-2 essg autoth 0.5820 alpha 0.85': '0.1305', 'mean rouge-2 essg autoth 0.6930 alpha 0.90': '0.1308', 'mean time base': '114.3865', 'mean time essg autoth 0.5820 alpha 0.85': '57.4298', 'mean time essg autoth 0.6930 alpha 0.90': '58.6864', 'E2E mean speed up essg autoth 0.5820 alpha 0.85': '1.9918', 'E2E mean speed up essg autoth 0.6930 alpha 0.90': '1.9491', 'mean token time base': '0.2234', 'mean token time essg autoth 0.5820 alpha 0.85': '0.1122', 'mean token time essg autoth 0.6930 alpha 0.90': '0.1146', 'E2E mean token speed up essg autoth 0.5820 alpha 0.85': '1.9918', 'E2E mean token speed up essg autoth 0.6930 alpha 0.90': '1.9491', 'mean matchness essg autoth 0.5820 alpha 0.85': '0.9279', 'mean matchness essg autoth 0.6930 alpha 0.90': '0.9392', 'mean num_drafted_tokens essg autoth 0.5820 alpha 0.85': '467.8750', 'mean num_drafted_tokens essg autoth 0.6930 alpha 0.90': '452.1800'}

XSum: data 999,{'mean rouge-2 base': '0.1188', 'mean rouge-2 essg autoth 0.9550 alpha 0.85': '0.1187', 'mean rouge-2 essg autoth 0.9690 alpha 0.90': '0.1181', 'mean time base': '105.8600', 'mean time essg autoth 0.9550 alpha 0.85': '67.9628', 'mean time essg autoth 0.9690 alpha 0.90': '71.4910', 'E2E mean speed up essg autoth 0.9550 alpha 0.85': '1.5576', 'E2E mean speed up essg autoth 0.9690 alpha 0.90': '1.4807', 'mean token time base': '0.2068', 'mean token time essg autoth 0.9550 alpha 0.85': '0.1327', 'mean token time essg autoth 0.9690 alpha 0.90': '0.1396','E2E mean token speed up essg autoth 0.9550 alpha 0.85': '1.5576', 'E2E mean token speed up essg autoth 0.9690 alpha 0.90': '1.4807', 'mean matchness essg autoth 0.9550 alpha 0.85': '0.8697', 'mean matchness essg autoth 0.9690 alpha 0.90': '0.8702', 'mean num_drafted_tokens essg autoth 0.9550 alpha 0.85': '409.8460', 'mean num_drafted_tokens essg autoth 0.9690 alpha 0.90': '382.9690'}
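For anyone checking their own runs against these logs: the logged 'E2E mean speed up' matches the ratio of the baseline time to the self-speculative time, so it can be recomputed directly from the pasted dictionaries, e.g.:

```python
# Recompute the end-to-end speedup from the pasted CNN/DM log entry above.
cnn_dm = {
    "mean time base": 114.3865,
    "mean time essg autoth 0.5820 alpha 0.85": 57.4298,
}
speedup = cnn_dm["mean time base"] / cnn_dm["mean time essg autoth 0.5820 alpha 0.85"]
print(f"E2E speedup: {speedup:.4f}")  # 1.9918, matching the logged value
```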

je1lee commented 7 months ago

OK, the base model generation speed could be the reason. Could you provide the mean of the 'mean token time base' column over the 1000 evaluation examples for both XSum and CNN/DM?

data 999,{'mean rouge-2 base': '0.1298', 'mean rouge-2 essg autoth': '0.1296', 'mean time base': '65.1325', 'mean time essg autoth': '42.4775', 'E2E mean speed up essg autoth': '1.5333', 'mean token time base': '0.1272', 'mean token time essg autoth': '0.0830', 'E2E mean token speed up essg autoth': '1.5333', 'mean matchness essg autoth': '0.9243', 'mean num_drafted_tokens essg autoth': '449.0070'}

This is our output; we ran on 2 A100 80GB GPUs.

junzhang-zj commented 7 months ago

Once the results stabilize, approximately 114 for CNN/DM and 106 for XSum.

je1lee commented 7 months ago

Do you mean 114 ms and 106 ms? The data 999 entry you showed gives 0.2068 = 206.8 ms for XSum and 0.2234 = 223.4 ms for CNN/DM under 'mean token time base'.

junzhang-zj commented 7 months ago

Sorry, I mistakenly read it as the total time. The values of that column at data 0, data 499, and data 999 are as follows: CNN/DM: 0.2700, 0.2246, 0.2234; XSum: 0.1948, 0.2079, 0.2068.

je1lee commented 7 months ago

The same data points for the corresponding column on our side: XSum 0.1215, 0.1221, 0.1218; CNN/DM 0.1482, 0.1280, 0.1272.
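Putting the two sets of numbers side by side ('mean token time base' is seconds per token, using the data 999 values quoted in this thread):

```python
# Compare the baseline per-token latencies quoted above (data 999, seconds per token).
base_token_time = {
    "CNN/DM (junzhang-zj)": 0.2234, "XSum (junzhang-zj)": 0.2068,
    "CNN/DM (je1lee)": 0.1272,      "XSum (je1lee)": 0.1218,
}
for name, sec in base_token_time.items():
    print(f"{name}: {sec * 1000:.1f} ms/token, {1.0 / sec:.2f} tokens/s")
```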

I actually ran the same code, but the baseline speed seems much slower on your side...

print(transformers.__version__)
4.34.0

That is my version of the transformers package, and I did not use the use_flash_attn_2 option when measuring the baseline speed. Even if something differs between your setting and mine, this amount of latency difference seems abnormal.
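For reference, one way to check whether FlashAttention 2 is even installed and requestable in a given environment; the exact loading flag changed across transformers releases, so the keywords in the comments below are version-dependent and shown only as an illustration:

```python
# Illustrative environment check: is FlashAttention 2 installed at all?
import importlib.util
import transformers

print("transformers:", transformers.__version__)
print("flash_attn installed:", importlib.util.find_spec("flash_attn") is not None)

# Requesting FA2 at load time (the keyword changed across transformers releases):
#   around 4.34:  AutoModelForCausalLM.from_pretrained(model_id, use_flash_attention_2=True, ...)
#   newer:        AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="flash_attention_2", ...)
```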

junzhang-zj commented 7 months ago

When it is convenient for you, you could try transformers 4.33.1 or re-run the BO search in your environment. With so many layers skipped and a relatively high acceptance rate, the speedup should not be this low. Of course, we will investigate when we have time.