[Closed] jaemin-han closed this issue 6 months ago
Thanks a lot for the kind reply!! Do you mind if I ask a few more questions about it?
Does the x1.99 acceleration come from the evaluation code in evaluate_sum? I tried to reproduce the result, but we only achieved ~x1.54 acceleration (paper: 1.722) on XSum with Llama-2-70B, and ~x1.34 acceleration (paper: 1.435) with your evaluate_sum.ipynb code. We used 2 A100 80GB GPUs, the same as your setting.
Does "model-level draft model" mean that the same draft model can be used for different tasks (task-agnostic)? Could the same skipped-layer indices be applicable not only to summarization but also to QA, translation, or instruction-following tasks?
Also, is there a specific or empirical reason you didn't use speculative "sampling" (the rejection-sampling-based method) in your paper and evaluation?
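For context on the speculative sampling question: the rejection-sampling verification rule from the standard speculative sampling papers can be sketched as below. This is a generic NumPy illustration, not code from this repository; the 4-token vocabulary and the two distributions are made up for the example.

```python
import numpy as np

def speculative_accept(p_target, q_draft, drafted_token, rng):
    """Rejection-sampling verification step of speculative sampling.

    p_target / q_draft: probability vectors over the vocabulary from the
    target and draft models at the current position. The drafted token is
    accepted with probability min(1, p/q); on rejection, a replacement is
    resampled from the residual distribution max(p - q, 0), renormalized.
    This keeps the output distributed exactly as the target model's.
    """
    p, q = p_target[drafted_token], q_draft[drafted_token]
    if rng.random() < min(1.0, p / q):
        return drafted_token, True
    residual = np.maximum(p_target - q_draft, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p_target), p=residual)), False

# Tiny made-up example: 4-token vocabulary, uniform draft distribution.
rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.1, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])
token, accepted = speculative_accept(p, q, drafted_token=0, rng=rng)
```

Greedy verification, by contrast, only needs an exact-match check between the drafted token and the target model's argmax, which is why matchness is the relevant statistic in the logs below.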
Thanks for the fast and kind reply!
Thanks!! Is there a large difference in mean matchness compared to the x1.99 you mentioned? Also, is the new version in this repository, or will a not-yet-released version be published?
We only maintain this repository. We wonder whether your version of transformers has flash attention, which would reduce the proportion of attention overhead. The logs of our two datasets on 70B are as follows:
CNN/DM: data 999,{'mean rouge-2 base': '0.1308', 'mean rouge-2 essg autoth 0.5820 alpha 0.85': '0.1305', 'mean rouge-2 essg autoth 0.6930 alpha 0.90': '0.1308', 'mean time base': '114.3865', 'mean time essg autoth 0.5820 alpha 0.85': '57.4298', 'mean time essg autoth 0.6930 alpha 0.90': '58.6864', 'E2E mean speed up essg autoth 0.5820 alpha 0.85': '1.9918', 'E2E mean speed up essg autoth 0.6930 alpha 0.90': '1.9491', 'mean token time base': '0.2234', 'mean token time essg autoth 0.5820 alpha 0.85': '0.1122', 'mean token time essg autoth 0.6930 alpha 0.90': '0.1146', 'E2E mean token speed up essg autoth 0.5820 alpha 0.85': '1.9918', 'E2E mean token speed up essg autoth 0.6930 alpha 0.90': '1.9491', 'mean matchness essg autoth 0.5820 alpha 0.85': '0.9279', 'mean matchness essg autoth 0.6930 alpha 0.90': '0.9392', 'mean num_drafted_tokens essg autoth 0.5820 alpha 0.85': '467.8750', 'mean num_drafted_tokens essg autoth 0.6930 alpha 0.90': '452.1800'}
XSum: data 999,{'mean rouge-2 base': '0.1188', 'mean rouge-2 essg autoth 0.9550 alpha 0.85': '0.1187', 'mean rouge-2 essg autoth 0.9690 alpha 0.90': '0.1181', 'mean time base': '105.8600', 'mean time essg autoth 0.9550 alpha 0.85': '67.9628', 'mean time essg autoth 0.9690 alpha 0.90': '71.4910', 'E2E mean speed up essg autoth 0.9550 alpha 0.85': '1.5576', 'E2E mean speed up essg autoth 0.9690 alpha 0.90': '1.4807', 'mean token time base': '0.2068', 'mean token time essg autoth 0.9550 alpha 0.85': '0.1327', 'mean token time essg autoth 0.9690 alpha 0.90': '0.1396', 'E2E mean token speed up essg autoth 0.9550 alpha 0.85': '1.5576', 'E2E mean token speed up essg autoth 0.9690 alpha 0.90': '1.4807', 'mean matchness essg autoth 0.9550 alpha 0.85': '0.8697', 'mean matchness essg autoth 0.9690 alpha 0.90': '0.8702', 'mean num_drafted_tokens essg autoth 0.9550 alpha 0.85': '409.8460', 'mean num_drafted_tokens essg autoth 0.9690 alpha 0.90': '382.9690'}
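As a sanity check on these numbers, the logged 'E2E mean speed up' fields are simply the ratio of the baseline time to the self-speculative time. A minimal sketch using the CNN/DM data-999 entry above (only the three relevant fields are copied; the key names are taken verbatim from the log):

```python
# Recompute the E2E speedup from the logged fields of the CNN/DM
# data-999 entry; the ratio of 'mean time base' to the essg time
# should reproduce the logged 'E2E mean speed up' value.
log = {
    "mean time base": 114.3865,
    "mean time essg autoth 0.5820 alpha 0.85": 57.4298,
    "E2E mean speed up essg autoth 0.5820 alpha 0.85": 1.9918,
}
speedup = log["mean time base"] / log["mean time essg autoth 0.5820 alpha 0.85"]
print(f"{speedup:.4f}")  # matches the logged 1.9918
```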
OK, the base model generation speed could be the reason. Could you provide the mean of the 'mean token time base' column over the 1000 evaluation examples, for both XSum and CNN/DM?
data 999,{'mean rouge-2 base': '0.1298', 'mean rouge-2 essg autoth': '0.1296', 'mean time base': '65.1325', 'mean time essg autoth': '42.4775', 'E2E mean speed up essg autoth': '1.5333', 'mean token time base': '0.1272', 'mean token time essg autoth': '0.0830', 'E2E mean token speed up essg autoth': '1.5333', 'mean matchness essg autoth': '0.9243', 'mean num_drafted_tokens essg autoth': '449.0070'}
This is our output; we ran it on 2 A100 80GB GPUs.
Once the results stabilize: approximately 114 for CNN/DM and 106 for XSum.
You mean 114 ms and 106 ms? In the data-999 logs you showed, 'mean token time base' is 0.2068 = 206.8 ms for XSum and 0.2234 = 223.4 ms for CNN/DM.
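For reference, the per-token times in the logs are in seconds per token, so they convert as follows (a trivial sketch using the two baseline values quoted above):

```python
# Convert the data-999 baseline per-token times (seconds/token) into
# ms/token and tokens/second.
base_s_per_token = {"CNN/DM": 0.2234, "XSum": 0.2068}
converted = {
    name: (t * 1000.0, 1.0 / t)  # (ms per token, tokens per second)
    for name, t in base_s_per_token.items()
}
for name, (ms, tps) in converted.items():
    print(f"{name}: {ms:.1f} ms/token, {tps:.1f} tokens/s")
```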
Sorry, I mistakenly read it as the total time. At data 0, data 499, and data 999, the corresponding columns are: CNN/DM: 0.2700, 0.2246, 0.2234; XSum: 0.1948, 0.2079, 0.2068.
For the same data points, our corresponding columns are: XSum: 0.1215, 0.1221, 0.1218; CNN/DM: 0.1482, 0.1280, 0.1272.
I actually ran the same code, but the baseline speed seems much slower in your results...
print(transformers.__version__)
4.34.0
This is my version of the transformers package, and I didn't use the use_flash_attn_2 option when measuring the baseline speed. Even if something differs between your setup and mine, this amount of latency difference seems abnormal.
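To make the environment comparison easier, here is a small stdlib-only sketch that reports the installed versions of the packages in question. It assumes that flash attention, if present, is installed under the package name flash_attn; the helper name is just for illustration.

```python
from importlib import metadata

def env_report(pkgs=("transformers", "flash_attn")):
    """Return installed versions of the given packages, or None if absent.

    Flash attention changes the share of decoding time spent in attention,
    so whether flash_attn is installed can shift the baseline speed.
    """
    report = {}
    for pkg in pkgs:
        try:
            report[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            report[pkg] = None
    return report

print(env_report())
```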
When it is convenient for you, you could try transformers 4.33.1, or rerun the BO search in your environment. With so many layers skipped and a relatively high acceptance rate, the speedup shouldn't be this low. Of course, we will investigate when we have time.
I am exploring how to make the LLaMA-2 70B model run faster and found the 'skip_layers.json' file in your GitHub repository. Are the layers listed in this file the same as those used for the experiments in Table 1 of your paper?
Also, were these layers used for both the XSum and CNN/DM datasets?
One more thing: I saw experiments with the LLaMA-2 13B model for chat, but not with the LLaMA-2 70B model. Can you explain why the LLaMA-2 70B model was not used for chat?
Thank you for your help.