dilab-zju / self-speculative-decoding

Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding**
Apache License 2.0

Unable to reimplement 1.4x speedup with llama-2-chat #15

Closed hemingkx closed 2 months ago

hemingkx commented 3 months ago

Hi Zhang: Thanks for your nice work! I tried to reproduce the Self-Spec results with the suggested skip layers in skip_layers.json. llama-13b worked fine; however, I could not obtain a significant speedup with llama-2-13b-chat. Here are the results:

```
essg th4: 0.8000
data 0,{'mean rouge-2 base': '0.0496', 'mean rouge-2 essg th 0.8': '0.0496', 'mean time base': '26.6684', 'mean time essg th 0.8': '25.0143', 'E2E mean speed up essg th 0.8': '1.0661', 'mean token time base': '0.0521', 'mean token time essg th 0.8': '0.0575', 'E2E mean token speed up essg th 0.8': '0.9058', 'mean matchness essg th 0.8': '0.7292', 'mean draft tokens essg th 0.8': '3.3150', 'mean matched tokens essg th 0.8': '2.4173'}
essg th4: 0.8000
data 1,{'mean rouge-2 base': '0.1418', 'mean rouge-2 essg th 0.8': '0.1418', 'mean time base': '15.0166', 'mean time essg th 0.8': '14.3971', 'E2E mean speed up essg th 0.8': '1.0430', 'mean token time base': '0.0523', 'mean token time essg th 0.8': '0.0583', 'E2E mean token speed up essg th 0.8': '0.8979', 'mean matchness essg th 0.8': '0.7124', 'mean draft tokens essg th 0.8': '2.3994', 'mean matched tokens essg th 0.8': '1.7248'}
essg th4: 0.8000
data 2,{'mean rouge-2 base': '0.1639', 'mean rouge-2 essg th 0.8': '0.1639', 'mean time base': '12.1483', 'mean time essg th 0.8': '12.1476', 'E2E mean speed up essg th 0.8': '1.0001', 'mean token time base': '0.0545', 'mean token time essg th 0.8': '0.0622', 'E2E mean token speed up essg th 0.8': '0.8755', 'mean matchness essg th 0.8': '0.7011', 'mean draft tokens essg th 0.8': '2.7663', 'mean matched tokens essg th 0.8': '1.9415'}
essg th4: 0.8000
data 3,{'mean rouge-2 base': '0.2048', 'mean rouge-2 essg th 0.8': '0.2048', 'mean time base': '15.8169', 'mean time essg th 0.8': '16.5565', 'E2E mean speed up essg th 0.8': '0.9553', 'mean token time base': '0.0540', 'mean token time essg th 0.8': '0.0612', 'E2E mean token speed up essg th 0.8': '0.8815', 'mean matchness essg th 0.8': '0.7279', 'mean draft tokens essg th 0.8': '2.9630', 'mean matched tokens essg th 0.8': '2.1739'}
essg th4: 0.8000
data 4,{'mean rouge-2 base': '0.1871', 'mean rouge-2 essg th 0.8': '0.1871', 'mean time base': '13.6241', 'mean time essg th 0.8': '14.4017', 'E2E mean speed up essg th 0.8': '0.9460', 'mean token time base': '0.0534', 'mean token time essg th 0.8': '0.0612', 'E2E mean token speed up essg th 0.8': '0.8731', 'mean matchness essg th 0.8': '0.7133', 'mean draft tokens essg th 0.8': '2.8406', 'mean matched tokens essg th 0.8': '2.0473'}
essg th4: 0.8000
```
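For readers skimming the log fields: my reading of how the reported numbers relate to one another, checked against the `data 0` entry above. The function name is mine, not from the repo's evaluation code.

```python
# Sketch: how the logged per-dataset metrics appear to relate (my reading
# of the log fields above, not the repo's exact evaluation code).
def summarize(time_base, time_essg, draft_tokens, matched_tokens):
    e2e_speedup = time_base / time_essg        # 'E2E mean speed up essg'
    matchness = matched_tokens / draft_tokens  # 'mean matchness essg'
    return e2e_speedup, matchness

# Using the 'data 0' numbers from the log:
speedup, matchness = summarize(26.6684, 25.0143, 3.3150, 2.4173)
# speedup ≈ 1.0661, matchness ≈ 0.7292, matching the logged values
```

So the draft model's acceptance rate (~0.70–0.73 matchness) looks healthy; the issue is that the per-token time of the accelerated run is *higher* than the baseline, which is why the end-to-end speedup hovers around 1.0.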

I wonder if there is something I missed, for example the prompt format (I used the default code in evaluate_sum.ipynb). Your help would be appreciated, thanks a lot!
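Since the prompt format is one suspect: a minimal sketch of the standard Llama-2-chat instruction template, in case the default notebook code feeds the raw article without this wrapping (the function name and that guess are mine, not from the repo):

```python
# Standard Llama-2-chat instruction wrapping; chat-tuned checkpoints can
# behave very differently if the raw text is passed without it.
def build_chat_prompt(user_msg: str, system_msg: str = "") -> str:
    if system_msg:
        return f"[INST] <<SYS>>\n{system_msg}\n<</SYS>>\n\n{user_msg} [/INST]"
    return f"[INST] {user_msg} [/INST]"
```

(The BOS token is typically prepended by the tokenizer, so it is omitted here.)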

Best regards, hemingkx

junzhang-zj commented 2 months ago

It has been a while since I last ran the chat model; I will get back to you once I have had time to test it.

hemingkx commented 2 months ago

Thanks! After a discussion over WeChat, this issue has been resolved.