Closed: didadida-r closed this issue 3 months ago.
Fix the max_tokens cache; a similar issue has already been solved, please check it carefully.
@PoTaTo-Mika Hi, I have re-pulled the latest code, but there is no updated code in llama and the recompilation issue still exists. Could you explain more clearly how to fix the max_tokens cache so that recompilation is avoided? Thanks.
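In case it is useful, the general way this kind of recompilation is avoided is to keep every shape that torch.compile sees constant across generate calls, which usually means allocating the KV cache once at a fixed max_tokens upper bound and reusing it, rather than resizing it per request. The sketch below is only an illustration of that idea; the class name, buffer sizes, and update signature are hypothetical and are not the repository's actual code.

```python
# Illustrative only: a statically sized KV cache. If max_tokens (and hence the
# cache shape) changes between generate() calls, torch.compile must re-trace;
# allocating once at a fixed MAX_TOKENS and writing in place keeps every shape
# stable, so the model compiles a single time.
import torch

MAX_TOKENS = 2048            # hypothetical fixed upper bound on sequence length
NUM_HEADS, HEAD_DIM = 8, 64  # hypothetical model dimensions

class StaticKVCache(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Buffers are allocated once at the maximum size and never re-shaped.
        self.register_buffer("k", torch.zeros(1, NUM_HEADS, MAX_TOKENS, HEAD_DIM))
        self.register_buffer("v", torch.zeros(1, NUM_HEADS, MAX_TOKENS, HEAD_DIM))

    def update(self, pos: torch.Tensor, k_new: torch.Tensor, v_new: torch.Tensor):
        # In-place writes at the given positions; output shapes stay constant,
        # so a compiled attention kernel can be reused across requests.
        self.k[:, :, pos] = k_new
        self.v[:, :, pos] = v_new
        return self.k, self.v
```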
Describe the bug
Hi, each time I run generate, the model is recompiled again. Inference does become faster after compilation (47.72 it/s), but once the recompilation time is added, the overall RTF is actually worse.
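To confirm that the slowdown really comes from torch.compile re-tracing, recent PyTorch versions (2.1+) can log every recompilation together with the reason, which is typically a changed tensor shape. This is a generic sketch, not tied to this repository; the tiny compiled function only stands in for the generate path.

```python
import torch

# PyTorch >= 2.1: print a message (with the failed guard) on every recompile.
torch._logging.set_logs(recompiles=True)

@torch.compile
def step(x):
    return torch.relu(x) * 2

step(torch.randn(1, 16))  # first call: traced and compiled
step(torch.randn(1, 32))  # new shape: a recompile is logged, analogous to what a
                          # changed max_tokens does to the compiled generate graph
```

Running with the environment variable TORCH_LOGS=recompiles gives the same output without modifying the code.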
python info:
GPU info:
To Reproduce
Expected behavior
Screenshots / log
Additional context