Open Davidyao99 opened 9 months ago
The model is trained in 16-bit float precision, so loading it in 4-bit or 8-bit precision reduces GPU memory usage and can make inference faster.
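For reference, this is roughly how 4-bit/8-bit loading is usually wired up with Hugging Face transformers and bitsandbytes; the repo's cli.py may do it differently, and the model path below is only a placeholder:

```python
# Minimal sketch (not this repo's exact code) of 4-bit / 8-bit quantized loading.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_path = "path/to/model"  # placeholder, not a real checkpoint name

# 8-bit quantized weights (roughly halves memory vs. fp16)
quant_8bit = BitsAndBytesConfig(load_in_8bit=True)

# 4-bit quantized weights (NF4); compute still runs in fp16
quant_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quant_8bit,  # or quant_4bit
    device_map="auto",
)
```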
conv-mode selects the conversation (prompt) template used to format the dialogue. max-new-tokens is the maximum number of new tokens the model will generate in its reply.
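A hedged sketch of what these two flags control; the template strings here are illustrative only (the real templates live in the repo's conversation module):

```python
# Illustrative only, not the repo's actual cli.py logic:
# conv-mode picks which template wraps the user message,
# max-new-tokens caps how many tokens model.generate() may produce.
CONV_TEMPLATES = {
    "llava_v1": "USER: {prompt} ASSISTANT:",  # illustrative template
    "plain": "{prompt}",
}

def build_prompt(user_message: str, conv_mode: str = "llava_v1") -> str:
    return CONV_TEMPLATES[conv_mode].format(prompt=user_message)

# Typical use with a tokenizer/model already loaded:
# inputs = tokenizer(build_prompt("Describe the video."), return_tensors="pt").to(model.device)
# output_ids = model.generate(**inputs, max_new_tokens=512)  # max-new-tokens caps reply length
```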
Would a model loaded in 8-bit get lower performance than 16-bit on downstream tasks? @LinB203
Great work! May I clarify what the different parameters in cli.py do?
Specifically, what do load-4bit, load-8bit, conv-mode, and max-new-tokens do?
Thank you! I'm trying to understand the parameters better so that I can tune them for my specific task! ;)