Open ostix360 opened 5 months ago
This PR adds a file containing the minimal code needed to run inference with the model and get a consistent output.
Inference seems very slow for 100 tokens, but the output is consistent.
What do you think?
Hi, thanks for your contribution! I will test it. When you say "slow", is that in comparison to generating the same number of tokens with the base model? Did you include the "thought" tokens in the count?
My tests were run on a 4070 Ti, which is why the script uses `load_in_8bit=True` (to fit the model in the 12 GB of VRAM). The model is Mistral 7B (for the original model I used Mistral Instruct). See the timing sketch after the table.

| model | quiet star | original model |
|---|---|---|
| tokens generated | 400 | 50 |
| useful tokens | 50 | 50 |
| time to generate (s) | 1055 | 17 |
| tokens per second | 0.38 | 2.94 |
| seconds per token | 2.64 | 0.34 |
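For reference, here is a minimal sketch of how such a timing run could look. The checkpoint name, prompt, and generation settings are placeholders, not the exact ones from my script:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; the quiet-star checkpoint from the PR would go here instead.
model_name = "mistralai/Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,   # 8-bit quantization to fit the 7B model in ~12 GB of VRAM
    device_map="auto",
)

prompt = "Explain why the sky is blue."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.time()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=50, do_sample=False)
elapsed = time.time() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.1f}s -> {new_tokens / elapsed:.2f} tok/s")
```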
edit: The big difference between the two generation speeds may be due to how the context is stored in memory.
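If the slowdown really does come from re-processing the stored context at every step, one thing worth double-checking (this is just a guess, not something I have verified) is that the KV cache is enabled during generation, e.g. with the model and inputs from the sketch above:

```python
# Hypothetical check: reuse past key/values instead of re-encoding the whole
# context (prompt + thought tokens) at every decoding step.
output = model.generate(**inputs, max_new_tokens=100, use_cache=True)
```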
Hi @ostix360, thank you so much for the contribution! I've run your inference code, but the output doesn't make much sense to me... Can you explain it a bit more?
This is the whole output; it looks even weirder...