Right now, the max_generate_tokens option limits the total number of tokens a given request can return.
The desired behaviour is that it should limit the number of tokens produced by a single generation, so that each generation runs until it hits either the length limit or the generation limit.
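A minimal sketch of the desired per-generation accounting, under assumptions: GenerationBudget, max_length and tokens_allowed are hypothetical names for illustration, not identifiers from the codebase, and the length limit is assumed to count every token in the request so far.

```python
from dataclasses import dataclass


@dataclass
class GenerationBudget:
    """Hypothetical per-generation token budget (not the project's actual API)."""

    max_generate_tokens: int  # cap applied to each individual generation
    max_length: int           # assumed overall length limit for the request

    def tokens_allowed(self, current_length: int) -> int:
        """Return how many tokens the next generation may produce."""
        remaining_length = max(0, self.max_length - current_length)
        # Whichever limit is hit first ends the generation.
        return min(self.max_generate_tokens, remaining_length)


budget = GenerationBudget(max_generate_tokens=4, max_length=10)
# First generation starts at length 3: the generation limit applies.
first = budget.tokens_allowed(current_length=3)
# Second generation starts where the first stopped: the length limit applies.
second = budget.tokens_allowed(current_length=3 + first)
```

The point of the sketch is that the budget is recomputed at the start of each generation rather than being drawn down across the whole request.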
We should refactor buffered token output broadcasts to send only the deltas (which also fixes the n^2 bandwidth problem); that should make this change easier.
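To illustrate why deltas help, here is a rough sketch, assuming one broadcast per generated token. DeltaBroadcaster and its methods are hypothetical names, not the existing broadcast code. Re-sending the whole buffer on every broadcast costs 1 + 2 + ... + n tokens, while sending only the new tokens costs n.

```python
from typing import List


class DeltaBroadcaster:
    """Hypothetical sketch: remembers how much output was already broadcast."""

    def __init__(self) -> None:
        self.tokens: List[str] = []
        self.sent = 0  # number of tokens already broadcast

    def append(self, token: str) -> None:
        self.tokens.append(token)

    def next_delta(self) -> List[str]:
        """Return only the tokens added since the previous broadcast."""
        delta = self.tokens[self.sent:]
        self.sent = len(self.tokens)
        return delta


def compare_bandwidth(n: int) -> tuple:
    """Count tokens sent when broadcasting after each of n generated tokens."""
    broadcaster = DeltaBroadcaster()
    full_total = 0
    delta_total = 0
    for i in range(n):
        broadcaster.append(f"tok{i}")
        full_total += len(broadcaster.tokens)         # whole buffer re-sent
        delta_total += len(broadcaster.next_delta())  # only the new tokens
    return full_total, delta_total


full_total, delta_total = compare_bandwidth(100)
```

Because the broadcaster already tracks how many tokens have been sent, the same counter could plausibly be used to tell where one generation ends and the next begins, though that is an assumption about how the refactor would be wired up.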