Max tokens limits responses on a given request_id

exo-explore / exo

Run your own AI cluster at home with everyday devices 📱💻 🖥️⌚

GNU General Public License v3.0


Max tokens limits responses on a given request_id #129

Open AlexCheema opened 3 months ago

AlexCheema commented 3 months ago

Right now, the max_generate_tokens option limits the total number of tokens a given request_id can return. The desired behaviour is that it limits the number of tokens in a single generation, so each generation runs until we hit either the length limit or the generation limit.
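To make the intended behaviour concrete, here is a rough sketch of a per-generation counter. Everything except max_generate_tokens is hypothetical: the GenerationLimiter class, its method names, and the max_length parameter (standing in for "the length limit") are illustrative and not taken from the exo codebase.

```python
from dataclasses import dataclass, field


@dataclass
class GenerationLimiter:
    """Hypothetical helper: counts tokens per generation, not per request_id."""

    max_generate_tokens: int  # cap on tokens produced by one generation
    max_length: int           # assumed overall length limit (prompt + output)
    _generated: dict = field(default_factory=dict)

    def start_generation(self, request_id: str) -> None:
        # Reset the counter each time a new generation begins on this
        # request_id, so earlier generations do not eat into the budget.
        self._generated[request_id] = 0

    def should_stop(self, request_id: str, total_length: int) -> bool:
        """Record one newly generated token and report whether to stop."""
        count = self._generated.get(request_id, 0) + 1
        self._generated[request_id] = count
        hit_generation_limit = count >= self.max_generate_tokens
        hit_length_limit = total_length >= self.max_length
        return hit_generation_limit or hit_length_limit
```

With this shape, a second generation on the same request_id starts from a fresh budget after start_generation is called again, instead of being cut short by tokens already returned.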

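We should refactor the buffered token output broadcasts to send only the deltas, which should make this easier. It also fixes the n^2 bandwidth problem: resending the whole buffer on every new token costs bandwidth that grows quadratically with the number of tokens.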
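A small sketch of the difference, assuming the current broadcasts carry the whole buffer each time. The function and class names here are made up for illustration and do not correspond to exo's actual broadcast API.

```python
def full_buffer_broadcasts(tokens):
    """Assumed current behaviour: each broadcast repeats every token so far."""
    return [tokens[: i + 1] for i in range(len(tokens))]


def delta_broadcasts(tokens):
    """Proposed behaviour: each broadcast carries only the newly added token."""
    return [[token] for token in tokens]


def tokens_sent(broadcasts):
    """Total number of tokens put on the wire across all broadcasts."""
    return sum(len(broadcast) for broadcast in broadcasts)


class DeltaReceiver:
    """Hypothetical receiver that rebuilds the full buffer from deltas."""

    def __init__(self):
        self.buffer = []

    def on_delta(self, delta):
        # Append rather than replace, since only new tokens arrive.
        self.buffer.extend(delta)


tokens = list(range(100))
print(tokens_sent(full_buffer_broadcasts(tokens)))  # 5050, i.e. n(n+1)/2
print(tokens_sent(delta_broadcasts(tokens)))        # 100, i.e. n
```

The trade-off is that receivers now have to keep their own buffer and apply deltas in order, whereas full-buffer broadcasts are stateless on the receiving side.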