I'm currently serving a language model with MMS, and generation is slow on the instance we're running it on. To mitigate this on the front end, we need to return each token as soon as it is generated instead of returning the whole sequence at the end. That way the user gets immediate feedback on their generation rather than waiting for the full sequence to be returned.
Is there any way to accomplish this natively in MMS?
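To make the difference concrete, here is a framework-agnostic Python sketch. It is not MMS code and uses no MMS API: `fake_next_token` is a hypothetical stand-in for one decoding step of the model. `generate_full` mirrors the current behaviour (nothing is returned until the sequence is complete), while `generate_stream` yields each token as soon as it exists. The question is whether MMS can forward output like the generator's to the client incrementally.

```python
from typing import Callable, Iterator, List

# Hypothetical stand-in for one model decoding step (NOT part of MMS):
# returns the next token for a given prefix, ending with "<eos>".
def fake_next_token(prefix: List[str]) -> str:
    vocab = ["Hello", ",", " world", "!", "<eos>"]
    return vocab[min(len(prefix), len(vocab) - 1)]


def generate_full(next_token: Callable[[List[str]], str], max_tokens: int = 10) -> List[str]:
    """Current behaviour: the caller sees nothing until the whole sequence is done."""
    tokens: List[str] = []
    for _ in range(max_tokens):
        tok = next_token(tokens)
        if tok == "<eos>":
            break
        tokens.append(tok)
    return tokens


def generate_stream(next_token: Callable[[List[str]], str], max_tokens: int = 10) -> Iterator[str]:
    """Desired behaviour: each token is handed to the caller as soon as it is produced."""
    tokens: List[str] = []
    for _ in range(max_tokens):
        tok = next_token(tokens)
        if tok == "<eos>":
            return
        tokens.append(tok)
        yield tok


if __name__ == "__main__":
    # The streaming consumer can display each token immediately.
    for tok in generate_stream(fake_next_token):
        print(tok, end="", flush=True)
    print()
```

Both functions produce the same tokens; the only difference is when the caller receives them, which is exactly the latency the user perceives on the front end.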