awslabs / multi-model-server

Multi Model Server is a tool for serving neural net models for inference
Apache License 2.0
998 stars, 230 forks

Is there any way to yield from MMS asynchronously? #971

Closed collinarnett closed 3 years ago

collinarnett commented 3 years ago

I'm currently using a language model with MMS, and generation is slow on the instance we're running. To alleviate this on the front end, we need to return tokens as soon as they are generated instead of returning the full sequence at once. This way the user gets immediate feedback on their generation rather than waiting for the full sequence to be returned.

Is there any way to accomplish this natively in MMS?

collinarnett commented 3 years ago

I think the solution we'll go with, until MMS supports this natively, is to use AWS Kinesis during inference to stream tokens as they are generated.
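
For anyone landing here with the same problem, a rough sketch of what that workaround could look like is below. It is not taken from our actual handler. The function name `stream_tokens`, the record layout (`request_id`, `index`, `token`, `done`), and the idea of passing the client in as an argument are all assumptions made for illustration. The only AWS call used is the Kinesis `put_record` operation with its `StreamName`, `Data`, and `PartitionKey` parameters.

```python
import json
from typing import Any, Iterable


def stream_tokens(
    kinesis_client: Any,
    stream_name: str,
    request_id: str,
    tokens: Iterable[str],
) -> str:
    """Publish each token to a Kinesis stream as soon as it is produced.

    `kinesis_client` is assumed to expose `put_record(StreamName=..., Data=...,
    PartitionKey=...)`, as a boto3 Kinesis client does. `tokens` is any
    iterable, ideally a generator that yields tokens while the model runs.
    Returns the full generated text so the normal MMS response still works.
    """
    generated = []
    for index, token in enumerate(tokens):
        generated.append(token)
        # Hypothetical record layout: the index lets a consumer restore order.
        payload = {"request_id": request_id, "index": index, "token": token}
        kinesis_client.put_record(
            StreamName=stream_name,
            Data=json.dumps(payload).encode("utf-8"),
            # Same key for every token of a request, so they share a shard.
            PartitionKey=request_id,
        )

    # Hypothetical end-of-stream marker so the consumer knows when to stop.
    done = {"request_id": request_id, "index": len(generated), "done": True}
    kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(done).encode("utf-8"),
        PartitionKey=request_id,
    )
    return "".join(generated)
```

Inside a custom MMS handler this would presumably be called with `boto3.client("kinesis")` and a generator wrapping the model's decoding loop, while the front end reads the stream and filters on `request_id`. Issuing one `put_record` call per token adds a network round trip to each generation step, so batching a few tokens per record may be worth trying if that overhead turns out to matter.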