SeldonIO / seldon-core

An MLOps framework to package, deploy, monitor and manage thousands of production machine learning models
https://www.seldon.io/tech/products/core/

Does seldon core support streaming response? #4942

Open ming-shy opened 1 year ago

ming-shy commented 1 year ago

Does Seldon Core support streaming responses? For example, I have a ChatGPT-like model and I need to stream the model's output back to the client. Thank you so much!!!

agrski commented 1 year ago

Seldon Core v1 and v2 both support gRPC and (to an extent for Core v1) Kafka.

It is then about whether a particular inference protocol supports streaming responses. Core v1 supports various protocols, while Core v2 only supports the Open Inference protocol. Open Inference is a request/response protocol, and as far as I'm aware so are many other common inference protocols (TorchServe, Tensorflow Serving, etc.).

It'd be interesting to hear more about your use case, and whether something like batch outputs (many outputs in a single response) would be a better fit.

jondeandres commented 1 year ago

It'd be useful for cases where you don't want to reply to the client with every model output but instead buffer them. For example, when converting audio to text you might want to return the whole speaker turn; to do that you need some kind of session affinity so that the same pod receives all the audio chunks.

It would be interesting to implement a streaming gRPC method in the Seldon protocol, or alternatively a WebSocket interface. I'd say the gRPC option would be the easier and more sustainable long-term strategy to take; a rough sketch of what that could look like is below.
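For illustration only, here is a minimal sketch of what a server-streaming handler could look like if such a method were added. The `StreamPredict` RPC, the generated module names (`streaming_pb2*`), and the `run_model_incrementally` helper are all hypothetical and are not part of the current Seldon protocol.

```python
# Hypothetical sketch: a server-streaming RPC is NOT part of the current
# Seldon protocol. Module names (streaming_pb2*), the StreamPredict method
# and run_model_incrementally() are illustrative assumptions only.
from concurrent import futures

import grpc

# Assumed to be generated from a hypothetical proto containing:
#   rpc StreamPredict(SeldonMessage) returns (stream SeldonMessage);
import streaming_pb2
import streaming_pb2_grpc


class StreamingModelServicer(streaming_pb2_grpc.StreamingModelServicer):
    def StreamPredict(self, request, context):
        # Yield one response per generated chunk (e.g. one token at a time)
        # instead of buffering the full result into a single reply.
        for chunk in run_model_incrementally(request):  # hypothetical helper
            yield streaming_pb2.SeldonMessage(strData=chunk)


def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
    streaming_pb2_grpc.add_StreamingModelServicer_to_server(
        StreamingModelServicer(), server
    )
    server.add_insecure_port("[::]:9000")
    server.start()
    server.wait_for_termination()
```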

ming-shy commented 1 year ago

How does one call Seldon Core over gRPC from Python code? Is there a more detailed tutorial? I used the Protocol Buffer and gRPC definitions provided on the official site (https://docs.seldon.io/projects/seldon-core/en/latest/reference/apis/prediction.html) to generate the corresponding Python files, but there is no tutorial on how to use these files. Thank you so much!!

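For reference, a minimal sketch of how the generated stubs from that proto are typically used with the Seldon (v1) protocol. It assumes the generated files are named `prediction_pb2.py` / `prediction_pb2_grpc.py`; the endpoint `localhost:8000`, the deployment name `my-deployment`, and the namespace `seldon` are placeholders to adjust for your cluster.

```python
# Sketch only: module names assume the files generated from prediction.proto;
# host/port, deployment name and namespace are assumptions for illustration.
import grpc

import prediction_pb2
import prediction_pb2_grpc

channel = grpc.insecure_channel("localhost:8000")
stub = prediction_pb2_grpc.SeldonStub(channel)

# Build a SeldonMessage carrying a 1x2 tensor as input.
request = prediction_pb2.SeldonMessage(
    data=prediction_pb2.DefaultData(
        names=["f0", "f1"],
        tensor=prediction_pb2.Tensor(shape=[1, 2], values=[0.5, 1.5]),
    )
)

# When calling through the Seldon orchestrator/ingress, metadata headers
# route the request to the right deployment and namespace.
metadata = [("seldon", "my-deployment"), ("namespace", "seldon")]
response = stub.Predict(request=request, metadata=metadata)
print(response)
```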

ukclivecox commented 1 year ago

Certainly we see streaming as key for LLM usage, for token-by-token generation and control. At present, Triton does have streaming gRPC as an extension to the Open Inference Protocol. We need to look at extending this support and how it would work across Seldon Core v2, e.g. not just for Models but also Pipelines.
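As a rough illustration of the Triton extension mentioned above, the `tritonclient` Python package exposes a streaming gRPC interface where responses arrive via a callback. The model name (`my_llm`) and tensor names (`text_input` / `text_output`) below are assumptions that depend on the deployed model's config, and the model must be running in decoupled (streaming) mode.

```python
# Sketch using the tritonclient gRPC streaming API; model and tensor names
# are assumptions and must match your model's config (decoupled mode enabled).
import numpy as np
import tritonclient.grpc as grpcclient


def callback(result, error):
    # Invoked once per streamed response (e.g. per generated token).
    if error:
        print(error)
    else:
        print(result.as_numpy("text_output"))


client = grpcclient.InferenceServerClient(url="localhost:8001")

inp = grpcclient.InferInput("text_input", [1], "BYTES")
inp.set_data_from_numpy(np.array([b"Hello"], dtype=np.object_))

client.start_stream(callback=callback)
client.async_stream_infer(model_name="my_llm", inputs=[inp])
client.stop_stream()
```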