I looked into the code a bit.
One approach is to have streaming versions of the provider calls. That makes it clear we should be streaming all the way through. Separate code paths for streaming in `complex` and/or `track` may help too; there's a clear spot to branch on `track` vs. `track_streaming`, for example, if that helps us (rough sketch below).
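To make that branch point concrete, here is a minimal sketch. The helper names (`track_streaming`, `invoke_provider*`, `record_invocation`, `collect_into_messages`) are illustrative placeholders, not ell's actual internals:

```python
# Hypothetical sketch of branching between tracked and tracked-streaming calls.
# None of these helper names are ell's real internals.

def call_lmp(lmp, *args, stream=False, **kwargs):
    if stream:
        # Wrap the provider's chunk generator so tracking still sees the full output.
        return track_streaming(lmp, invoke_provider_streaming(lmp, *args, **kwargs))
    return track(lmp, invoke_provider(lmp, *args, **kwargs))

def track_streaming(lmp, chunks):
    collected = []
    for chunk in chunks:
        collected.append(chunk)
        yield chunk  # pass chunks through to the caller as they arrive
    # Once the stream is exhausted, record the reassembled messages with the tracker.
    record_invocation(lmp, collect_into_messages(collected))
```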
How should the API surface at the user level? E.g.:
- Stream as the value returned by an LMP when stream is true: `for c in lmp(a, b, stream=True): ...`
- As a property of whatever result type we return (non-streaming result is `list[Message]`): `result, _ = lmp(a, b, stream=True)` then `for c in result.stream(): ...` (both sketched below)
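A minimal sketch of what each shape could look like from the caller's side; `my_lmp`, the chunk's `.text` attribute, and the `.stream()` method are assumptions for illustration, not existing ell API:

```python
# Option 1: the LMP itself returns an iterator of chunks when stream=True.
for chunk in my_lmp("hello", stream=True):
    print(chunk.text, end="", flush=True)

# Option 2: the call returns a result object that exposes a .stream() iterator;
# the non-streaming path keeps returning list[Message] as today.
result = my_lmp("hello", stream=True)
for chunk in result.stream():
    print(chunk.text, end="", flush=True)
```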
Other approaches: normalize on Python's "lazy sequence" abstraction by forcing a stream API in the provider layer and calling `list` on the result (collecting it) when stream is false. Unfortunately not all provider API calls support streaming, so we'd be artificially forcing this in some cases. Provider code is already stream-aware by default for text; it just collects the stream. So the extra work would be building a faux streaming API in ell (iterate over messages and yield parts of them). Could be something here. We need to review the OpenAI stream chunk data structures and see how they map to ell types. If providers can return a normalized chunk, then ell should be able to (1) yield the chunks when stream is true and (2) collect them into messages otherwise.
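A minimal sketch of that normalization, assuming providers can always expose a common chunk iterator even when the underlying API call doesn't stream (the `Chunk` type and `provider_stream` function are placeholders, not existing ell or OpenAI APIs):

```python
from typing import Iterable


class Chunk:
    """Placeholder for a normalized provider chunk (e.g. a piece of a text block)."""
    def __init__(self, text: str):
        self.text = text


def provider_stream(prompt: str) -> Iterable[Chunk]:
    """Placeholder provider layer: always yields chunks, faking the stream
    for providers whose API doesn't support streaming."""
    for word in prompt.split():
        yield Chunk(word + " ")


def collect_into_message(chunks: Iterable[Chunk]) -> str:
    # In practice this would build ell Message / ContentBlock objects.
    return "".join(c.text for c in chunks)


def call(prompt: str, stream: bool = False):
    chunks = provider_stream(prompt)
    if stream:
        return chunks                    # 1. yield the chunks when stream is true
    return collect_into_message(chunks)  # 2. collect them into a message otherwise
```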
A couple of users have asked about a streaming API, specifically a way to receive chunked output for text blocks and structured output.
How to make this work with tracking?