dotnet / extensions

This repository contains a suite of libraries that provide facilities commonly needed when creating production-ready applications.
MIT License

[API Proposal]: Streaming methods for IEmbeddingGenerator #5548

Open azchohfi opened 3 weeks ago

azchohfi commented 3 weeks ago

Background and motivation

The IEmbeddingGenerator interface doesn't support streaming. Streaming mostly makes sense when batching (for remote/cloud implementations) or when using local embedding models, which run much more slowly on the CPU, for example.

API Proposal

using System.Runtime.CompilerServices; // for [EnumeratorCancellation]

namespace Microsoft.Extensions.AI;

public class LocalEmbeddingGenerator : IEmbeddingGenerator<string, Embedding<float>>
{
  public async IAsyncEnumerable<Embedding<float>> GenerateStreamingAsync(
      IEnumerable<string> values,
      EmbeddingGenerationOptions? options = null,
      [EnumeratorCancellation] CancellationToken cancellationToken = default)
  {
      int chunkSize = 128;

      var chunks = values.Chunk(chunkSize);

      foreach (var chunk in chunks)
      {
          cancellationToken.ThrowIfCancellationRequested();

          GeneratedEmbeddings<Embedding<float>> embeddings = await GenerateAsync(chunk, options, cancellationToken).ConfigureAwait(false);

          foreach (var embedding in embeddings)
          {
              cancellationToken.ThrowIfCancellationRequested();

              yield return embedding;
          }
      }
  }
...
}

API Usage

IEmbeddingGenerator<string, Embedding<float>> localEmbeddings = new LocalEmbeddingGenerator();
await foreach (var embedding in localEmbeddings.GenerateStreamingAsync(largeEnumerable, null, cts.Token))
{
   ...
}

Alternative Designs

Such a method is likely to look similar across different implementations, so an extension method might suffice.
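For illustration, here is a hedged sketch of what such an extension method over IEmbeddingGenerator might look like. The method name GenerateStreamingAsync and the chunkSize parameter are assumptions for this sketch, not an agreed API:

using System.Collections.Generic;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Threading;
using Microsoft.Extensions.AI;

public static class EmbeddingGeneratorStreamingExtensions
{
    // Hypothetical extension method: yields embeddings as they are produced,
    // generating them in fixed-size chunks rather than one large batch.
    public static async IAsyncEnumerable<TEmbedding> GenerateStreamingAsync<TInput, TEmbedding>(
        this IEmbeddingGenerator<TInput, TEmbedding> generator,
        IEnumerable<TInput> values,
        EmbeddingGenerationOptions? options = null,
        int chunkSize = 128,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
        where TEmbedding : Embedding
    {
        foreach (var chunk in values.Chunk(chunkSize))
        {
            // Each chunk is a single call to the underlying generator.
            var embeddings = await generator.GenerateAsync(chunk, options, cancellationToken)
                                            .ConfigureAwait(false);

            foreach (var embedding in embeddings)
            {
                yield return embedding;
            }
        }
    }
}

Because the method is defined once over the interface, every implementation would pick it up without changes, at the cost of the extension method choosing a one-size-fits-all chunking policy.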

Risks

It does make implementations slightly more complex, and existing implementations might simply forward to the GenerateAsync method without leveraging chunking or a similar approach.

stephentoub commented 1 week ago

@SteveSandersonMS, opinions on this?

stephentoub commented 3 hours ago

@luisquintanilla, @SteveSandersonMS, do we want to do anything with this one, or just say it's up to consumers?

SteveSandersonMS commented 3 hours ago

Personally I think we'd want to support streaming if it was inherent to the underlying generator (as it is for chat). But since it isn't inherent, it feels more like a pattern that consumers would apply themselves if they want it. Then it's up to the consumer to decide things like whether to parallelize the chunks.

So I'd vote for not layering on this concept ourselves when it's not inherent to the concept of embedding generation.
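As a sketch of the consumer-side pattern described above, here is one way a caller could chunk and parallelize the work themselves. The helper name, chunk size, and degree of parallelism are illustrative assumptions, not part of any proposed API:

using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;

public static class ConsumerSideChunking
{
    // Hypothetical consumer-side helper: issues GenerateAsync calls chunk by
    // chunk, with a bounded number of chunks in flight at once.
    public static async Task<List<Embedding<float>>> GenerateChunkedAsync(
        IEmbeddingGenerator<string, Embedding<float>> generator,
        IEnumerable<string> values,
        int chunkSize = 128,
        int maxParallelism = 4,
        CancellationToken cancellationToken = default)
    {
        using var throttle = new SemaphoreSlim(maxParallelism);

        var tasks = values.Chunk(chunkSize).Select(async chunk =>
        {
            // Limit how many chunks are generated concurrently.
            await throttle.WaitAsync(cancellationToken).ConfigureAwait(false);
            try
            {
                return await generator.GenerateAsync(chunk, cancellationToken: cancellationToken)
                                      .ConfigureAwait(false);
            }
            finally
            {
                throttle.Release();
            }
        }).ToList();

        var results = await Task.WhenAll(tasks).ConfigureAwait(false);

        // Flatten the per-chunk results back into a single ordered list.
        return results.SelectMany(r => r).ToList();
    }
}

This keeps decisions like chunk size and parallelism with the consumer, which is the trade-off the comment above argues for.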