dotnet / runtime

[API Proposal]: Add Vector Embedding Type #102669

Open ShivangiReja opened 4 months ago

ShivangiReja commented 4 months ago

Background and motivation

Currently, AI libraries in the .NET ecosystem (e.g., OpenAI and Azure AI Search) use ReadOnlyMemory<float> to represent embedding vectors. However, embeddings can also use narrower element types such as int8, int16, or float16, which consume less memory and so reduce cost and improve performance. This proposal introduces a versatile container for embeddings that can handle various element types, enabling more efficient memory usage and broader interoperability among services (e.g., retrieving vectors from a service like OpenAI and storing them in a vector database like Azure AI Search).
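To make the memory argument concrete, here is a rough sketch of the raw payload size of one embedding at different element widths. The helper name `EmbeddingFootprint.BytesPerVector` is hypothetical and not part of the proposal, and the figure of 1536 dimensions used below assumes the output size of text-embedding-ada-002.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical helper for illustration only; not part of the proposed API.
public static class EmbeddingFootprint
{
    // Raw payload size of one vector in bytes, ignoring array/object overhead.
    public static long BytesPerVector<T>(int dimensions) where T : unmanaged
        => (long)dimensions * Unsafe.SizeOf<T>();
}
```

For a 1536-dimensional vector, `BytesPerVector<float>(1536)` is 6144 bytes, `BytesPerVector<Half>(1536)` is 3072 bytes, and `BytesPerVector<sbyte>(1536)` is 1536 bytes, so an int8 representation is a quarter of the float32 size before any index overhead is counted.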

API Proposal

// package: ?
namespace System.Numerics; // another option: System.AI

public abstract class EmbeddingVector
{
    public virtual EmbeddingVector<T> To<T>();

    public static EmbeddingVector FromJson(ReadOnlyMemory<byte> utf8Json);
    public static EmbeddingVector FromBase64(ReadOnlyMemory<byte> utf8Base64);
    public static EmbeddingVector<T> FromScalars<T>(ReadOnlyMemory<T> scalars);

    // possible additions:
    // public string ModelName { get; protected set; }
    // public string Precision { get; protected set; }
    // public abstract int Length { get; }

    public abstract void Write(Stream stream, string format);
}

public sealed class EmbeddingVector<T> : EmbeddingVector
{
    public EmbeddingVector(ReadOnlyMemory<T> scalars);
    public ReadOnlyMemory<T> Scalars { get; } 
}
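The proposal lists signatures only. As a minimal sketch of what FromJson followed by To<T> might do internally, the following parses a UTF-8 JSON array into floats and narrows them to Half. The class `EmbeddingVectorSketch` and its methods `ParseJson` and `ToHalf` are hypothetical stand-ins, and a real implementation would likely avoid the intermediate allocations.

```csharp
using System;
using System.Text.Json;

// Hypothetical stand-in for the proposed type; names are assumptions.
public static class EmbeddingVectorSketch
{
    // Parses a UTF-8 JSON array such as [0.5,-1.25,2] into float scalars.
    public static ReadOnlyMemory<float> ParseJson(ReadOnlySpan<byte> utf8Json)
        => JsonSerializer.Deserialize<float[]>(utf8Json) ?? Array.Empty<float>();

    // Narrows float32 scalars to float16; values may lose precision.
    public static ReadOnlyMemory<Half> ToHalf(ReadOnlyMemory<float> scalars)
    {
        ReadOnlySpan<float> source = scalars.Span;
        Half[] result = new Half[source.Length];
        for (int i = 0; i < source.Length; i++)
        {
            result[i] = (Half)source[i];
        }
        return result;
    }
}
```

Narrowing is lossy in general, which is one reason the possible Precision property mentioned in the proposal could be useful to callers.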

API Usage

EmbeddingVector vector = EmbeddingVector.FromJson("[-0.0026168018,-0.024089903,0.03355637]"u8.ToArray());
EmbeddingVector<float> floats = vector.To<float>();
foreach (float scalar in floats.Scalars.Span)
{
    Console.WriteLine(scalar);
}

Here's how we can use it with OpenAI, which returns a base64-encoded string:

EmbeddingClient client = new("text-embedding-ada-002", Environment.GetEnvironmentVariable("OPENAI-API-KEY"));
ClientResult<Embedding> response = client.GenerateEmbedding("Top hotel in town");
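The snippet above stops at the service call. As a sketch of the decoding step that something like FromBase64 would need to perform, the following assumes the base64 payload is a packed array of little-endian float32 values. The class `Base64EmbeddingSketch` and its method `DecodeFloat32` are hypothetical, and this version takes a string rather than the UTF-8 bytes used in the proposed signature.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper; assumes the payload is packed little-endian float32.
public static class Base64EmbeddingSketch
{
    public static float[] DecodeFloat32(string base64)
    {
        byte[] bytes = Convert.FromBase64String(base64);
        if (bytes.Length % sizeof(float) != 0)
        {
            throw new FormatException("Payload length is not a multiple of 4 bytes.");
        }

        float[] scalars = new float[bytes.Length / sizeof(float)];
        for (int i = 0; i < scalars.Length; i++)
        {
            // Read explicitly as little-endian so the result does not depend on host byte order.
            scalars[i] = BinaryPrimitives.ReadSingleLittleEndian(
                bytes.AsSpan(i * sizeof(float), sizeof(float)));
        }
        return scalars;
    }
}
```

The resulting array could then be passed to the proposed EmbeddingVector.FromScalars<float>, assuming that method is adopted as written.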

And here's how we can use it with Azure AI Search, which returns a JSON array:

// Get embedding from OpenAI
EmbeddingClient client = new("text-embedding-ada-002", Environment.GetEnvironmentVariable("OPENAI-API-KEY"));
Embedding embedding = client.GenerateEmbedding("Top hotel in town");
EmbeddingVector vector = embedding.Vector;

// Call Azure AI Search passing in the vector
Uri endpoint = new(Environment.GetEnvironmentVariable("SEARCH_ENDPOINT"));
AzureKeyCredential credential = new AzureKeyCredential(Environment.GetEnvironmentVariable("SEARCH_API_KEY"));
SearchClient searchClient = new SearchClient(endpoint, "mysearchindex", credential);

Response<SearchResults<Hotel>> response = searchClient.Search<Hotel>(
        new SearchOptions
        {
            VectorSearch = new()
            {
                Queries = { new VectorizedQuery(vector) { KNearestNeighborsCount = 3, Fields = { "DescriptionVector" } } }
            }
        });

For end-to-end working examples, please see: EmbeddingType/Program.cs

Alternative Designs

No response

Risks

No response

Discussion Points

dotnet-policy-service[bot] commented 4 months ago

Tagging subscribers to this area: @dotnet/area-system-numerics See info in area-owners.md if you want to be subscribed.

annelo-msft commented 4 months ago

I'm interested in whether we'd consider an alternate name for the type: VectorEmbedding instead of EmbeddingVector. My understanding is that there are multiple meanings for "embeddings" in the deep-learning space, so "embedding" here is a noun and "vector" is the adjective that differentiates which type of embedding you're referring to.

As a higher-level point, as we add types to the BCL that map to concepts in AI/ML, it feels like there's value in having their names align well with the terminology used for the same concepts. One benefit of this might be that folks searching online for more information about what the type represents would more easily land on documentation from the broader ML community. For this one, I found many articles referencing "vector embeddings" but few on "embedding vectors."

Wraith2 commented 4 months ago

Is there a reason to have this in the BCL rather than a standalone library?

Giorgi commented 4 months ago

Why not make it part of SmartComponents.LocalEmbeddings?