Azure / azure-sdk-for-java

This repository is for active development of the Azure SDK for Java. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/java/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-java.
MIT License
2.29k stars 1.96k forks source link

[BUG] azure-json throws an illegal state exception when running the GetEmbeddingSample code #41159

Closed Lucky-Vince closed 4 days ago

Lucky-Vince commented 1 month ago

Describe the bug I ran the following sample code locally (azure-open-ai:1.0.0-beta.10) to test the basic functions of EmbeddingModel, but an exception was thrown. After switching to azure-open-ai:1.0.0-beta.8, an illegal character error was prompted.

Exception or Stack Trace azure-open-ai:1.0.0-beta.10: image azure-open-ai:1.0.0-beta.8: image

To Reproduce Steps to reproduce the behavior:

Code Snippet

public static void main(String[] args) {
    OpenAIClient client = new AzureService().buildClient();

    EmbeddingsOptions embeddingsOptions = new EmbeddingsOptions((Arrays.asList("hi")));

    Embeddings embeddings = client.getEmbeddings("TEXT-EMBEDDING-ADA-002", embeddingsOptions);

    for (EmbeddingItem item : embeddings.getData()) {
        System.out.printf("Index: %d.%n", item.getPromptIndex());
        System.out.println("Embedding as base64 encoded string: " +  item.getEmbeddingAsString());
        System.out.println("Embedding as list of floats: ");
        for (Float embedding : item.getEmbedding()) {
            System.out.printf("%f;", embedding);
        }
    }

    EmbeddingsUsage usage = embeddings.getUsage();
    System.out.printf(
            "Usage: number of prompt token is %d and number of total tokens in request and response is %d.%n",
            usage.getPromptTokens(), usage.getTotalTokens());
}

Expected behavior A clear and concise description of what you expected to happen.

Screenshots If applicable, add screenshots to help explain your problem.

Setup (please complete the following information):

If you suspect a dependency version mismatch (e.g. you see NoClassDefFoundError, NoSuchMethodError or similar), please check out Troubleshoot dependency version conflict article first. If it doesn't provide solution for the problem, please provide:

Lucky-Vince commented 1 month ago

This is the json data returned by calling the embedding model:

{
    "object": "list",
    "model": "text-embedding-ada-002",
    "data": [
        {
            "index": 0,
            "object": "embedding",
            "embedding": Array[1536]
        }
    ],
    "usage": {
        "prompt_tokens": 1,
        "total_tokens": 1
    }
}
joshfree commented 1 month ago

Thanks for filing this github issue, @Makato-Sino! @alzimmermsft can you please take a look?

/cc @mssfang as fyi

alzimmermsft commented 1 month ago

Thanks for reporting this @Makato-Sino.

After a quick investigation this appears to have been broken before and after the change to azure-json. Looking at the logic used to deserialize and serialize EmbeddingItem it looks to treat the embedding JSON property as a String when the actual definition is a float[].

@mssfang does the service expect this to be a float[] or Base64 encoded string, or both? If the possibility is either, this will need to be customized further to support both cases based on the JSON shape being seen in JSON deserialization. Serialization is another story as that information needs to be determined when creating EmbeddingItem, and right now the only constructor expects Base64 encoding, if both are possible we'll need to add another constructor taking the float[]-based format.

alzimmermsft commented 1 month ago

Upon further investigation, the reason the exception includes END_DOCUMENT is that the attempt to read the array as a string gets the parser into a bad state.

mkemmerz commented 2 weeks ago

Running into the same issue with the beta-10. I can reproduce it pretty easily. I wanted to create an Embeddings-object for one of my test cases:

JsonReader reader = JsonProviders.createReader("""
        {
        "data" : [ {
          "promptIndex" : 0,
          "embedding" : [ ... ],
          "embeddingAsString" : "..."
        } ],
        "usage" : {
          "promptTokens" : 10,
          "totalTokens" : 10
        }
      }""");

Embeddings.fromJson(reader)

leads to:

image
mssfang commented 1 week ago

In the implementation, addEncodingFormat hardcoded encoding_format to Base64. The return response will always return the embedding as a Base64 String, and then we perform the internal conversion to List<Float>. The JSON response will contain only index and embedding(as base64 encoded String).

List<Float> doesn't exist until getEmbedding() is invoked.

We do this as it's more performant for the API to response with Base64, For example, network transfer with Base64 is text-based and easily transferable over text protocols. Handling floats requires careful handling (binary data is converted to a text format) to ensure combability with the HTTP protocol.

github-actions[bot] commented 1 week ago

Hi @Lucky-Vince. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

mssfang commented 6 days ago

As this SDK will never get the float[] from service side, the embedding will always expect to be a base64 encoded String. But we provide Lst<Float> getEmbedding() as the convenience method, we could add the support to deserialize/serialize float[] in EmbeddingItem's toJson and fromJson method as a nice to have feature.