Closed — Lucky-Vince closed this 4 days ago
This is the JSON data returned by calling the embedding model:

```json
{
  "object": "list",
  "model": "text-embedding-ada-002",
  "data": [
    {
      "index": 0,
      "object": "embedding",
      "embedding": Array[1536]
    }
  ],
  "usage": {
    "prompt_tokens": 1,
    "total_tokens": 1
  }
}
```
Thanks for filing this github issue, @Makato-Sino! @alzimmermsft can you please take a look?
/cc @mssfang as fyi
Thanks for reporting this @Makato-Sino.
After a quick investigation, this appears to have been broken both before and after the change to azure-json. Looking at the logic used to deserialize and serialize EmbeddingItem, it treats the embedding JSON property as a String when the actual definition is a float[].
@mssfang, does the service expect this to be a float[], a Base64-encoded string, or both? If either is possible, this will need to be customized further to support both cases based on the JSON shape seen during deserialization. Serialization is another story, as that information needs to be determined when creating EmbeddingItem; right now the only constructor expects Base64 encoding, so if both are possible we'll need to add another constructor taking the float[]-based format.
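If both shapes are possible, the deserializer could branch on the shape of the embedding value it encounters. The sketch below is stdlib-only and purely illustrative (real azure-json code would branch on JsonToken.START_ARRAY vs. JsonToken.STRING rather than inspecting raw text); the little-endian float32 layout assumed for the Base64 case is an assumption about the wire format:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

// Illustrative sketch: decide how to read "embedding" from its raw JSON value.
class EmbeddingShape {
    public static List<Float> readEmbedding(String rawJsonValue) {
        String v = rawJsonValue.trim();
        if (v.startsWith("[")) {
            // JSON array of numbers: parse each element as a float.
            List<Float> floats = new ArrayList<>();
            String inner = v.substring(1, v.length() - 1).trim();
            if (!inner.isEmpty()) {
                for (String part : inner.split(",")) {
                    floats.add(Float.parseFloat(part.trim()));
                }
            }
            return floats;
        } else if (v.startsWith("\"")) {
            // Base64-encoded string: decode to bytes, then read 4-byte floats
            // (assumes little-endian IEEE 754 float32).
            byte[] bytes = Base64.getDecoder().decode(v.substring(1, v.length() - 1));
            ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
            List<Float> floats = new ArrayList<>();
            while (buf.remaining() >= 4) {
                floats.add(buf.getFloat());
            }
            return floats;
        }
        throw new IllegalStateException("Unexpected JSON shape for embedding: " + v);
    }
}
```

The same decision would drive serialization in reverse: a float[]-backed item writes a JSON array, a Base64-backed item writes a string.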
Upon further investigation, the reason the exception includes END_DOCUMENT is that the attempt to read the array as a string gets the parser into a bad state.
Running into the same issue with beta-10. I can reproduce it pretty easily. I wanted to create an Embeddings object for one of my test cases:
```java
import com.azure.json.JsonProviders;
import com.azure.json.JsonReader;

JsonReader reader = JsonProviders.createReader("""
    {
      "data" : [ {
        "promptIndex" : 0,
        "embedding" : [ ... ],
        "embeddingAsString" : "..."
      } ],
      "usage" : {
        "promptTokens" : 10,
        "totalTokens" : 10
      }
    }""");
```
Calling Embeddings.fromJson(reader) then leads to:
In the implementation, addEncodingFormat hardcodes encoding_format to Base64. The response will therefore always return the embedding as a Base64 String, and we then perform the internal conversion to List<Float>. The JSON response will contain only index and embedding (as a Base64-encoded String). The List<Float> doesn't exist until getEmbedding() is invoked.
We do this because it's more performant for the API to respond with Base64. For example, network transfer with Base64 is text-based and easily transferable over text protocols, while handling floats requires care (binary data converted to a text format) to ensure compatibility with the HTTP protocol.
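For a rough sense of the trade-off, here is a stdlib-only round-trip sketch (illustrative, not the SDK's actual code; little-endian float32 layout is assumed). 1536 float32 values occupy 6144 bytes, which Base64 encodes into a single 8192-character ASCII string:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Base64;

class Base64Embedding {
    // Encode a float[] as Base64 text (little-endian float32), as the service might.
    public static String encode(float[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * 4).order(ByteOrder.LITTLE_ENDIAN);
        for (float v : values) {
            buf.putFloat(v);
        }
        return Base64.getEncoder().encodeToString(buf.array());
    }

    // Decode the Base64 text back into a float[].
    public static float[] decode(String base64) {
        ByteBuffer buf = ByteBuffer.wrap(Base64.getDecoder().decode(base64))
                .order(ByteOrder.LITTLE_ENDIAN);
        float[] values = new float[buf.remaining() / 4];
        for (int i = 0; i < values.length; i++) {
            values[i] = buf.getFloat();
        }
        return values;
    }
}
```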
Hi @Lucky-Vince. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.
As this SDK will never get a float[] from the service side, the embedding will always be expected to be a Base64-encoded String. But since we provide List<Float> getEmbedding() as a convenience method, we could add support to deserialize/serialize float[] in EmbeddingItem's toJson and fromJson methods as a nice-to-have feature.
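The lazy conversion described above (no List<Float> exists until getEmbedding() is called) could look roughly like this sketch; the class and field names are illustrative, not the actual EmbeddingItem source, and little-endian float32 layout is an assumption:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

// Illustrative sketch of the lazy-decode pattern, not the SDK's implementation.
class LazyEmbeddingItem {
    private final String embedding; // Base64-encoded float32 data, as deserialized from JSON
    private List<Float> decoded;    // populated only on the first getEmbedding() call

    public LazyEmbeddingItem(String base64Embedding) {
        this.embedding = base64Embedding;
    }

    public List<Float> getEmbedding() {
        if (decoded == null) {
            byte[] bytes = Base64.getDecoder().decode(embedding);
            ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
            List<Float> floats = new ArrayList<>(buf.remaining() / 4);
            while (buf.remaining() >= 4) {
                floats.add(buf.getFloat());
            }
            decoded = floats;
        }
        return decoded;
    }
}
```

This keeps deserialization cheap (store the string as-is) and pays the decode cost only for callers that actually need the floats.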
Describe the bug: I ran the following sample code locally (azure-open-ai:1.0.0-beta.10) to test the basic functions of EmbeddingModel, but an exception was thrown. After switching to azure-open-ai:1.0.0-beta.8, an illegal character error was reported instead.
Exception or Stack Trace azure-open-ai:1.0.0-beta.10: azure-open-ai:1.0.0-beta.8:
To Reproduce Steps to reproduce the behavior:
Code Snippet
Setup (please complete the following information):
If you suspect a dependency version mismatch (e.g. you see NoClassDefFoundError, NoSuchMethodError or similar), please check out the Troubleshoot dependency version conflict article first. If it doesn't provide a solution for the problem, please provide the output of mvn dependency:tree -Dverbose.