Closed: mkhludnev closed this issue 5 months ago.
Attempting a no-brainer: I added the input and output values into phoenix.trace.otel._JSON_STRING_ATTRIBUTES, but it causes a problem with GraphQL rendering:
```
\graphql\type\scalars.py", line 177, in serialize_string
    raise GraphQLError("String cannot represent value: " + inspect(output_value))
graphql.error.graphql_error.GraphQLError: String cannot represent value: {'question': 'naïve façade café', 'chat_history': [], 'context': '....
```
Perhaps decoding should detect OpenInferenceMimeTypeValues.JSON?
That's what we have in decode() as otlp_span.attributes:
```
[key: "openinference.span.kind"
 value {
   string_value: "CHAIN"
 },
 key: "input.value"
 value {
   string_value: "{\"chat_history\": [], \"question\": \"na\\u00efve fa\\u00e7ade caf\\u00e9\"}"
 },
 key: "input.mime_type"
 value {
   string_value: "application/json"
 },
 key: "output.value"
 value {
   string_value: "naïve façade café"
 }]
```
I think _load_json_strings should respect input.mime_type and output.mime_type.
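A minimal sketch of what such mime-type-aware decoding could look like. The attribute keys follow the dump above, but the helper name and its exact shape are assumptions, not the actual Phoenix internals:

```python
import json

# Hypothetical helper: only parse "<prefix>.value" as JSON when the matching
# "<prefix>.mime_type" attribute declares a JSON payload.
JSON_MIME_TYPE = "application/json"

def load_json_values(attributes: dict) -> dict:
    decoded = dict(attributes)
    for prefix in ("input", "output"):
        value_key = f"{prefix}.value"
        mime_key = f"{prefix}.mime_type"
        if decoded.get(mime_key) == JSON_MIME_TYPE and isinstance(decoded.get(value_key), str):
            decoded[value_key] = json.loads(decoded[value_key])
    return decoded

attrs = {
    "input.value": '{"chat_history": [], "question": "na\\u00efve fa\\u00e7ade caf\\u00e9"}',
    "input.mime_type": "application/json",
    "output.value": "naïve façade café",
}
decoded = load_json_values(attrs)
# input.value becomes a dict with the original diacritics restored;
# output.value (plain text, no mime type) is left untouched.
```

With this approach a plain-text output.value is never accidentally run through json.loads, which is exactly the distinction the mime-type attribute exists to make.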
@mkhludnev Thanks for pointing this out! It looks like the issue is that diacritic symbols are being serialized to a non-readable format in JSON, which we're then displaying in the UI. We'll need to look into other ways to serialize. Alternatively, we've talked about parsing the JSON and displaying the individual message in the input field for LLM spans instead of just showing the JSON object itself.
Thanks for the thorough investigation!
@mkhludnev In the screenshot you sent, it looks like you are hitting this issue with retriever and LLM span kinds. Any others? Also generic chain span kinds?
Stepping through decode() in otel.py, Span.attributes seems to be decoded well:
```
Span.attributes =
{'output.value': 'naïve façade café', 'input.mime_type': <MimeType.JSON: 'application/json'>, 'input.value': '{"chat_history": [], "question": "na\\u00efve fa\\u00e7ade caf\\u00e9"}'}
```
I suppose consumers should just check input.mime_type/output.mime_type and interpret the value accordingly. Thus it might be a UI issue, and honestly it's not that important to me. I encounter this problem in dataframes for Evaluations, so for now it seems like a clue to check whether phoenix.session.evaluation.get_qa_with_reference / get_retrieved_documents honors the mime type.
Do you have a code snippet you're able to share, or do you hit the issue anytime you have diacritics in the input, output, and documents?
> hit the issue anytime you have diacritics in the input, output, and documents?
I do. It reproduces every time.
get_retrieved_documents() returns a non-empty DataFrame, and it seems OK. get_qa_with_reference() returns an empty dataframe: it joins/concats two non-empty span frames and gets an empty join result. Nevertheless, so far it seems like a minor UI bug.
There might be a solution where we make json.dumps use ensure_ascii=False. https://docs.python.org/3/library/json.html#basic-usage
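For reference, the standard library's json.dumps escapes all non-ASCII characters by default, while ensure_ascii=False keeps them readable:

```python
import json

d = {"question": "naïve façade café"}

# Default (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
print(json.dumps(d))                      # {"question": "na\u00efve fa\u00e7ade caf\u00e9"}

# ensure_ascii=False keeps the original characters in the serialized string.
print(json.dumps(d, ensure_ascii=False))  # {"question": "naïve façade café"}
```

Both forms are valid JSON and decode to the same dict; the difference is only in how the serialized string looks when displayed verbatim.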
This issue has been resolved in the latest versions of our instrumentation libraries.
openinference-instrumentation-openai==0.1.6
openinference-instrumentation-llama-index==1.4.1
openinference-instrumentation-dspy==0.1.9
openinference-instrumentation-langchain==0.1.16
openinference-instrumentation-bedrock==0.1.6
openinference-instrumentation-mistralai==0.0.7
Describe the bug
Letters with diacritic marks are garbled in some traces.
To Reproduce
Expected behavior
In the traces table I want to see the original letters as-is: "naïve façade café". Actually, some of the letters are garbled by json.dumps(d). As you can see below, sometimes it's broken in the input column and sometimes in the output. Also interesting: the details screen is always correct.
Screenshots
Additional context
jQuery response for the traces table: jquery-trace-dump.json
I noticed that dictionaries in the input and output attributes are encoded as JSON; however, input and output are not decoded from JSON.
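This is consistent with the round-trip behavior of the standard library: the \uXXXX-escaped form is lossless, so any consumer that actually parses the JSON gets the original characters back, which would explain why the details screen renders correctly while views showing the raw serialized string do not. A small illustration:

```python
import json

original = {"question": "naïve façade café"}

# Serializing with the default ensure_ascii=True garbles the *display*...
raw = json.dumps(original)
print(raw)  # {"question": "na\u00efve fa\u00e7ade caf\u00e9"}

# ...but no information is lost: decoding restores the diacritics exactly.
assert json.loads(raw) == original
```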