Arize-ai / phoenix

AI Observability & Evaluation
https://docs.arize.com/phoenix

[BUG] DateTime normalization bug in to_datetime #3617

Open Nabeegh-Ahmed opened 4 weeks ago

Nabeegh-Ahmed commented 4 weeks ago

Describe the bug
When constructing a TraceDataset, the start_time normalization fails with an error:

ValueError: time data "2024-06-21 18:23:34" doesn't match format "%Y-%m-%d %H:%M:%S.%f"

The time is stored as-is via OTel, with no changes. I have tracked the error down to normalize_timestamps in datetime_utils.py:52, which uses the to_datetime utility from pandas. The failure essentially arises when the passed datetime is an object rather than a string or timestamp.
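
As a minimal illustration (this is an assumption about the mechanism, not the exact Phoenix code path): pandas >= 2.0 infers the datetime format from the first element of a column, so a column mixing microsecond- and second-precision strings fails with exactly this kind of message:

import pandas as pd

# The first element has fractional seconds, so pandas infers "%Y-%m-%d %H:%M:%S.%f"
# and then fails on the second element, which has none.
timestamps = pd.Series([
    "2024-06-21 18:23:34.801",
    "2024-06-21 18:23:34",
])
pd.to_datetime(timestamps)  # ValueError: time data "2024-06-21 18:23:34" doesn't match format "%Y-%m-%d %H:%M:%S.%f"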

To Reproduce
Steps to reproduce the behavior:

  1. Fetch spans from the database and convert them to a list.
  2. Convert the list to a TraceDataset as follows:

    TraceDataset(
        json_lines_to_df([json.dumps(process_span(span)) for span in spans])
    )

    where process_span looks like (see also the note after these steps):

    def process_span(trace: dict):
        trace["start_time"] = str(trace["start_time"])
        trace["end_time"] = str(trace["end_time"])
        for event in trace["events"]:
            event["timestamp"] = str(event["timestamp"])
        return trace
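
A side observation about the repro code above (not about Phoenix internals): str() on a Python datetime includes the fractional part only when the microsecond field is non-zero, so a column built this way can end up with mixed precision:

from datetime import datetime, timezone

# str() keeps microseconds only when they are non-zero
print(str(datetime(2024, 6, 21, 18, 23, 34, 801000, tzinfo=timezone.utc)))
# 2024-06-21 18:23:34.801000+00:00
print(str(datetime(2024, 6, 21, 18, 23, 34, tzinfo=timezone.utc)))
# 2024-06-21 18:23:34+00:00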

Expected behavior
The dataset should be constructed without any errors.

Screenshots
[screenshot attached]

Even though the screenshot above shows the date as 2024-06-21 18:23:34, the date actually passed in is 2024-06-21T18:23:34.801+00:00, indicating a failure in the date conversion.

Environment (please complete the following information):

Additional context
I was able to find an easy fix for this issue and am opening a PR for it. Keeping this issue open for discussion and tracking.
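
Purely as a sketch of what such a fix could look like (not necessarily what the PR does), on pandas >= 2.0 a hypothetical helper could let to_datetime parse each element independently instead of applying a single inferred format:

import pandas as pd

def to_datetime_tolerant(series: pd.Series) -> pd.Series:
    # Hypothetical sketch: format="mixed" parses each element on its own,
    # so second- and microsecond-precision values can coexist in one column.
    return pd.to_datetime(series, format="mixed", utc=True)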

mikeldking commented 3 weeks ago

@Nabeegh-Ahmed totally understand your needs. The thing I worry about here is that storing traces outside of Phoenix is not our intended design, so I'm a bit wary of code that breaks that assumption. Is it possible for you to just use %Y-%m-%d %H:%M:%S.%f when ingesting into Phoenix for now?
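
For reference, applying that suggested format consistently on the caller's side could look roughly like the sketch below (written against the process_span from the repro above and assuming the stored values are datetime objects; this is not an official Phoenix API):

def process_span(span: dict) -> dict:
    # Render every timestamp with the exact format mentioned above
    # so the resulting column never mixes precisions.
    fmt = "%Y-%m-%d %H:%M:%S.%f"
    span["start_time"] = span["start_time"].strftime(fmt)
    span["end_time"] = span["end_time"].strftime(fmt)
    for event in span.get("events", []):
        event["timestamp"] = event["timestamp"].strftime(fmt)
    return span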

Nabeegh-Ahmed commented 3 weeks ago

@mikeldking I have not really made changes to how openinference handles dates. I am using _decode_unix_nano and just storing the result obtained from it. I can make do with %Y-%m-%d %H:%M:%S.%f for now, but thinking about it, since I didn't change the time-handling logic, this error should not have occurred.

axiomofjoy commented 3 weeks ago

@Nabeegh-Ahmed Can you send a snippet for reproducing step 1?

Fetch spans from database and convert them to a list

Nabeegh-Ahmed commented 3 weeks ago

@axiomofjoy I have a MongoDB database where I am storing traces. I am doing it exactly like Phoenix does it (otel.py). Here's how I do step 1:

traces = (
    get_mongo_db()
    .get_collection("spans")
    .find(
        {
            "start_time": {"$gte": start_time},
            "end_time": {"$lte": end_time},
        }
        | ({"project_name": project_name} if project_name is not None else {})
    )
)
traces = list(traces)
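
Worth noting when reading the snippet above: pymongo decodes BSON dates back into Python datetime objects (naive UTC unless the client is created with tz_aware=True), so the values later stringified in process_span are datetimes, not strings. A tiny illustration with a hypothetical connection string and collection names:

from pymongo import MongoClient

# pymongo returns BSON dates as datetime.datetime; tz_aware=True makes them timezone-aware
client = MongoClient("mongodb://localhost:27017", tz_aware=True)
doc = client["mydb"]["spans"].find_one()
print(type(doc["start_time"]))  # <class 'datetime.datetime'>
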
axiomofjoy commented 2 weeks ago

Hey @Nabeegh-Ahmed, I am not able to reproduce this issue on the latest main. Can you upgrade to the latest version of Phoenix with pip install -U "arize-phoenix==4.5.0" and try out the following script? (The script assumes you already have trace data in Phoenix.)

from phoenix.trace.trace_dataset import TraceDataset
from phoenix.session.client import Client

client = Client()
spans_df = client.query_spans()
# the script works for me regardless of whether the next two lines are commented out
spans_df.start_time = spans_df.start_time.map(
    lambda dt: dt.strftime("%Y-%m-%d %H:%M:%S")
)
spans_df.end_time = spans_df.end_time.map(lambda dt: dt.strftime("%Y-%m-%d %H:%M:%S"))
print(f"{spans_df.start_time.map(type).unique()=}")
print(f"{spans_df.end_time.map(type).unique()=}")
print(f"{spans_df.start_time=}")
print(f"{spans_df.end_time=}")
trace_ds = TraceDataset(spans_df)
print(f"{trace_ds=}")
axiomofjoy commented 2 weeks ago

The script also appears to work for me when using ISO 8601 formatted timestamps.

from phoenix.trace.trace_dataset import TraceDataset
from phoenix.session.client import Client

client = Client()
spans_df = client.query_spans()
spans_df.start_time = spans_df.start_time.map(lambda dt: dt.isoformat())
spans_df.end_time = spans_df.end_time.map(lambda dt: dt.isoformat())
print(f"{spans_df.start_time.map(type).unique()=}")
print(f"{spans_df.end_time.map(type).unique()=}")
print(f"{spans_df.start_time=}")
print(f"{spans_df.end_time=}")
trace_ds = TraceDataset(spans_df)
print(f"{trace_ds=}")