[BUG] graphql queries become slow as data increases

xwang-otterai commented 3 months ago

Describe the bug I've noticed that GraphQL queries are becoming slower as the amount of data increases

project page can't load in five minutes and got 504 timeout

To Reproduce ingest large amount of data

Expected behavior phoenix project page should load under 1minute Screenshots If applicable, add screenshots to help explain your problem.

Environment (please complete the following information):

OS: MacOS
Notebook Runtime: phoenix docker images deployed on ECS
Browser chrome
Version [e.g. 0.0.1]

Additional context Add any other context about the problem here (e.x. a link to a colab)

xwang-otterai commented 3 months ago

mikeldking commented 3 months ago

@xwang-otterai thanks for the ticket. Are you using postgres or sqlite?

xwang-otterai commented 3 months ago

@mikeldking postgres

xwang-otterai commented 3 months ago

@mikeldking This query on average takes 110 seconds

SELECT traces.project_rowid, sum(CAST(spans.attributes #>> $1 AS FLOAT)) AS prompt, sum(CAST(spans.attributes #>> $2 AS FLOAT)) AS completion, coalesce(sum(CAST((spans.attributes #>> $1) AS FLOAT)), $3::INTEGER) + coalesce(sum(CAST((spans.attributes #>> $2) AS FLOAT)), $4::INTEGER) AS total 
FROM traces JOIN spans ON traces.id = spans.trace_rowid 
WHERE spans.start_time >= $5::TIMESTAMP WITH TIME ZONE AND spans.start_time < $6::TIMESTAMP WITH TIME ZONE AND traces.project_rowid IN ($7::INTEGER) G

mikeldking commented 3 months ago

@mikeldking This query on average takes 110 seconds


SELECT traces.project_rowid, sum(CAST(spans.attributes #>> $1 AS FLOAT)) AS prompt, sum(CAST(spans.attributes #>> $2 AS FLOAT)) AS completion, coalesce(sum(CAST((spans.attributes #>> $1) AS FLOAT)), $3::INTEGER) + coalesce(sum(CAST((spans.attributes #>> $2) AS FLOAT)), $4::INTEGER) AS total 

FROM traces JOIN spans ON traces.id = spans.trace_rowid 

WHERE spans.start_time >= $5::TIMESTAMP WITH TIME ZONE AND spans.start_time < $6::TIMESTAMP WITH TIME ZONE AND traces.project_rowid IN ($7::INTEGER) G

Thanks for this @xwang-otterai !

axiomofjoy commented 3 months ago

Thanks for reporting @xwang-otterai. It would help to know more about when the size of your data. If you have database access and are able to execute queries, the following information would be helpful.

How many projects do you have?

select count(*) from projects;

How many traces and spans?

select count(*) from traces;

select count(*) from spans;

How long are your attributes?

SELECT
    percentile_cont(0.01) WITHIN GROUP (ORDER BY attributes_length) AS p1,
    percentile_cont(0.05) WITHIN GROUP (ORDER BY attributes_length) AS p5,
    percentile_cont(0.10) WITHIN GROUP (ORDER BY attributes_length) AS p10,
    percentile_cont(0.25) WITHIN GROUP (ORDER BY attributes_length) AS p25,
    percentile_cont(0.50) WITHIN GROUP (ORDER BY attributes_length) AS p50,
    percentile_cont(0.75) WITHIN GROUP (ORDER BY attributes_length) AS p75,
    percentile_cont(0.90) WITHIN GROUP (ORDER BY attributes_length) AS p90,
    percentile_cont(0.95) WITHIN GROUP (ORDER BY attributes_length) AS p95,
    percentile_cont(0.99) WITHIN GROUP (ORDER BY attributes_length) AS p99
FROM (
    SELECT LENGTH(attributes::text) AS attributes_length
    FROM spans
    ORDER BY random()
    LIMIT 1000
) AS attributes_subquery;

axiomofjoy commented 3 months ago

Some improvements in arize-phoenix 4.20.0, more to come soon.

axiomofjoy commented 2 months ago

Project Page Performance

✅ Ensure we are using dataloaders ✅ Ensure project page query lazily loads data for spans table (since only the traces table is displayed when you first hit the page) ✅ Reduce page size from 100 to 50 spans.

With the above, the initial project page query is down to a borderline acceptable time of ~6 seconds on my long context fixture:

50 root spans + descendants = ~550 spans total
~50 MB response

Most of this time is due to querying attributes. To test, I changed the query to not pull attributes, and the time drops to ~0.3 seconds.

Low-hanging fruit is gone. Way to improve performance further:

Lazily load inputs, outputs, and metadata with dataloaders and use suspense for these columns in the tables. Doesn't make the load faster, but makes it less noticeable to the user.
Return less data when loading the tables. Instead of pulling full attributes when loading the tables, just pull the metadata and the first 200 characters of the input and output to display a preview. Full inputs, outputs, and attributes would be loaded when the user views span details via resolvers on the Span type for input, output and metadata.
Load descendants only when the user explodes traces. Reduce the page size when the traces are exploded.
Add a metadata column and input_preview and output_preview columns containing the first few hundred characters of the input and output to the spans table.

axiomofjoy commented 2 months ago

If anyone is still experiencing issues with large numbers of traces or long-context traces on the latest version of Phoenix, please let us know and re-open this issue. There are further improvements we can make.

Arize-ai / phoenix

[BUG] graphql queries become slow as data increases #4006