Arize-ai / phoenix

AI Observability & Evaluation
https://docs.arize.com/phoenix
Other
3.5k stars 259 forks source link

[BUG] graphql queries become slow as data increases #4006

Closed xwang-otterai closed 1 month ago

xwang-otterai commented 2 months ago

Describe the bug I've noticed that GraphQL queries are becoming slower as the amount of data increases

project page can't load in five minutes and got 504 timeout

To Reproduce ingest large amount of data

Expected behavior phoenix project page should load under 1minute Screenshots If applicable, add screenshots to help explain your problem.

Environment (please complete the following information):

Additional context Add any other context about the problem here (e.x. a link to a colab)

xwang-otterai commented 2 months ago

related: https://github.com/Arize-ai/phoenix/issues/3917

mikeldking commented 2 months ago

@xwang-otterai thanks for the ticket. Are you using postgres or sqlite?

xwang-otterai commented 2 months ago

@mikeldking postgres

xwang-otterai commented 2 months ago

@mikeldking This query on average takes 110 seconds

SELECT traces.project_rowid, sum(CAST(spans.attributes #>> $1 AS FLOAT)) AS prompt, sum(CAST(spans.attributes #>> $2 AS FLOAT)) AS completion, coalesce(sum(CAST((spans.attributes #>> $1) AS FLOAT)), $3::INTEGER) + coalesce(sum(CAST((spans.attributes #>> $2) AS FLOAT)), $4::INTEGER) AS total 
FROM traces JOIN spans ON traces.id = spans.trace_rowid 
WHERE spans.start_time >= $5::TIMESTAMP WITH TIME ZONE AND spans.start_time < $6::TIMESTAMP WITH TIME ZONE AND traces.project_rowid IN ($7::INTEGER) G
mikeldking commented 1 month ago

@mikeldking This query on average takes 110 seconds


SELECT traces.project_rowid, sum(CAST(spans.attributes #>> $1 AS FLOAT)) AS prompt, sum(CAST(spans.attributes #>> $2 AS FLOAT)) AS completion, coalesce(sum(CAST((spans.attributes #>> $1) AS FLOAT)), $3::INTEGER) + coalesce(sum(CAST((spans.attributes #>> $2) AS FLOAT)), $4::INTEGER) AS total 

FROM traces JOIN spans ON traces.id = spans.trace_rowid 

WHERE spans.start_time >= $5::TIMESTAMP WITH TIME ZONE AND spans.start_time < $6::TIMESTAMP WITH TIME ZONE AND traces.project_rowid IN ($7::INTEGER) G

Thanks for this @xwang-otterai !

axiomofjoy commented 1 month ago

Thanks for reporting @xwang-otterai. It would help to know more about when the size of your data. If you have database access and are able to execute queries, the following information would be helpful.

select count(*) from projects;
select count(*) from traces;
select count(*) from spans;
SELECT
    percentile_cont(0.01) WITHIN GROUP (ORDER BY attributes_length) AS p1,
    percentile_cont(0.05) WITHIN GROUP (ORDER BY attributes_length) AS p5,
    percentile_cont(0.10) WITHIN GROUP (ORDER BY attributes_length) AS p10,
    percentile_cont(0.25) WITHIN GROUP (ORDER BY attributes_length) AS p25,
    percentile_cont(0.50) WITHIN GROUP (ORDER BY attributes_length) AS p50,
    percentile_cont(0.75) WITHIN GROUP (ORDER BY attributes_length) AS p75,
    percentile_cont(0.90) WITHIN GROUP (ORDER BY attributes_length) AS p90,
    percentile_cont(0.95) WITHIN GROUP (ORDER BY attributes_length) AS p95,
    percentile_cont(0.99) WITHIN GROUP (ORDER BY attributes_length) AS p99
FROM (
    SELECT LENGTH(attributes::text) AS attributes_length
    FROM spans
    ORDER BY random()
    LIMIT 1000
) AS attributes_subquery;
axiomofjoy commented 1 month ago

Some improvements in arize-phoenix 4.20.0, more to come soon.

axiomofjoy commented 1 month ago

Project Page Performance

✅ Ensure we are using dataloaders ✅ Ensure project page query lazily loads data for spans table (since only the traces table is displayed when you first hit the page) ✅ Reduce page size from 100 to 50 spans.

With the above, the initial project page query is down to a borderline acceptable time of ~6 seconds on my long context fixture:

Most of this time is due to querying attributes. To test, I changed the query to not pull attributes, and the time drops to ~0.3 seconds.

Low-hanging fruit is gone. Way to improve performance further:

axiomofjoy commented 1 month ago

If anyone is still experiencing issues with large numbers of traces or long-context traces on the latest version of Phoenix, please let us know and re-open this issue. There are further improvements we can make.