Performance Issue: Slow Query Execution Despite Using `LIMIT`

fredrikIOT commented 10 months ago

What type of bug is this?

Performance issue

What subsystems are affected?

Standalone mode

Minimal reproduce step

Populate a test database with data simulating 20000 sensors across 50 locations, with each location having its own table and each sensor having its own tag.
Insert data at a rate of 20000 data points per second (400 for each table), spread over 50 tables.
Run this simulation continuously for a few days.
Execute the query SELECT * FROM "location1" ORDER BY ts DESC LIMIT 100;.

What did you expect to see?

Expected efficient use of LIMIT and OFFSET, where the query execution should be quick, only processing the first 100 rows.

What did you see instead?

The query execution is slow, taking multiple seconds (8634 ms) despite using LIMIT. It seems that the LIMIT clause does not stop the query processing after reaching the limit but only restricts the number of rows returned.

Performance Metrics:

When running SELECT * FROM "location1" ORDER BY ts DESC LIMIT 100;. Metrics:

Returned	Metrics
Selected 100 rows	in 8634 ms

For comparison: When query the count of the table using SELECT count(*) FROM "location1";:

Returned	Metrics	Number of rows (result of the query)
Selected 1 rows	in 2333 ms	181459600

SQL Analyze:

EXPLAIN ANALYZE 
SELECT * FROM "location1" limit 5 offset 1

Result:	plan_type	plan
Plan with Metrics	MergeScanExec: peers=[4475355922432(1042, 0), ], metrics=[output_rows=100, ready_time=87.566768ms, finish_time=4.524580126s, first_consume_time=4.524555159s]

What operating system did you use?

Docker (Ubuntu)

What version of GreptimeDB did you use?

docker image: greptime/greptimedb:nightly-20240126-8bade8f8e4

Relevant log output and stack trace

No response

waynexia commented 10 months ago

Hi @fredrikIOT, appreciate your detailed feedback!

I noticed you also run SELECT * FROM "location1" limit 5 offset 1 and it's slower than SELECT count(*) FROM "location1"; I'm wondering if it is reproducible? And how large is the result set? According to the analyze result it has output_rows=100 100 rows. Would the SQL SELECT * FROM "location1" limit 5 offset 1 return 100 rows (without the ANALYZE) or 5 rows

MichaelScofield commented 10 months ago

@fredrikIOT Can you share the SQL of creating table "location1"?

fredrikIOT commented 10 months ago

Hi @fredrikIOT, appreciate your detailed feedback!

I noticed you also run SELECT * FROM "location1" LIMIT 5 offset 1 and it's slower than SELECT count(*) FROM "location1"; I'm wondering if it is reproducible? And how large is the result set? According to the analyze result it has output_rows=100 100 rows. Would the SQL SELECT * FROM "location1" limit 5 offset 1 return 100 rows (without the ANALYZE) or 5 rows

@waynexia I need to correct an error in my previous message. I referenced the wrong query and result. Here's the accurate information: Correct Query: EXPLAIN ANALYZE SELECT * FROM "location1" ORDER BY ts DESC LIMIT 5 offset 1 Correct Result: MergeScanExec: peers=[4629974745088(1078, 0), ], metrics=[output_rows=5, ready_time=379.715193ms, finish_time=9.498069549s, first_consume_time=9.498045694s]

While reviewing this, I've noticed the significant impact of the ORDER BY clause on the query performance. Here are some observations:

SELECT * FROM "location1" ORDER BY ts DESC LIMIT 5 offset 1 uses 5372 ms
SELECT * FROM "location1" LIMIT 5 offset 1 uses 213 ms
SELECT COUNT(*) FROM "location1"; uses 2787 ms

So, I guess the question of whether COUNT takes more time than LIMIT is no longer valid. However, it is reproducible that COUNT is slower when using ORDER BY.

fredrikIOT commented 10 months ago

@fredrikIOT Can you share the SQL of creating table "location1"?

@MichaelScofield Here's the SQL for creating the "location1" table:

CREATE TABLE IF NOT EXISTS "location1" (
  "location_uid" STRING NULL,
  "room_uid" STRING NULL,
  "log_uid" STRING NULL,
  "sensor_name" STRING NULL,
  "value" DOUBLE NULL,
  "ts" TIMESTAMP(3) NOT NULL,
  TIME INDEX ("ts"),
  PRIMARY KEY ("location_uid", "room_uid", "log_uid", "sensor_name")
) ENGINE=mito WITH(regions = 1)

As I mentioned in my previous message, the performance issue seems to be primarily due to the ORDER BY clause. However, having the table structure might provide further insights into any other potential performance implications.

waynexia commented 10 months ago

So, I guess the question of whether COUNT takes more time than LIMIT is no longer valid. However, it is reproducible that COUNT is slower when using ORDER BY.

Yes, that's correct. The main reason that SELECT count(*) FROM "location1"; runs faster than SELECT * FROM "location1" ORDER BY ts DESC LIMIT 100; is the ORDER BY ts. Unfortunately we haven't optimized for this scenario -- though we know it's important for time series scenarios. We used to have one optimization that could speed up ORDER BY ts by a fascinating percentage, but it's not adapted yet after a major refactor :face_with_peeking_eye:

We can keep this issue open as a tracker.

killme2008 commented 8 months ago

Do we have any plan to optimize for this case? @waynexia @evenyag

evenyag commented 8 months ago

I'm afraid that I don't have enough time to finish it in v0.8, maybe I can optimize some cases in v0.9.

evenyag commented 1 week ago

I'm afraid that I don't have enough time to finish it in v0.8, maybe I can optimize some cases in v0.9.

We implemented some optimize rules and execution plans for this case.

v0.10.0 will include these changes. Now order by time index will sort each time window independently instead of sorting all rows in a table.

GreptimeTeam / greptimedb