MaibornWolff / database-performance-comparison

Apache License 2.0
27 stars 7 forks source link

Analyze/improve query performance for Yugabyte #7

Open swoehrl-mw opened 2 years ago

swoehrl-mw commented 2 years ago

With the newest version of YugabyteDB and some optimizations insert performance for YSQL has increased dramatically to about 295000 inserts/s in our testcase (copy mode, batch size 1000), bringing it up to par with the YCQL interface (which also increased slightly to about 290000).

But the query performance is unexpectedly slow. For the usecase with 500 million rows I get the following query times for our test queries:

One more issue: When creating the index events_device_ts sometimes the following error happens: psycopg2.errors.InternalError_: Aborted: ERROR: Query error: schema version mismatch for table 000033e6000030008000000000004104: expected 1, got 0. The index still seems to get created as running the CREATE INDEX statement again yields NOTICE: relation "events_device_ts" already exists, skipping.

For reference the commands:

@FranckPachot I hope your offer of help is still valid? I would be grateful for some pointers on how to deal with the temp_file_limit and any ideas on how to speed up the other queries. With the latest optimizations insert performance is formidable, so I would love to get query performance to match.

FranckPachot commented 2 years ago

I'll look at this. About the error when creating the index, I don't expect it to happen if there are no other DDL appending at the same time.

FranckPachot commented 2 years ago
swoehrl-mw commented 2 years ago

Hi @FranckPachot. Thank you for your suggestions. I did another test and made sure the index was created correctly.

postgres=# explain SELECT device_id, temperature from (SELECT device_id, temperature, timestamp=max(timestamp) over (partition by device_id) newest FROM events) e where newest;
                                              QUERY PLAN                                              
------------------------------------------------------------------------------------------------------
 Subquery Scan on e  (cost=0.00..141.50 rows=500 width=36)
   Filter: e.newest
   ->  WindowAgg  (cost=0.00..131.50 rows=1000 width=37)
         ->  Index Only Scan using events_device_ts on events  (cost=0.00..114.00 rows=1000 width=44)
(4 rows)
FranckPachot commented 2 years ago

Yes, I thought that the PostgreSQL WindowAgg was smarter as there's no need to materialize all rows into temp to get the max() over a sorted set. I got it work with set temp_file_limit='50GB'. However, here is something smarter. This should get the same result, with limited reads (it gets to the last device/last timestamp, then gets to the next device/last timestamp... with fast IndexOnlyScan returning one row only for each device):

with recursive device_last_event as (
 -- start at the last device_id / last timestamp
 (
  select
   last_device_last_event.device_id,
   last_device_last_event.temperature
   from events last_device_last_event
  order by
   last_device_last_event.device_id desc,
   last_device_last_event.timestamp desc
  limit 1
 )
union all
 -- from this  last device_id / last timestamp get the next one
 select
  next_device_last_event.device_id,
  next_device_last_event.temperature
 from device_last_event
 , lateral
 (
  select device_id,temperature from events
  where device_id < device_last_event.device_id
  order by device_id desc, timestamp desc limit 1
 )
 as next_device_last_event
) 
select * from device_last_event
order by device_id,temperature;

This uses a well know postgres workaround for loose index scan and works in YugabyteDB 2.15.1 (was not in 2.15.0)

swoehrl-mw commented 2 years ago

Hi @FranckPachot. I've finally found the time to test your workaround and can confirm the query is very fast (0.08s, basically what I would expect of an index-only query). I've updated the repo README with updated results for Yugabyte. But to be fair I have not included that last result in the comparison table but made a notation of it in the explanations. Thanks again for your help.

FranckPachot commented 1 year ago

Hi @swoehrl-mw the latest release of YugabyteDB (2.17.1.0-b439) has implemented loose index scan for DISTINCT, which means that the newest-per-device can be blazing fast if written as:

select * from
(
   select distinct device_id from events where device_id>'' -- need a where clause to have IndexScan
) devices
, lateral (
   select temperature from events
   where device_id=devices.device_id
   order by timestamp desc limit 1
  ) as events
;

I don't think this optimization helps on PostgreSQL. Maybe on Timescale DB.

swoehrl-mw commented 1 year ago

Hi @FranckPachot. Thanks for the update. I will include that in the next benchmark round.