GreptimeTeam / greptimedb

An open-source, cloud-native, unified time series database for metrics, logs and events with SQL/PromQL supported. Available on GreptimeCloud.
https://greptime.com/
Apache License 2.0
4.26k stars 307 forks source link

Projection pushdown regression in 0.5.0-nightly #2730

Closed evenyag closed 6 months ago

evenyag commented 11 months ago

What type of bug is this?

Performance issue

What subsystems are affected?

Standalone mode, Datanode

What happened?

The following SQL has performance regression in 0.5.0-nightly

SELECT mean(usage_user), date_trunc('hour', ts) from cpu where ts >= '2023-06-12T23:35:31Z' and ts < '2023-06-13T11:35:31Z' group by date_trunc('hour', ts),hostname;

What operating system did you use?

Unrelated

Relevant log output and stack trace

2023-11-10T09:10:46.524076Z DEBUG mito2::read::scan_region: Seq scan region 4398046511104(1024, 0), request: ScanRequest { sequence: None, projection: Some([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]), filters: [Expr { df_expr: BinaryExpr(BinaryExpr { left: Column(Column { relation: None, name: "ts" }), op: GtEq, right: Literal(TimestampMillisecond(1686612931000, None)) }) }, Expr { df_expr: BinaryExpr(BinaryExpr { left: Column(Column { relation: None, name: "ts" }), op: Lt, right: Literal(TimestampMillisecond(1686656131000, None)) }) }], output_ordering: None, limit: None }, memtables: 0, ssts_to_read: 1, total_ssts: 5

How can we reproduce the bug?

Creates this table

CREATE TABLE IF NOT EXISTS `cpu` (
  `hostname` STRING NULL,
  `region` STRING NULL,
  `datacenter` STRING NULL,
  `rack` STRING NULL,
  `os` STRING NULL,
  `arch` STRING NULL,
  `team` STRING NULL,
  `service` STRING NULL,
  `service_version` STRING NULL,
  `service_environment` STRING NULL,
  `usage_user` BIGINT NULL,
  `usage_system` BIGINT NULL,
  `usage_idle` BIGINT NULL,
  `usage_nice` BIGINT NULL,
  `usage_iowait` BIGINT NULL,
  `usage_irq` BIGINT NULL,
  `usage_softirq` BIGINT NULL,
  `usage_steal` BIGINT NULL,
  `usage_guest` BIGINT NULL,
  `usage_guest_nice` BIGINT NULL,
  `ts` TIMESTAMP(3) NOT NULL,
  TIME INDEX (`ts`),
  PRIMARY KEY (`hostname`, `region`, `datacenter`, `rack`, `os`, `arch`, `team`, `service`, `service_version`, `service_environment`)
)

Inserts some data and run

SELECT mean(usage_user), date_trunc('hour', ts) from cpu where ts >= '2023-06-12T23:35:31Z' and ts < '2023-06-13T11:35:31Z' group by date_trunc('hour', ts),hostname;

You can see all columns are selected in 0.5.0-nightly. But in 0.4.2, we only select necessary columns.

waynexia commented 11 months ago

Upstream fix: https://github.com/apache/arrow-datafusion/pull/8119

killme2008 commented 11 months ago

Looks like https://github.com/apache/arrow-datafusion/pull/8119 already merged, can we upgrade the datafusion to fix this issue? @waynexia

killme2008 commented 9 months ago

Upstream fix: apache/arrow-datafusion#8119

Please don't forget to upgrade the datafusion.

tisonkun commented 6 months ago

@waynexia did we upgrade datafusion? Or it'll be fixed by https://github.com/GreptimeTeam/greptimedb/pull/3554.

waynexia commented 6 months ago

This is fixed.