apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0
2.43k stars 956 forks

[core] Optimize IN filter pushdown to snapshot/tag/schema system tables #4436

Closed xuzifu666 closed 2 weeks ago

xuzifu666 commented 2 weeks ago

Purpose

Linked issue: close #xxx

  1. Currently, an IN filter cannot be pushed down to the snapshot/tag/schema system tables. IN is transformed into an OR predicate containing one Equal LeafPredicate per value, so a query with an IN filter incurs unnecessary IO when there is a large number of snapshots/tags/schemas.
  2. The TagTable predicate was declared as a LeafPredicate, so it could not handle other kinds of Predicate; this PR fixes that.
  3. Add APIs for fetching snapshots/tags/schemas that take a list argument and return multiple entries.

For example, take the SQL: select snapshot_id, schema_id, commit_user from T$snapshots where snapshot_id in (1, 3); Before this PR, the query would first read all snapshot files and then filter for 1 and 3. After this PR, it reads only the snapshot files for 1 and 3 rather than all of them, which reduces unnecessary IO.

Note: if the filter also involves other fields, it falls back to the previous behavior (e.g. select snapshot_id, schema_id, commit_user from T$snapshots where snapshot_id in (1, 3) or/and commit_user='xxx';). The optimization applies only to an IN filter on snapshot_id.
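The pushdown decision described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Predicate`, `Leaf`, `Or`, and `And` types and the `extractSnapshotIds` helper are hypothetical stand-ins for Paimon's real predicate classes. The sketch assumes the filter arrives either as a single IN leaf or as an OR of Equal leaves, and returns an empty `Optional` whenever another field is involved, so the caller can fall back to reading all snapshots.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.TreeSet;

/** Hypothetical, simplified predicate model; NOT Paimon's actual Predicate classes. */
class SnapshotIdPushdownSketch {

    interface Predicate {}

    /** A single-field comparison such as snapshot_id = 1 or snapshot_id IN (1, 3). */
    static final class Leaf implements Predicate {
        final String field;
        final String op; // "EQUAL", "IN", or any other operator name
        final List<Object> literals;

        Leaf(String field, String op, Object... literals) {
            this.field = field;
            this.op = op;
            this.literals = Arrays.asList(literals);
        }
    }

    static final class Or implements Predicate {
        final List<Predicate> children;

        Or(Predicate... children) {
            this.children = Arrays.asList(children);
        }
    }

    static final class And implements Predicate {
        final List<Predicate> children;

        And(Predicate... children) {
            this.children = Arrays.asList(children);
        }
    }

    /**
     * Returns the snapshot ids to read when the whole predicate is a point lookup
     * on snapshot_id; returns empty when the caller must scan all snapshots.
     */
    static Optional<TreeSet<Long>> extractSnapshotIds(Predicate p) {
        if (p instanceof Leaf) {
            Leaf leaf = (Leaf) p;
            boolean onSnapshotId = "snapshot_id".equals(leaf.field);
            boolean pointLookup = "EQUAL".equals(leaf.op) || "IN".equals(leaf.op);
            if (!onSnapshotId || !pointLookup) {
                return Optional.empty();
            }
            TreeSet<Long> ids = new TreeSet<>();
            for (Object literal : leaf.literals) {
                ids.add(((Number) literal).longValue());
            }
            return Optional.of(ids);
        }
        if (p instanceof Or) {
            // Every branch must itself be a snapshot_id lookup; otherwise give up.
            TreeSet<Long> ids = new TreeSet<>();
            for (Predicate child : ((Or) p).children) {
                Optional<TreeSet<Long>> childIds = extractSnapshotIds(child);
                if (!childIds.isPresent()) {
                    return Optional.empty();
                }
                ids.addAll(childIds.get());
            }
            return Optional.of(ids);
        }
        // AND and any other predicate shape fall back to the full scan in this sketch.
        return Optional.empty();
    }

    public static void main(String[] args) {
        Predicate in = new Leaf("snapshot_id", "IN", 1L, 3L);
        Predicate mixed = new And(in, new Leaf("commit_user", "EQUAL", "xxx"));
        System.out.println(extractSnapshotIds(in));    // Optional[[1, 3]]
        System.out.println(extractSnapshotIds(mixed)); // Optional.empty
    }
}
```

With a non-empty result, a caller would load only the listed snapshot files; with an empty result it would keep the old behavior of reading every snapshot and filtering afterwards.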

Tests

API and Format

Documentation

xuzifu666 commented 2 weeks ago

@wwj6591812 Thanks for the review, addressed.

wwj6591812 commented 2 weeks ago

+1

LinMingQiang commented 2 weeks ago

+1