apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[Bug] OOM occurs when Hive reads data using LIMIT #1253

Open Dkbei opened 1 year ago

Dkbei commented 1 year ago

Search before asking

Paimon version

Scenario description:

  1. A partition contains 17 million rows
  2. 1 bucket
  3. Query script: select * from dwd.paimon_table_test where dt='20211231' limit 10;
  4. Number of files in the partition: 583; average file size: 2.6 MB

Abnormal information:

[screenshot attached in the original issue]

When a LIMIT query is executed, Hive's fetch task cannot be used; the data is read through a full MapReduce job instead.
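For context, Hive's fetch-task path is controlled by the hive.fetch.task.conversion setting; a minimal sketch of how a simple filtered LIMIT query is normally served on a native Hive table (the hive CLI invocation is illustrative; whether this setting takes effect for Paimon's storage handler is exactly what this issue questions):

```shell
# Sketch, assuming a Hive CLI is available on this cluster.
# hive.fetch.task.conversion (none/minimal/more) controls when Hive
# serves a query with a local fetch task instead of MapReduce.
# With 'more', a filter + LIMIT on a native table is normally fetched
# directly; the report here is that Paimon tables fall back to a
# full MapReduce read instead.
hive -e "
set hive.fetch.task.conversion=more;
select * from dwd.paimon_table_test where dt='20211231' limit 10;
"
```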

Compute Engine

Hive: CDH 6.3.2, Hive 2.1.1; Paimon: master branch

Minimal reproduce step

A large amount of data was written into the Paimon table from a Hive table, generating many small files.

What doesn't meet your expectations?

The LIMIT operation should not cause an OOM, and a LIMIT query should be served by a direct fetch.

Anything else?

No response

Are you willing to submit a PR?

JingsongLi commented 1 year ago

A large amount of data was written into the Paimon table from a Hive table, generating many small files.

No compaction? The Hive writer does no compaction. You need to launch a Flink job to perform compaction.
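For reference, Paimon ships a dedicated compaction action that can be submitted as a Flink job; a sketch of the invocation (the Flink home, action jar path and version, warehouse path, and partition spec below are placeholders for this environment — check the Paimon docs for the exact jar name):

```shell
# Sketch: submit Paimon's dedicated compaction action as a Flink job.
# <FLINK_HOME>, the action jar path/version, and the warehouse path
# are placeholders; database/table/partition are taken from the report.
<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-<version>.jar \
    compact \
    --warehouse hdfs:///path/to/warehouse \
    --database dwd \
    --table paimon_table_test \
    --partition dt=20211231
```

After the compaction job merges the small files, the 583 files averaging 2.6 MB should collapse into far fewer, larger files, which is what the LIMIT read path expects.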

Dkbei commented 1 year ago

A large amount of data was written into the Paimon table from a Hive table, generating many small files.

No compaction? The Hive writer does no compaction. You need to launch a Flink job to perform compaction.

If the same statement is run against a Hive table that contains a large number of small files, no OOM occurs.

JingsongLi commented 1 year ago

A large amount of data was written into the Paimon table from a Hive table, generating many small files.

No compaction? The Hive writer does no compaction. You need to launch a Flink job to perform compaction.

If the same statement is run against a Hive table that contains a large number of small files, no OOM occurs.

Yes, they are not just small files; they are unmerged files. The Hive writer has no compaction; only the Flink and Spark writers perform compaction.