apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[Bug] OOM occurs when Hive reads data using LIMIT #1253

Open Dkbei opened 1 year ago

Dkbei commented 1 year ago

Search before asking

Paimon version

Scenario description:

  1. A partition contains 17 million rows
  2. 1 bucket
  3. Query script: select * from dwd.paimon_table_test where dt='20211231' limit 10;
  4. Number of files in the partition: 583; average file size: 2.6 MB

Abnormal information:

[screenshot attached in the original issue]

When a LIMIT query is executed, Hive's fetch task cannot be used; the data is read through a full MapReduce job instead.
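For context, Hive's fetch-task path is controlled by the hive.fetch.task.conversion setting; a minimal sketch of how a simple filtered LIMIT query is normally served on a native Hive table (the hive CLI invocation is illustrative; whether this setting takes effect for Paimon's storage handler is exactly what this issue questions):

```shell
# Sketch, assuming a Hive CLI is available on this cluster.
# hive.fetch.task.conversion (none/minimal/more) controls when Hive
# serves a query with a local fetch task instead of MapReduce.
# With 'more', a filter + LIMIT on a native table is normally fetched
# directly; the report here is that Paimon tables fall back to a
# full MapReduce read instead.
hive -e "
set hive.fetch.task.conversion=more;
select * from dwd.paimon_table_test where dt='20211231' limit 10;
"
```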

Compute Engine

Hive: CDH 6.3.2, Hive 2.1.1; Paimon: master branch

Minimal reproduce step

A large amount of data was written into the Paimon table from a Hive table, generating many small files.

What doesn't meet your expectations?

The LIMIT operation should not cause an OOM, and a LIMIT query should be served by a direct fetch.

Anything else?

No response

Are you willing to submit a PR?

JingsongLi commented 1 year ago

A large amount of data was written into the Paimon table from a Hive table, generating many small files.

No compaction? The Hive writer does no compaction. You need to launch a Flink job to perform compaction.
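For reference, Paimon ships a dedicated compaction action that can be submitted as a Flink job; a sketch of the invocation (the Flink home, action jar path and version, warehouse path, and partition spec below are placeholders for this environment — check the Paimon docs for the exact jar name):

```shell
# Sketch: submit Paimon's dedicated compaction action as a Flink job.
# <FLINK_HOME>, the action jar path/version, and the warehouse path
# are placeholders; database/table/partition are taken from the report.
<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-<version>.jar \
    compact \
    --warehouse hdfs:///path/to/warehouse \
    --database dwd \
    --table paimon_table_test \
    --partition dt=20211231
```

After the compaction job merges the small files, the 583 files averaging 2.6 MB should collapse into far fewer, larger files, which is what the LIMIT read path expects.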

Dkbei commented 1 year ago

A large amount of data was written into the Paimon table from a Hive table, generating many small files.

No compaction? The Hive writer does no compaction. You need to launch a Flink job to perform compaction.

If the same statement is run against a Hive table that contains a large number of small files, no OOM occurs.

JingsongLi commented 1 year ago

A large amount of data was written into the Paimon table from a Hive table, generating many small files.

No compaction? The Hive writer does no compaction. You need to launch a Flink job to perform compaction.

If the same statement is run against a Hive table that contains a large number of small files, no OOM occurs.

Yes, they are not just small files; they are unmerged files. The Hive writer has no compaction; only the Flink and Spark writers perform compaction.