apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[Feature] When the MergeEngine of the dimension table is aggregation / partial-update, there is no need to forcibly enable changelog-producer. #3868

Open liming30 opened 1 month ago

liming30 commented 1 month ago

Motivation

Currently, two built-in compaction strategies are provided, but they are mainly selected automatically based on MergeEngine / ChangelogProducer / DeletionVectors. When a dimension table uses the aggregation or partial-update MergeEngine, we have to set CHANGELOG_PRODUCER to lookup. But when the lookup goes through PrimaryKeyPartialLookupTable, the changelog is never read, so generating it is wasted work.

Therefore, I hope to add a compact-strategy configuration, so that CHANGELOG_PRODUCER can be enabled only when necessary.

At the same time, for FullCacheLookupTable, even if the user does not enable CHANGELOG_PRODUCER, we can also obtain the changelog through IncrementalDiffSplitRead.
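
To illustrate the idea of deriving a changelog from a diff instead of from changelog files, here is a minimal sketch. It is not Paimon's IncrementalDiffSplitRead; the class `SnapshotDiffSketch`, the method `diff`, and the string row format are hypothetical and only show how comparing two merged snapshots keyed by primary key can yield +I / -U / +U / -D rows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

/** Hypothetical illustration only; not Paimon's IncrementalDiffSplitRead. */
final class SnapshotDiffSketch {

    /**
     * Compares two already-merged snapshots (primary key -> row value)
     * and returns changelog-style rows describing how to get from
     * {@code before} to {@code after}.
     */
    static List<String> diff(Map<String, String> before, Map<String, String> after) {
        // Sort by key so the output order is deterministic.
        Map<String, String> oldRows = new TreeMap<>(before);
        Map<String, String> newRows = new TreeMap<>(after);
        List<String> changes = new ArrayList<>();

        // Keys present before: either deleted, updated, or unchanged.
        for (Map.Entry<String, String> e : oldRows.entrySet()) {
            String key = e.getKey();
            if (!newRows.containsKey(key)) {
                changes.add("-D " + key + "=" + e.getValue());
            } else if (!Objects.equals(e.getValue(), newRows.get(key))) {
                changes.add("-U " + key + "=" + e.getValue());
                changes.add("+U " + key + "=" + newRows.get(key));
            }
        }

        // Keys present only in the newer snapshot are inserts.
        for (Map.Entry<String, String> e : newRows.entrySet()) {
            if (!oldRows.containsKey(e.getKey())) {
                changes.add("+I " + e.getKey() + "=" + e.getValue());
            }
        }
        return changes;
    }
}
```

Because both inputs are already merged, the diff does not depend on which MergeEngine produced them, which is why this route could work for aggregation and partial-update tables as well.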

Solution

I would like to do it in two parts:

  1. Add the compact-strategy configuration, so that other types of MergeEngine can be compacted quickly without writing a changelog.

  2. Adjust the refresh strategy of FullCacheLookupTable to support streaming updates for other types of MergeEngine without enabling CHANGELOG_PRODUCER.

JingsongLi commented 1 month ago

What case do you want to solve? Lookup Join for partial-update table without changelog-producer? If this is your requirement, can we just modify Flink LookupJoin Function?

liming30 commented 1 month ago

> What case do you want to solve? Lookup Join for partial-update table without changelog-producer? If this is your requirement, can we just modify Flink LookupJoin Function?

@JingsongLi the dim table usually has no streaming consumption jobs, so generating a changelog is useless. In most cases, we will ensure that the primary key of the dim table is the same as the key of the lookup-join, so we will use PrimaryKeyPartialLookupTable to perform the lookup.

By adding a prefer-compaction-strategy, I think most of the code can be reused. If we perform the merge operation in the LookupJoin Function, this will result in a merge overhead in each job.

JingsongLi commented 1 week ago

Hi @liming30, so what you want to support is just lookup on partial-update and aggregation tables, without a changelog?

Why? Most of the cost is in the lookup; the cost of the changelog is not that high.

liming30 commented 1 week ago

This is a follow-up to #3905. Dim tables do not require streaming consumption in most cases, so we can skip generating changelog files and reduce write IO.

When the primary key of the lookup join is exactly the same as the primary key of the table, we can use PrimaryKeyPartialLookupTable without reading the changelog file. When the primary key of the lookup join is inconsistent with the primary key of the table, we can use FullCacheLookupTable based on the diff generated by compaction, so I hope to relax the restrictions on FullCacheLookupTable for dim tables.
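
The two cases above can be sketched as a small decision function. This is a simplified illustration under stated assumptions, not Paimon's actual selection code: the class `LookupTableChoiceSketch`, the enum, and both methods are hypothetical, the key comparison is assumed to be order-insensitive, and the real choice may depend on other table options.

```java
import java.util.HashSet;
import java.util.List;

/** Hypothetical sketch of the lookup-table choice described in this comment. */
final class LookupTableChoiceSketch {

    /** Stand-ins for PrimaryKeyPartialLookupTable and FullCacheLookupTable. */
    enum LookupTableKind { PRIMARY_KEY_PARTIAL, FULL_CACHE }

    /**
     * Picks the partial lookup when the join keys are exactly the table's
     * primary keys (compared as sets, an assumption), otherwise the full cache.
     */
    static LookupTableKind choose(List<String> primaryKeys, List<String> joinKeys) {
        boolean sameKeys = new HashSet<>(primaryKeys).equals(new HashSet<>(joinKeys));
        return sameKeys ? LookupTableKind.PRIMARY_KEY_PARTIAL : LookupTableKind.FULL_CACHE;
    }

    /**
     * Whether changelog files would still be needed under the proposal:
     * only for a full cache that cannot refresh from a compaction diff.
     */
    static boolean needsChangelogFiles(LookupTableKind kind, boolean diffRefreshSupported) {
        return kind == LookupTableKind.FULL_CACHE && !diffRefreshSupported;
    }
}
```

Under this sketch, a dim table joined on its full primary key never needs changelog files, and a table joined on other columns needs them only until the diff-based refresh is available.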