Open liming30 opened 1 month ago
What case do you want to solve? Lookup Join for partial-update table without changelog-producer? If this is your requirement, can we just modify Flink LookupJoin Function?
@JingsongLi the dim table usually has no streaming consumption jobs, so generating a changelog is useless. In most cases, we will ensure that the primary key of the dim table is the same as the key of the lookup-join, so we will use PrimaryKeyPartialLookupTable to perform the lookup.
By adding a prefer-compaction-strategy, I think most of the code can be reused. If we perform the merge operation in the LookupJoin Function, this will result in a merge overhead in each job.
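To make the merge overhead concrete, here is a minimal sketch of what every lookup job would have to repeat if the merge happened at lookup time instead of at compaction time. It assumes partial-update semantics where the latest non-null value of each field wins; the class and method names are hypothetical and this is not Paimon's actual merge function.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: hypothetical names, not Paimon's actual merge function. */
public class PartialUpdateMergeSketch {

    /**
     * Merges all stored versions of one primary key, oldest first.
     * Assumption: for each field, the latest non-null value wins.
     */
    static Object[] merge(List<Object[]> versionsOldestFirst, int fieldCount) {
        Object[] merged = new Object[fieldCount];
        for (Object[] version : versionsOldestFirst) {
            for (int i = 0; i < fieldCount; i++) {
                if (version[i] != null) {
                    // A newer non-null value overwrites the older one.
                    merged[i] = version[i];
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Three partial writes for the same key (field 0 is the key).
        List<Object[]> versions = Arrays.asList(
                new Object[] {1, "alice", null},
                new Object[] {1, null, "beijing"},
                new Object[] {1, "alice2", null});
        // Prints [1, alice2, beijing]
        System.out.println(Arrays.toString(merge(versions, 3)));
    }
}
```

If compaction has already produced the merged row, a lookup reads a single version per key; otherwise each job pays for this loop on every key it looks up.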
Hi @liming30, so what you want to support is just partial-update and aggregation tables with lookup but without changelog?
Why? Most of the cost is in the lookup; the cost of the changelog is not so high.
This issue follows up on #3905. Dim tables do not require streaming consumption in most cases, so changelog files do not need to be generated, which reduces write IO.
When the primary key of the lookup join is exactly the same as the primary key of the table, we can use PrimaryKeyPartialLookupTable without reading the changelog file. When the primary key of the lookup join is inconsistent with the primary key of the table, we can use FullCacheLookupTable based on the diff generated by compaction, so I hope to relax the restrictions on FullCacheLookupTable for dim tables.
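The choice between the two lookup tables described above can be sketched as follows. This is only an illustration of the rule stated in this comment: the class, enum, and method names are hypothetical, and the real selection logic in Paimon may take more conditions into account.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

/** Illustrative only: hypothetical names, not Paimon's actual selection code. */
public class LookupTableChoiceSketch {

    /** Stand-ins for PrimaryKeyPartialLookupTable and FullCacheLookupTable. */
    enum LookupMode { PRIMARY_KEY_PARTIAL, FULL_CACHE }

    /**
     * Picks the partial lookup when the join keys are exactly the table's
     * primary keys (order-insensitive), and the full cache otherwise.
     */
    static LookupMode choose(List<String> primaryKeys, List<String> joinKeys) {
        boolean sameKeys = new HashSet<>(primaryKeys).equals(new HashSet<>(joinKeys));
        return sameKeys ? LookupMode.PRIMARY_KEY_PARTIAL : LookupMode.FULL_CACHE;
    }

    public static void main(String[] args) {
        List<String> pk = Arrays.asList("user_id", "region");
        // Join on the full primary key: point lookups, no changelog needed.
        System.out.println(choose(pk, Arrays.asList("region", "user_id")));
        // Join on another column set: full cache refreshed from compaction diffs.
        System.out.println(choose(pk, Arrays.asList("user_id")));
    }
}
```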
Search before asking
Motivation
Currently, two built-in compaction strategies are provided, but they are mainly set automatically based on MergeEngine/ChangelogProducer/DeletionVectors. In the scenario of using dimension tables, if the MergeEngine of the dimension table is aggregation/partial-update, we have to set CHANGELOG_PRODUCER to lookup. But when using PrimaryKeyPartialLookupTable, changelog is useless.
Therefore, I hope to add a compact-strategy configuration, so that CHANGELOG_PRODUCER can be enabled only when necessary.
At the same time, for FullCacheLookupTable, even if the user does not enable CHANGELOG_PRODUCER, we can also obtain the changelog through IncrementalDiffSplitRead.
Solution
I would like to do it in two parts:
1. Add the compact-strategy configuration, so that other types of MergeEngine can be compacted quickly without writing changelog.
2. Adjust the refresh strategy of FullCacheLookupTable to support streaming updates of other types of MergeEngine without enabling CHANGELOG_PRODUCER.
Anything else?
No response
Are you willing to submit a PR?