[X] I searched in the issues and found nothing similar.
Motivation
When we use Paimon as the source for foreign key joins, it is usually necessary to look up the source table.
For example, consider two tables:
1) Table `A(a, b, c, c1, c2, c3, c4, c5)`, where `a` is the primary key
2) Table `B(c, d, e, e1, e2, e3, e4, e5)`, where `c` is the primary key
Now we need to perform `A JOIN B` on `A.c = B.c` and output the result `(a, b, c, d, e, c1, c2, c3, c4, c5, e1, e2, e3, e4, e5)`.
In Flink, we can convert the foreign key join into a primary key join. We first join `A (a, c)` with `B (c)` to obtain the related keys `(a, c)`, then look up `A` and `B` by the `a` and `c` of those related keys, respectively, and finally output the resulting data.
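
The two-phase plan can be illustrated with a small in-memory model. This is only a sketch in Python, not Flink or Paimon code: the dictionaries `table_a` and `table_b`, the abbreviated column sets, and the function `key_join_then_lookup` are hypothetical stand-ins for the tables and operators described above.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-ins for the two tables, keyed by primary key.
# Columns are abbreviated: A maps a -> (b, c, c1), B maps c -> (d, e, e1).
table_a: Dict[int, Tuple[str, int, str]] = {
    1: ("b1", 10, "c1-1"),
    2: ("b2", 20, "c1-2"),
    3: ("b3", 30, "c1-3"),  # c = 30 has no matching row in B
}
table_b: Dict[int, Tuple[str, str, str]] = {
    10: ("d10", "e10", "e1-10"),
    20: ("d20", "e20", "e1-20"),
}


def key_join_then_lookup(
    a_rows: Dict[int, Tuple[str, int, str]],
    b_rows: Dict[int, Tuple[str, str, str]],
) -> List[tuple]:
    # Phase 1: join only the key columns A(a, c) with B(c)
    # to obtain the related keys (a, c).
    related_keys = [(a, row[1]) for a, row in a_rows.items() if row[1] in b_rows]

    # Phase 2: look up the full rows of A and B by primary key
    # and assemble the output row.
    result = []
    for a, c in related_keys:
        b, _, c1 = a_rows[a]
        d, e, e1 = b_rows[c]
        result.append((a, b, c, d, e, c1, e1))
    return result


rows = key_join_then_lookup(table_a, table_b)
```

In this model both phases read the same dictionaries, so phase 2 always finds the rows that phase 1 matched. The problem described next arises when the lookup side lags behind the data that produced the related keys.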
During this process, because the Paimon dimension table loads incremental data with a delay (10 seconds by default), the lookups into `A` and `B` for a related key `(a, c)` may not see the latest data in time, which produces incorrect output.
To solve this issue, I'd like to introduce a key-value cache in Paimon for the lookup operator. When data is written to Paimon, it can also be written to the key-value cache before the snapshot is created. Then, when the downstream operator reads data from Paimon, it can always look up the correct data from the key-value cache.
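
A minimal sketch of the idea, assuming a simplified model in which written data becomes visible to lookups only once a snapshot is committed. The class `LookupTable` and its methods are hypothetical names for illustration and do not correspond to any existing Paimon API. Cache eviction is left out of the sketch.

```python
from typing import Dict, Optional


class LookupTable:
    """Toy model: lookups see written data only after a snapshot,
    unless the proposed key-value cache is enabled."""

    def __init__(self, use_cache: bool) -> None:
        self.use_cache = use_cache
        self.visible: Dict[int, str] = {}  # data loaded from committed snapshots
        self.pending: Dict[int, str] = {}  # written, but not yet in a snapshot
        self.cache: Dict[int, str] = {}    # proposed key-value cache

    def write(self, key: int, value: str) -> None:
        self.pending[key] = value
        if self.use_cache:
            # Write to the cache before the snapshot is created.
            self.cache[key] = value

    def commit_snapshot(self) -> None:
        # Pending data becomes visible to snapshot-based lookups.
        self.visible.update(self.pending)
        self.pending.clear()

    def lookup(self, key: int) -> Optional[str]:
        # Prefer the cache so that lookups do not wait for the snapshot.
        if self.use_cache and key in self.cache:
            return self.cache[key]
        return self.visible.get(key)


without_cache = LookupTable(use_cache=False)
without_cache.write(1, "row-1")
stale = without_cache.lookup(1)  # snapshot not committed yet, so the row is missing

with_cache = LookupTable(use_cache=True)
with_cache.write(1, "row-1")
fresh = with_cache.lookup(1)  # served from the key-value cache
```

Without the cache, the lookup issued before `commit_snapshot()` misses the row, which mirrors the incorrect join output described above. With the cache, the same lookup returns the row immediately.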
Solution
No response
Anything else?
No response
Are you willing to submit a PR?