apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[Feature] Introduce key-value cache for paimon lookup operator in flink #3428

Open FangYongs opened 4 months ago

FangYongs commented 4 months ago

Motivation

When we use Paimon as the source for foreign key joins, it is usually necessary to look up the source table.

For example, there are two tables:

1) Table A(a, b, c, c1, c2, c3, c4, c5), where a is the primary key
2) Table B(c, d, e, e1, e2, e3, e4, e5), where c is the primary key

Now we need to perform A JOIN B on A.c = B.c and output the result (a, b, c, d, e, c1, c2, c3, c4, c5, e1, e2, e3, e4, e5).

In Flink, we can convert the foreign key join into a primary key join. We first join A(a, c) with B(c) to obtain the related (a, c) pairs, then look up A and B by a and c respectively, and finally output the resulting rows. During this process, because the Paimon dimension table loads incremental data with a delay (10 seconds by default), a lookup for a related (a, c) pair may fail to find the corresponding rows of A and B in time, which produces incorrect output.
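The two-step plan and the failure mode can be illustrated with a small sketch. This is not Flink or Paimon code: the dictionaries stand in for the tables (with fewer columns than in the example above), and `join_keys`, `enrich`, and `stale_b` are hypothetical names chosen for illustration.

```python
# Hypothetical in-memory stand-ins for tables A and B, keyed by primary key.
table_a = {
    1: {"a": 1, "b": "b1", "c": 10, "c1": "x"},
    2: {"a": 2, "b": "b2", "c": 20, "c1": "y"},
}
table_b = {
    10: {"c": 10, "d": "d10", "e": "e10"},
    20: {"c": 20, "d": "d20", "e": "e20"},
}


def join_keys(a_rows, b_rows):
    """Step 1: join the narrow projections A(a, c) and B(c) into (a, c) pairs."""
    b_keys = set(b_rows)
    return [(row["a"], row["c"]) for row in a_rows.values() if row["c"] in b_keys]


def enrich(pairs, a_lookup, b_lookup):
    """Step 2: look up the full rows of A and B by primary key and merge them."""
    out = []
    for a_key, c_key in pairs:
        a_row = a_lookup.get(a_key)
        b_row = b_lookup.get(c_key)
        if a_row is None or b_row is None:
            # The lookup view has not loaded this key yet, so the row cannot be built.
            out.append((a_key, c_key, None))
        else:
            out.append((a_key, c_key, {**a_row, **b_row}))
    return out


pairs = join_keys(table_a, table_b)

# Up-to-date lookup views: every (a, c) pair is enriched.
fresh = enrich(pairs, table_a, table_b)

# Lagging lookup view of B: key 20 has been written but is not yet visible.
stale_b = {10: table_b[10]}
stale = enrich(pairs, table_a, stale_b)
```

With the lagging view, the pair (2, 20) is produced by the join but cannot be enriched, which corresponds to the incorrect output described above.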

To solve this issue, I'd like to introduce a key-value cache in Paimon for the lookup operator. When data is written to Paimon, it can also be written to the key-value cache before the snapshot is created. Then, when the downstream operator looks up data from Paimon, it can always find the correct data in the key-value cache.
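A minimal sketch of the idea, assuming a write-through cache that is consulted before the committed snapshot. `SnapshotTable` and `CachedLookup` are hypothetical classes for illustration only. They are not Paimon APIs, and the sketch leaves out eviction, expiry, and failure recovery, which a real design would need to address.

```python
class SnapshotTable:
    """Hypothetical table whose rows become visible to lookups only after commit()."""

    def __init__(self):
        self._pending = {}
        self._committed = {}

    def write(self, key, row):
        # Buffered until the next snapshot is created.
        self._pending[key] = row

    def commit(self):
        # Creating the snapshot makes the buffered rows visible.
        self._committed.update(self._pending)
        self._pending.clear()

    def lookup(self, key):
        return self._committed.get(key)


class CachedLookup:
    """Write-through key-value cache placed in front of the snapshot table."""

    def __init__(self, table):
        self.table = table
        self.cache = {}

    def write(self, key, row):
        # The row reaches the cache before the snapshot exists.
        self.cache[key] = row
        self.table.write(key, row)

    def lookup(self, key):
        # Prefer the cache; fall back to the committed snapshot.
        row = self.cache.get(key)
        return row if row is not None else self.table.lookup(key)


table = SnapshotTable()
cached = CachedLookup(table)
cached.write(20, {"c": 20, "d": "d20"})

before_commit_direct = table.lookup(20)   # not yet visible in the snapshot
before_commit_cached = cached.lookup(20)  # served from the cache

table.commit()
after_commit_direct = table.lookup(20)    # now visible in the snapshot too
```

In this sketch a lookup through the cache sees the row as soon as it is written, while a direct lookup has to wait for the commit.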

Solution

No response

Anything else?

No response

ArthurSXL8 commented 1 month ago

Any progress here? Very useful feature and looking forward to release