apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[Feature] Auto complete missing tags #3647

Closed luowanghaoyun closed 3 months ago

luowanghaoyun commented 3 months ago

Motivation

In automatic tag creation mode, if the Flink write task is stopped for longer than one creation period, the tags for the periods that elapsed while it was stopped are never created. For example:

        Options options = new Options();
        options.set(TAG_AUTOMATIC_CREATION, TagCreationMode.WATERMARK);
        options.set(TAG_CREATION_PERIOD, TagCreationPeriod.HOURLY);
        options.set(TAG_NUM_RETAINED_MAX, 3);
        FileStoreTable table = this.table.copy(options.toMap());
        TableCommitImpl commit = table.newCommit(commitUser).ignoreEmptyCommit(false);
        TagManager tagManager = table.store().newTagManager();

        // test normal creation
        commit.commit(new ManifestCommittable(0, utcMills("2023-07-18T12:12:00")));
        assertThat(tagManager.allTagNames()).containsOnly("2023-07-18 11");

        // the task is stopped for a long time

        // the task restarts
        commit.commit(new ManifestCommittable(1, utcMills("2023-07-18T15:12:00")));
        // tags '2023-07-18 12' and '2023-07-18 13' are missing
        assertThat(tagManager.allTagNames()).containsOnly("2023-07-18 11", "2023-07-18 14");

Downstream tasks that depend on a tag existing for every period may be affected by the missing tags.
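
To make the gap concrete, here is a minimal, self-contained sketch of how the skipped hourly tags could be enumerated from two consecutive commit watermarks. It is not Paimon code: the class `MissingTagsSketch` and the method `missingHourlyTags` are hypothetical names. It assumes, as in the example above, that the tag for hour H is created by the first commit whose watermark falls in hour H+1, and that tag names follow the `yyyy-MM-dd HH` pattern.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Paimon.
final class MissingTagsSketch {

    // Same naming pattern as the hourly tags in the example, e.g. "2023-07-18 11".
    private static final DateTimeFormatter HOURLY =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH");

    /**
     * Returns the names of the hourly tags that were skipped between two
     * consecutive commits, given the watermark of each commit.
     */
    static List<String> missingHourlyTags(
            LocalDateTime previousWatermark, LocalDateTime currentWatermark) {
        // The previous commit created the tag for the hour before its own hour,
        // so the next expected tag is the previous watermark's own hour.
        LocalDateTime nextExpected = previousWatermark.truncatedTo(ChronoUnit.HOURS);
        // The current commit creates the tag for the hour before its own hour.
        LocalDateTime createdNow =
                currentWatermark.truncatedTo(ChronoUnit.HOURS).minusHours(1);

        List<String> missing = new ArrayList<>();
        for (LocalDateTime t = nextExpected; t.isBefore(createdNow); t = t.plusHours(1)) {
            missing.add(t.format(HOURLY));
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> missing = missingHourlyTags(
                LocalDateTime.parse("2023-07-18T12:12:00"),
                LocalDateTime.parse("2023-07-18T15:12:00"));
        System.out.println(missing); // prints [2023-07-18 12, 2023-07-18 13]
    }
}
```

With the two watermarks from the example, the sketch returns exactly the two tags called out as missing. An actual fix would also have to decide which snapshot each back-filled tag should point to, which this sketch does not address.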

Solution

No response

Anything else?

No response