apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

dropDeleteFilesOlderThan should be partition level instead of table level #9383

Open zinking opened 10 months ago

zinking commented 10 months ago

Apache Iceberg version

1.4.2 (latest release)

Query engine

Spark

Please describe the bug 🐞

  public List<ManifestFile> apply(TableMetadata base, Snapshot snapshot) {
    // filter any existing manifests
    List<ManifestFile> filtered =
        filterManager.filterManifests(
            SnapshotUtil.schemaFor(base, targetBranch()),
            snapshot != null ? snapshot.dataManifests(ops.io()) : null);
    long minDataSequenceNumber =
        filtered.stream()
            .map(ManifestFile::minSequenceNumber)
            // filter out unassigned sequence numbers in rewritten manifests
            .filter(seq -> seq != ManifestWriter.UNASSIGNED_SEQ)
            .reduce(base.lastSequenceNumber(), Math::min);
    deleteFilterManager.dropDeleteFilesOlderThan(minDataSequenceNumber);

The minDataSequenceNumber is calculated across the whole table, but in theory it should be calculated per partition. A delete file within one partition only applies to data files in that partition.

I am seeing partitioned v2 tables that retain delete files in partitions even though those delete files no longer apply to any data files within that partition.
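To make the difference concrete, here is a minimal sketch that compares the two rules on a toy table. Everything in it is hypothetical: `DeleteFileCleanupSketch` and `FileInfo` are illustration-only names, not Iceberg classes, and it assumes a delete file can be dropped when its sequence number is strictly below the minimum data sequence number it is compared against. It also ignores delete files written with an unpartitioned spec, which can apply across partitions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; none of these types are Iceberg classes. */
class DeleteFileCleanupSketch {

  /** Simplified stand-in for a data or delete file: a partition key plus a sequence number. */
  static final class FileInfo {
    final String partition;
    final long sequenceNumber;

    FileInfo(String partition, long sequenceNumber) {
      this.partition = partition;
      this.sequenceNumber = sequenceNumber;
    }
  }

  /** Table-level rule: one minimum over all data files decides for every partition. */
  static List<FileInfo> droppableTableLevel(List<FileInfo> dataFiles, List<FileInfo> deleteFiles) {
    long min = Long.MAX_VALUE;
    for (FileInfo f : dataFiles) {
      min = Math.min(min, f.sequenceNumber);
    }
    List<FileInfo> droppable = new ArrayList<>();
    for (FileInfo d : deleteFiles) {
      if (d.sequenceNumber < min) {
        droppable.add(d);
      }
    }
    return droppable;
  }

  /** Partition-level rule: each delete file is compared against its own partition's minimum. */
  static List<FileInfo> droppablePartitionLevel(List<FileInfo> dataFiles, List<FileInfo> deleteFiles) {
    Map<String, Long> minByPartition = new HashMap<>();
    for (FileInfo f : dataFiles) {
      minByPartition.merge(f.partition, f.sequenceNumber, Math::min);
    }
    List<FileInfo> droppable = new ArrayList<>();
    for (FileInfo d : deleteFiles) {
      // Assumption: a partition with no live data files can drop all of its delete files.
      long min = minByPartition.getOrDefault(d.partition, Long.MAX_VALUE);
      if (d.sequenceNumber < min) {
        droppable.add(d);
      }
    }
    return droppable;
  }

  public static void main(String[] args) {
    // Partition "a" was compacted at sequence 10; partition "b" still holds a data file from sequence 1.
    List<FileInfo> data = List.of(new FileInfo("a", 10), new FileInfo("b", 1));
    List<FileInfo> deletes = List.of(new FileInfo("a", 5), new FileInfo("b", 5));

    System.out.println(droppableTableLevel(data, deletes).size());     // prints 0
    System.out.println(droppablePartitionLevel(data, deletes).size()); // prints 1
  }
}
```

With the table-level minimum, the old data file in partition "b" keeps the delete file in partition "a" alive even though no data file in "a" is old enough for it to apply.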

zinking commented 10 months ago

@RussellSpitzer any comments?

manuzhang commented 10 months ago

> I am seeing partitioned v2 tables that retain delete files in partitions even though those delete files no longer apply to any data files within that partition.

This is described in https://iceberg.apache.org/docs/latest/spark-procedures/#rewrite_position_delete_files as the "dangling delete" problem. We can't tell whether a delete file still refers to a live data file unless we compare its contents against the live data file paths, which is what rewrite_position_delete_files does.
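As a rough illustration of that comparison, the sketch below flags a position delete file as dangling when none of the data file paths it references is still live. It is not the procedure's actual implementation: `DanglingDeleteSketch` and `danglingDeleteFiles` are hypothetical names, and the referenced paths are assumed to have already been read out of each delete file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch; this is not the rewrite_position_delete_files implementation. */
class DanglingDeleteSketch {

  /**
   * Returns the delete files that reference no live data file.
   *
   * @param referencedPaths delete file path mapped to the data file paths named by its rows
   * @param liveDataPaths paths of the data files in the current snapshot
   */
  static List<String> danglingDeleteFiles(
      Map<String, Set<String>> referencedPaths, Set<String> liveDataPaths) {
    List<String> dangling = new ArrayList<>();
    for (Map.Entry<String, Set<String>> entry : referencedPaths.entrySet()) {
      // A delete file stays relevant as long as at least one referenced data file is live.
      boolean refersToLiveFile = entry.getValue().stream().anyMatch(liveDataPaths::contains);
      if (!refersToLiveFile) {
        dangling.add(entry.getKey());
      }
    }
    return dangling;
  }

  public static void main(String[] args) {
    // TreeMap keeps the iteration order deterministic for the example.
    Map<String, Set<String>> referenced = new TreeMap<>();
    referenced.put("delete-1.parquet", Set.of("data-old.parquet"));
    referenced.put("delete-2.parquet", Set.of("data-old.parquet", "data-new.parquet"));

    // data-old.parquet was rewritten by compaction, so only data-new.parquet is live.
    Set<String> live = Set.of("data-new.parquet");

    System.out.println(danglingDeleteFiles(referenced, live)); // prints [delete-1.parquet]
  }
}
```

This check requires reading the content of every delete file, which is why it is more expensive than a comparison based on sequence numbers alone.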

zinking commented 10 months ago

@manuzhang That sounds like a different problem. The issue raised here is not specific to position deletes; equality deletes have the same issue. The key point is that delete files within a partition have no effect on other partitions.

manuzhang commented 10 months ago

@zinking I see. In the extreme case, if a single partition is left uncompacted, none of the other partitions can drop their delete files after compaction.

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.