apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Spark 3.5: Adapt DeleteFileIndexBenchmark for DVs #11529

Closed aokolnychyi closed 6 days ago

aokolnychyi commented 1 week ago

This PR adapts our DeleteFileIndexBenchmark for DVs.

Benchmark                                                           (type)  Mode  Cnt            Score         Error   Units
DeleteFileIndexBenchmark.buildIndexAndLookup                     partition    ss   10            0.475 ±       0.031    s/op
DeleteFileIndexBenchmark.buildIndexAndLookup                          file    ss   10            5.381 ±       0.224    s/op
DeleteFileIndexBenchmark.buildIndexAndLookup                            dv    ss   10            3.612 ±       0.201    s/op

Partition-scoped deletes are fastest because the benchmark sets up a table with a small number of deep partitions (50K data files per partition) and only 100 delete files per partition, so the number of delete files to index differs dramatically between the scenarios. We should probably make this benchmark more representative in the future. DVs are faster than file-scoped deletes because they rely on referencedDataFile instead of reconstructing that value from bounds. I'd say the planning performance is acceptable for 2.5M DVs, but we may want to optimize it further.
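To illustrate the difference between reading referencedDataFile and reconstructing it from bounds, here is a minimal, self-contained sketch. It is not the Iceberg implementation: `DeleteMeta`, `targetDataFile`, and `indexByDataFile` are hypothetical names, and the bounds are modeled as plain strings, whereas real column bounds are serialized values that would need decoding first. The assumption is that a delete file targets exactly one data file when the lower and upper bounds of its file path column are equal.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the names below are illustrative and not part of the Iceberg API.
final class DeleteIndexSketch {

  // Simplified stand-in for delete file metadata.
  static final class DeleteMeta {
    final String location;
    final String referencedDataFile; // set for DVs, may be null for older position deletes
    final String pathLowerBound;     // lower bound of the file path column (assumed already decoded)
    final String pathUpperBound;     // upper bound of the file path column (assumed already decoded)

    DeleteMeta(String location, String referencedDataFile, String lower, String upper) {
      this.location = location;
      this.referencedDataFile = referencedDataFile;
      this.pathLowerBound = lower;
      this.pathUpperBound = upper;
    }
  }

  // Returns the single data file a delete file targets, or null if that cannot be determined.
  static String targetDataFile(DeleteMeta delete) {
    // Cheap path: the value is stored explicitly, so no bounds handling is needed.
    if (delete.referencedDataFile != null) {
      return delete.referencedDataFile;
    }
    // Fallback: equal lower and upper bounds imply every row points at one data file.
    if (delete.pathLowerBound != null && delete.pathLowerBound.equals(delete.pathUpperBound)) {
      return delete.pathLowerBound;
    }
    return null; // covers several data files or has no usable bounds
  }

  // Groups delete files by the data file they target, skipping those without a single target.
  static Map<String, List<DeleteMeta>> indexByDataFile(List<DeleteMeta> deletes) {
    Map<String, List<DeleteMeta>> index = new HashMap<>();
    for (DeleteMeta delete : deletes) {
      String target = targetDataFile(delete);
      if (target != null) {
        index.computeIfAbsent(target, key -> new ArrayList<>()).add(delete);
      }
    }
    return index;
  }

  public static void main(String[] args) {
    List<DeleteMeta> deletes = List.of(
        new DeleteMeta("dv-1.puffin", "data-a.parquet", null, null),
        new DeleteMeta("pos-1.parquet", null, "data-b.parquet", "data-b.parquet"));
    indexByDataFile(deletes).forEach(
        (dataFile, matches) -> System.out.println(dataFile + " -> " + matches.size()));
  }
}
```

In this sketch the DV path is a single field read, while the fallback has to inspect and compare two bounds per delete file, which is the kind of extra per-file work that could add up across millions of entries.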

This work is part of #11122.

aokolnychyi commented 6 days ago

Thanks, @jbonofre @nastra!

We may look into refactoring some of the benchmark code, but experience shows it is rarely worth the time.