Open jklim96 opened 2 years ago
Did you use the COW format to test it? You can try MOR format.
The incremental query in cow format is almost consistent with your filter conditions.
We are currently using CoW as some parts of our infrastructure on AWS currently don't allow us to use MoR due to some outdated dependencies.
Would you be able to give some additional information as to why the incremental queries on CoW is analogous to the filter conditions? Also, we're actually seeing worse performance in incremental queries compared to filters on CoW, where filter conditions seem to yield a relatively consistent performance regardless of how many commits we're looking back where incremental queries degrade as we increase the number of commits. Is this expected?
Because the table in Cow format will copy the previous data file every time the data is updated. Although you use incremental queries, it will still read the latest data file and filter the data you need. In principle, I think it is consistent with the filtering of data read from the whole table. Even the performance is not enough, because it can also do some file filtering operations, although it does not mean much to the COW table.
When most partitions and files in your table are updated, file filtering will degenerate. Incremental query and full filtering are almost identical.
For the 10 commit query which the incremental query took longer for, I've manually checked the files and confirmed that ~1600 files have been touched out of a total of ~14000 files. so the incremental query should theoretically be ~9x faster than the filter, yet we're seeing performance of incremental queries worse than the filter.
From the Hudi documentation:
For incremental views, you can expect speed up relative to how much data usually changes in a given time window and how much time your entire scan takes. For e.g, if only 100 files changed in the last hour in a partition of 1000 files, then you can expect a speed of 10x using incremental pull in Hudi compared to full scanning the partition to find out new data.
What's being explained in the documentation isn't quite the behaviour we're seeing.
Perhaps the incremental query degenerates into a full table query. You can check your archive configuration and clean configuration. Determine whether the incremental query record has been cleared. The simplest way is to query the latest commit record and compare it.
Describe the problem you faced
Hi all, I have a question about the performance of incremental queries. I'm comparing the performance between running incremental queries and simply doing a filter on the
_hoodie_commit_time
column. From initial investigations, it looks like for deltas with a small number of commits, incremental queries are more performant but as we increase the number of commits, the incremental queries actually take longer to complete compared to a column filter. Does anyone know why I'd be seeing this behaviour and whether it's expected? My expectation would have been that incremental queries will be more performant than filters in all scenarios as it would be scanning less data.here are some results of the performance of the two:
attaching code snippets for reference:
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Incremental queries are more performant than filter queries
Environment Description
Hudi version : 0.11.1-amzn-0
Spark version : 3.3.0
Hive version : 3.1.3
Hadoop version : 3.2.1
Storage (HDFS/S3/GCS..) : N/A
Running on Docker? (yes/no) : no
Additional context
N/A
Stacktrace
N/A