apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Incremental query performance #7254

Open jklim96 opened 2 years ago

jklim96 commented 2 years ago

Describe the problem you faced

Hi all, I have a question about the performance of incremental queries. I'm comparing the performance of incremental queries against simply filtering on the _hoodie_commit_time column. From initial investigation, it looks like incremental queries are more performant for deltas with a small number of commits, but as the number of commits increases, the incremental queries actually take longer to complete than the column filter. Does anyone know why I'd be seeing this behaviour and whether it's expected? My expectation would have been that incremental queries are more performant than filters in all scenarios, since they should scan less data.

here are some results of the performance of the two:

            1 commit    10 commits    20 commits
Incremental 2'05"       3'12"         4'28"
Filter      2'55"       2'51"         3'03"

attaching code snippets for reference:

incremental queries:
    beginTime = '20220928105015966'

    incremental_read_options = {
        'hoodie.datasource.query.type': 'incremental',
        'hoodie.datasource.read.begin.instanttime': beginTime,
    }

    df = spark.read.format("hudi"). \
        options(**incremental_read_options). \
        load("s3://bucketpath/")

    df.groupby("column_name").count().collect()

filter query:
    df = spark.read.format("hudi"). \
        load("s3://bucketpath/") 

    df.where("_hoodie_commit_time >= '20220928105015966'").groupby("column_name").count().collect()

To Reproduce

Steps to reproduce the behavior:

  1. Create an EMR Serverless Job on AWS
  2. Run the code specified in the section above

Expected behavior

Incremental queries are more performant than filter queries

Environment Description

Additional context

N/A

Stacktrace

N/A

scxwhite commented 2 years ago

Did you test this with the COW table type? You could try the MOR type.

scxwhite commented 2 years ago

On a COW table, an incremental query behaves almost the same as your filter condition.

jklim96 commented 2 years ago

We are currently using CoW, as some parts of our AWS infrastructure don't allow us to use MoR due to outdated dependencies.

Would you be able to give some additional information on why incremental queries on CoW are analogous to filter conditions? Also, we're actually seeing worse performance from incremental queries than from filters on CoW: the filter yields relatively consistent performance regardless of how many commits we look back, while the incremental queries degrade as we increase the number of commits. Is this expected?

scxwhite commented 2 years ago

Because a COW table copies (rewrites) the affected data files every time data is updated, an incremental query still has to read the latest data files and then filter out the records you need. In principle, I think that is equivalent to filtering data read from the whole table. It may not even be faster; at best it can do some file-level filtering, although that does not mean much for a COW table.
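
A minimal sketch of that behaviour, assuming a hypothetical local path and toy schema (not the poster's table): on a COW table, upserting even a single record rewrites the whole base file that contains it, so an incremental read of that commit still scans the full rewritten file and filters it.

    # Minimal COW upsert sketch (hypothetical path/schema) showing that an
    # update rewrites whole base files, which is why incremental reads on
    # COW still scan full Parquet files and then filter by commit time.
    from pyspark.sql import Row

    base_path = "file:///tmp/hudi_cow_demo"
    hudi_opts = {
        'hoodie.table.name': 'cow_demo',
        'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
        'hoodie.datasource.write.operation': 'upsert',
        'hoodie.datasource.write.recordkey.field': 'id',
        'hoodie.datasource.write.precombine.field': 'ts',
        'hoodie.datasource.write.partitionpath.field': 'part',
    }

    # Initial write: creates one base file per file group.
    spark.createDataFrame([Row(id=1, ts=1, part='p0', val='a'),
                           Row(id=2, ts=1, part='p0', val='b')]) \
        .write.format("hudi").options(**hudi_opts).mode("overwrite").save(base_path)

    # Upsert a single record: the whole base file containing id=1 is
    # rewritten as a new file slice, so an incremental read of this commit
    # reads that full file and filters it down to the changed record.
    spark.createDataFrame([Row(id=1, ts=2, part='p0', val='a2')]) \
        .write.format("hudi").options(**hudi_opts).mode("append").save(base_path)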

scxwhite commented 2 years ago

When most partitions and files in your table have been updated, file-level filtering degrades, and an incremental query becomes almost identical to a full-table filter.

jklim96 commented 2 years ago

For the 10-commit window where the incremental query took longer, I've manually checked the files and confirmed that ~1600 files have been touched out of a total of ~14000 files, so the incremental query should theoretically be ~9x faster than the filter. Yet we're seeing the incremental query perform worse than the filter.
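
One rough way to sanity-check that count without listing files by hand is to compare the distinct _hoodie_file_name values for records written after the begin instant against the total. A sketch (full table scan, Hudi meta columns only), assuming the same bucket path and beginTime as in the snippets above:

    # Rough sanity check: how many file groups contain at least one record
    # written after the begin instant, out of all file groups in the table.
    from pyspark.sql import functions as F

    df = spark.read.format("hudi").load("s3://bucketpath/")

    touched = df.where(F.col("_hoodie_commit_time") > beginTime) \
        .select("_hoodie_file_name").distinct().count()
    total = df.select("_hoodie_file_name").distinct().count()
    print(f"touched {touched} of {total} file groups")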

From the Hudi documentation:

For incremental views, you can expect speed up relative to how much data usually changes in a given time window and how much time your entire scan takes. For e.g, if only 100 files changed in the last hour in a partition of 1000 files, then you can expect a speed up of 10x using incremental pull in Hudi compared to full scanning the partition to find out new data.

What's being explained in the documentation isn't quite the behaviour we're seeing.

scxwhite commented 2 years ago

Perhaps the incremental query has degenerated into a full-table query. Check your archival and cleaning configuration to determine whether the commits the incremental query needs have already been cleaned up or archived. The simplest check is to query the latest commit record and compare it with the begin instant you're passing.
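
A sketch of one way to do that check, assuming the same bucket path and beginTime as above and using Spark's internal JVM gateway to list the active timeline (completed .commit files under .hoodie). Retention is governed by settings such as hoodie.cleaner.commits.retained, hoodie.keep.min.commits and hoodie.keep.max.commits; if the begin instant has already fallen off the active timeline, the incremental read can behave very differently from what you expect.

    # List completed commit instants on the active timeline and compare
    # them with the begin instant used by the incremental query.
    hadoop_conf = spark._jsc.hadoopConfiguration()
    Path = spark._jvm.org.apache.hadoop.fs.Path
    timeline_path = Path("s3://bucketpath/.hoodie")
    fs = timeline_path.getFileSystem(hadoop_conf)

    instants = sorted(
        status.getPath().getName().split(".")[0]
        for status in fs.listStatus(timeline_path)
        if status.getPath().getName().endswith(".commit")
    )

    print("earliest active commit:", instants[0])
    print("latest commit:", instants[-1])
    print("begin instant still covered by active timeline:", instants[0] <= beginTime)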