Open huyuanfeng2018 opened 1 year ago
cc @danny0405
Yeah, we can do that if we are sure the bloom filter is not needed, but this is also risky because you have no idea whether the table could be updated in the future.
You can supply a PR though, let's see how much gain we can get for the write throughput.
Yes, I recently tested the performance of massive real-time writes using Iceberg and Hudi. The logic of the two seems basically the same in append mode, but Hudi seems to have much poorer throughput, so I want to see what causes it. I'm making some comparisons. I think I can try turning off the bloom filter write first to see how much improvement there is. Perhaps you can give me some hints about what could cause a significant difference between the two. Thank you.
Hudi does a lot of additional stuff when compared to Iceberg. E.g. metadata table maintenance itself in my case takes the same time as the read, transform and write part (around 3m in CoW for inline metadata table maintenance). Have you tried to disable the metadata table or use the async metadata table service? In my case microbatch processing time dropped from 7m to 3m once I enabled it. If you want higher throughput you can also disable cleaning, compaction and clustering and run them in a separate job. Have you tried these things?
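For reference, the suggestion above could be sketched as a set of write options for a Spark datasource write. This is only a sketch under assumptions: the key names are as I recall them from the Hudi configuration reference and should be checked against the version in use, the table name `events` and the helper `with_throughput_options` are hypothetical, and turning the services off here assumes a separate job runs them.

```python
# Options that switch off inline table services for the writer job.
# Key names are recalled from the Hudi configuration reference; verify
# them for your Hudi version before relying on them.
THROUGHPUT_OPTIONS = {
    "hoodie.metadata.enable": "false",    # skip metadata table maintenance
    "hoodie.clean.automatic": "false",    # run cleaning in a separate job
    "hoodie.compact.inline": "false",     # run compaction in a separate job
    "hoodie.clustering.inline": "false",  # run clustering in a separate job
}


def with_throughput_options(base_options):
    """Hypothetical helper: return a copy of base_options with the
    throughput-oriented overrides applied on top."""
    merged = dict(base_options)
    merged.update(THROUGHPUT_OPTIONS)
    return merged


# "events" is a made-up table name used only for illustration.
opts = with_throughput_options({
    "hoodie.table.name": "events",
    "hoodie.datasource.write.operation": "insert",
})
```

With PySpark, a dict like `opts` would typically be passed as `df.write.format("hudi").options(**opts).mode("append").save(path)`. For a Flink writer the option names differ, so this does not apply as-is.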
I ran the write with the bloom filter and without it, during the peak period of our business on two separate days starting at 21:00. I believe this reflects the peak consumption rate of Hudi, and the results are as follows:
So, I think the bloom filter may have a certain impact on write throughput, and if it is turned off, there may be a considerable benefit @danny0405
a certain impact on write throughput
I'm confused why turning off the BF increased the write throughput.
I think that when writing, a BF structure is built and written at the same time, which increases the writing time.
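To illustrate the point about extra work at write time, here is a minimal sketch of a writer loop that optionally builds a bloom filter over record keys. It is not Hudi's implementation: `SimpleBloomFilter`, `write_records`, the filter size and the hashing scheme are all assumptions made for illustration. It only shows that each record costs several extra hash computations when the filter is enabled.

```python
import hashlib


class SimpleBloomFilter:
    """Toy bloom filter; sizes and hashing are illustrative assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def write_records(record_keys, build_bloom_filter=True):
    """Hypothetical writer loop: count rows 'written' and, if enabled,
    add every record key to a bloom filter as a side effect."""
    bloom = SimpleBloomFilter() if build_bloom_filter else None
    written = 0
    for key in record_keys:
        written += 1  # stand-in for appending the row to the data file
        if bloom is not None:
            bloom.add(key)  # extra per-record hashing when BF is enabled
    return written, bloom
```

In an append-only table that is never looked up by key, the filter built in the loop above is never consulted, which is the case where skipping it could pay off.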
Then why does turning off the BF increase the performance?
I think we may be talking about different kinds of write performance. What I mean is the performance of the overall process of data entering the lake, not just the performance of writing to the parquet file. What I turn off is the step that writes the BF data structure after the parquet data is written, and without it the overall performance improves. That is separate from the performance of writing to parquet itself; the two are theoretically unrelated.
I'm just confused by your screenshot because from the picture the performance with BF enabled seems better.
Sorry, I got them backwards 😓. The real result is that the overall write performance is better after I remove the BF; I swapped the two pictures.
Okay, is that the ballpark number for the performance gain from disabling the BF?
In our scenario, probably yes
When writing in insert mode, Hudi also writes a bloom filter based on the record key at the same time. I think an option could be provided to turn this off to increase write throughput.
I did not find a corresponding setting in the 0.13 branch; it appears to be enabled by default.