apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Is this the expected number of S3 calls? #9612

Open HEPBO3AH opened 1 year ago

HEPBO3AH commented 1 year ago

Hi, we are using Hudi on AWS. We have noticed the following unexpected behavior.

A SELECT * FROM table creates a significant number of S3 calls:

+---------------------------------------------------------------------------------------------------------------------------------+----------+---+
|path                                                                                                                             |httpMethod|cnt|
+---------------------------------------------------------------------------------------------------------------------------------+----------+---+
|my_table/.hoodie                                                                                                                 |HEAD      |5  |
|my_table/.hoodie/                                                                                                                |HEAD      |5  |
|my_table/.hoodie/.aux/.bootstrap/.partitions/00000000-0000-0000-0000-000000000000-0_1-0-1_00000000000001.hfile                   |HEAD      |5  |
|my_table/.hoodie/.aux/.bootstrap/.partitions/00000000-0000-0000-0000-000000000000-0_1-0-1_00000000000001.hfile/                  |HEAD      |5  |
|my_table/.hoodie/20221124035739002.replacecommit                                                                                 |GET       |5  |
|my_table/.hoodie/20221127222955674.replacecommit                                                                                 |GET       |5  |
|my_table/.hoodie/20221128000946056.replacecommit                                                                                 |GET       |5  |
|my_table/.hoodie/20230203015652867.replacecommit                                                                                 |GET       |5  |
|my_table/.hoodie/20230203034909027.replacecommit                                                                                 |GET       |5  |
|my_table/.hoodie/20230323023115954.replacecommit                                                                                 |GET       |5  |
|my_table/.hoodie/20230323024631265.replacecommit                                                                                 |GET       |5  |
|my_table/.hoodie/20230323041457900.replacecommit                                                                                 |GET       |5  |
|my_table/.hoodie/20230627223911673.replacecommit                                                                                 |GET       |5  |
|my_table/.hoodie/20230706040420663.replacecommit                                                                                 |GET       |5  |
|my_table/.hoodie/20230821012127985.replacecommit                                                                                 |GET       |5  |
|my_table/.hoodie/20230821013120957.replacecommit                                                                                 |GET       |5  |
|my_table/.hoodie/20230823042339397.replacecommit                                                                                 |GET       |5  |
|my_table/.hoodie/hoodie.properties                                                                                               |GET       |5  |
|my_table/site_id%253D21/42d99963-db7f-400f-9e33-d539c74672aa-0_0-79-6549_20230323023115954.parquet                               |GET       |3  |
|my_table/site_id%253D22/431ca0d1-8af3-4a72-bd17-31f2cd7e97e9-0_0-39-5633_20230323023017903.parquet                               |GET       |3  |
|my_table/site_id%253D23/36675fed-d8ab-4532-aedc-dddf0a32accb-0_1-80-6551_20230323024631265.parquet                               |GET       |3  |
|my_table/site_id%253D24/20efa6e4-489c-4b1a-a474-5dc1731485ed-0_0-80-6551_20230323041457900.parquet                               |GET       |3  |
|my_table/site_id%253D30/15bfe605-bced-4d17-b571-62ebe64c5e97-0_0-80-6552_20230823042339397.parquet                               |GET       |3  |
|my_table/site_id%253D30/27d907a2-7485-450d-9cdd-9f9c7e95fe88-0_0-39-5633_20230823044858848.parquet                               |GET       |3  |
|my_table/site_id%253D21/.hoodie_partition_metadata                                                                               |HEAD      |1  |
|my_table/site_id%253D21/.hoodie_partition_metadata                                                                               |GET       |1  |
|my_table/site_id%253D21/42d99963-db7f-400f-9e33-d539c74672aa-0_0-79-6549_20230323023115954.parquet                               |HEAD      |1  |
|my_table/site_id%253D22/.hoodie_partition_metadata                                                                               |HEAD      |1  |
|my_table/site_id%253D22/.hoodie_partition_metadata                                                                               |GET       |1  |
|my_table/site_id%253D22/431ca0d1-8af3-4a72-bd17-31f2cd7e97e9-0_0-39-5633_20230323023017903.parquet                               |HEAD      |1  |
|my_table/site_id%253D23/.hoodie_partition_metadata                                                                               |HEAD      |1  |
|my_table/site_id%253D23/.hoodie_partition_metadata                                                                               |GET       |1  |
|my_table/site_id%253D23/36675fed-d8ab-4532-aedc-dddf0a32accb-0_1-80-6551_20230323024631265.parquet                               |HEAD      |1  |
|my_table/site_id%253D24/.hoodie_partition_metadata                                                                               |HEAD      |1  |
|my_table/site_id%253D24/.hoodie_partition_metadata                                                                               |GET       |1  |
|my_table/site_id%253D24/20efa6e4-489c-4b1a-a474-5dc1731485ed-0_0-80-6551_20230323041457900.parquet                               |HEAD      |1  |
|my_table/site_id%253D30/.hoodie_partition_metadata                                                                               |HEAD      |1  |
|my_table/site_id%253D30/.hoodie_partition_metadata                                                                               |GET       |1  |
|my_table/site_id%253D30/15bfe605-bced-4d17-b571-62ebe64c5e97-0_0-80-6552_20230823042339397.parquet                               |HEAD      |1  |
|my_table/site_id%253D30/27d907a2-7485-450d-9cdd-9f9c7e95fe88-0_0-39-5633_20230823044858848.parquet                               |HEAD      |1  |
+---------------------------------------------------------------------------------------------------------------------------------+----------+---+

Why are there so many HEAD calls? Why are there multiple GET calls per object?

I'm creating this ticket because we see a significant number of S3 calls across our Hudi tables, which seems out of proportion to the number of queries we run. The calls are starting to have non-negligible cost implications and have even caused S3 throttling, which impacted our Hudi job runs.
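As a rough tally of the table above, the rows can be grouped by area and summed. This is only a sketch: the three area labels are hypothetical groupings chosen for illustration, and the row counts are transcribed by hand from the table.

```python
from collections import Counter

# (area, HTTP method, requests per row, number of rows), transcribed from the table.
# The area labels are illustrative groupings, not Hudi terminology.
ROWS = [
    (".hoodie", "HEAD", 5, 2),             # my_table/.hoodie and my_table/.hoodie/
    (".hoodie", "HEAD", 5, 2),             # .aux/.bootstrap hfile path, with and without "/"
    (".hoodie", "GET", 5, 13),             # *.replacecommit timeline files
    (".hoodie", "GET", 5, 1),              # hoodie.properties
    ("parquet data files", "GET", 3, 6),
    ("parquet data files", "HEAD", 1, 6),
    (".hoodie_partition_metadata", "HEAD", 1, 5),
    (".hoodie_partition_metadata", "GET", 1, 5),
]


def tally(rows):
    """Sum request counts per (area, method)."""
    totals = Counter()
    for area, method, per_row, n_rows in rows:
        totals[(area, method)] += per_row * n_rows
    return totals


totals = tally(ROWS)
total_requests = sum(totals.values())
hoodie_requests = sum(n for (area, _), n in totals.items() if area == ".hoodie")

for (area, method), n in sorted(totals.items()):
    print(f"{area:28s} {method:5s} {n}")
print(f".hoodie share: {hoodie_requests}/{total_requests}")
```

For the example table this gives 90 of 124 requests (about 73%) under .hoodie, compared with 24 against the parquet data files themselves.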

codope commented 1 year ago

@HEPBO3AH Thanks for raising the issue. May I know which Hudi version you're using? Also, can you confirm whether the multiple GET requests for the same object are due to different byte ranges? Each byte-range request counts as a separate GET request. I am assuming, based on the objects listed above, that the metadata table is disabled. Did you also try with the metadata table enabled?
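One way to check the byte-range question is to look at the HTTP status of each GET: a ranged read is answered with 206 (Partial Content), a full read with 200. The sketch below assumes the request log has already been parsed into dictionaries. The field names and the sample records are hypothetical and are not taken from this issue.

```python
from collections import defaultdict

# Hypothetical, hand-written records for illustration only (not real log data).
SAMPLE_REQUESTS = [
    {"key": "example/data.parquet", "method": "GET", "status": 206},
    {"key": "example/data.parquet", "method": "GET", "status": 206},
    {"key": "example/data.parquet", "method": "GET", "status": 206},
    {"key": "example/.hoodie/hoodie.properties", "method": "GET", "status": 200},
    {"key": "example/.hoodie/hoodie.properties", "method": "GET", "status": 200},
    {"key": "example/.hoodie", "method": "HEAD", "status": 404},
]


def classify_repeated_gets(requests):
    """For keys fetched by GET more than once, report how they were read.

    Returns {key: "ranged" | "full" | "mixed"}, where "ranged" means every GET
    returned 206 and "full" means none did.
    """
    statuses_by_key = defaultdict(list)
    for req in requests:
        if req["method"] == "GET":
            statuses_by_key[req["key"]].append(req["status"])

    result = {}
    for key, statuses in statuses_by_key.items():
        if len(statuses) < 2:
            continue  # only repeated GETs are of interest here
        partial = [status == 206 for status in statuses]
        if all(partial):
            result[key] = "ranged"
        elif not any(partial):
            result[key] = "full"
        else:
            result[key] = "mixed"
    return result


print(classify_repeated_gets(SAMPLE_REQUESTS))
```

Keys reported as "ranged" are consistent with a reader fetching different parts of the same file, while keys reported as "full" were downloaded in their entirety more than once.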

HEPBO3AH commented 1 year ago

> May I know which Hudi version you're using?

We are on version 0.11.

> Also, can you confirm whether multiple GET requests for the same object are due to different byte ranges? Each byte range request counts as a separate GET request.

I'll get back to you on the range question, but I can understand that part for the parquet files.
What I'm more interested in is the large number of calls made to files/objects in the /.hoodie/ folder and to the /.hoodie object itself, which are completely unnecessary on an S3-style object store that has no concept of folders. I created the example above to demonstrate the pattern, but in production we had an issue where queries were throttled on S3 calls. The number of HEAD requests made to /.hoodie was in the several hundred thousands, 1000x more than the partition-level .hoodie_partition_metadata calls:

|my_table/.hoodie/      |HEAD      |701832|
|my_table/.hoodie       |HEAD      |60334 |

> I am assuming, based on objects listed above, that the metadata table is disabled. Did you also try with metadata enabled?

The metadata table is enabled. However, we use it in a very limited capacity, mostly for partition discovery.

ad1happy2go commented 1 year ago

@HEPBO3AH Is your table a bootstrapped table? I tried to reproduce this with a simple table but didn't notice such a high number of calls. I do remember an issue with bootstrapped tables, though.

HEPBO3AH commented 1 year ago

Hi @ad1happy2go. My understanding is that bootstrapping is for tables which existed in non-Hudi form and were later converted to Hudi tables. Is that correct?

If so, this table was a Hudi table from the start, with metadata enabled. It shouldn't be impacted by the same issue.

ad1happy2go commented 1 year ago

@HEPBO3AH Thanks for the information. Can you let us know how many partitions your table has? Can you also try a later version of Hudi, i.e. 0.13.1 or 0.12.3?

Also, we made quite a few fixes to reduce the number of S3 calls in the new Hudi version, 0.14, which will be out soon.

HEPBO3AH commented 1 year ago

Hi @ad1happy2go!

Thank you for the suggestion. Ours is a production system, and updating versions has always been tricky because some issues get fixed while new ones are introduced. This makes us reluctant to move to the bleeding edge.

I'd be more than happy to review the changes that helped with this issue if you can link them.

ad1happy2go commented 1 year ago

There are quite a few after 0.11. Examples: https://github.com/apache/hudi/pull/7404 https://github.com/apache/hudi/pull/7436

ad1happy2go commented 12 months ago

@HEPBO3AH Did these commits help?