apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] S3 Buckets reached quota limit when reading from hudi tables #7487

Open AdarshKadameriTR opened 1 year ago

AdarshKadameriTR commented 1 year ago

We have Hudi tables in S3 buckets, each bucket holding 40+ terabytes of data and some holding 80+ TB. Recently our application has been failing with errors when reading from these S3 buckets. When we contacted AWS support, they informed us that we have hit quota limits on several occasions in the last 24 hours and multiple times over the last 7 days. These buckets contain only Hudi tables. How are we reaching the quota limits, and which API quota has to be increased?

AWS Case ID 11532026531.

Environment Description

* EMR Version: emr-6.7.0

Stacktrace

2022-12-14T19:50:30.207+0000 [INFO] [1670994048838prod_correlation_id] [com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor] [GlobalS3Executor]: ReadTimeout File: xxxxxxxxxx/xxxxxxxxxx/5301b299-7abc-4230-8e23-ca7128074103-3_1133-1242-812796_20221214084918037.parquet; Range: [48449360, 48884521] Use default timeout configuration to retry for read timeout {}
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Read timed out
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1216)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1162)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:811)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:779)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(Amazon

/xxxxxxxxxx/5301b299-7abc-4230-8e23-ca7128074103-3_1133-1242-812796_20221214084918037.parquet' for reading

2022-12-14T19:50:26.920+0000 [INFO] [1670994048838prod_correlation_id] [com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor] [GlobalS3Executor]: ReadTimeout File: xxxxxxxxxx/xxxxxxxxxx/a7c86b71-a3f9-43e5-a3ae-ca9eefbd6f78-2_846-952-645550_20221214074208046.parquet; Range: [136089626, 136526893] Use default timeout configuration to retry for read timeout {}
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Read timed out
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1216)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1162)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:811)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:779)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:753)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$

yihua commented 1 year ago

Hi @AdarshKadameriTR Thanks for raising this. Does the read timeout happen in a write job or a query? Could you ask the AWS support to clarify what types of quota limits are reached?

I'm not aware of any hard quota limit on reading or writing files on S3. S3 charges more for a higher number of requests going to a bucket, and it rate-limits (throttles) requests: for example, you can send 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in an S3 bucket. Even if the Spark job hits throttling, retries should usually get around it, at the cost of longer job time.

Looping in AWS EMR folks. @umehrot2 @rahil-c have you folks seen such an issue for Hudi tables on S3?

AdarshKadameriTR commented 1 year ago

Hi @yihua, we discussed this with an AWS support person and found 503 errors for 'REST.GET.BUCKET', which refers to LIST requests made against the bucket. They were seeing a very high number of LIST requests, as high as 700 requests per second. But our application code does not make this API call directly; we suspect Hudi calls this API internally.

[Screenshots: S3 ListBucket throttling; CloudWatch graph]

xushiyan commented 1 year ago

Is this still happening? Please share more info, like what the job is doing when this occurs - is it reading or writing? The logs would tell. It's likely due to a lot of small files. Have you run clustering for this table? What do the writer configs look like?

AdarshKadameriTR commented 1 year ago

Hi @xushiyan ,

We are incrementally upserting data into our Hudi table(s) every 5 minutes. We have set CLEANER_POLICY to KEEP_LATEST_BY_HOURS with CLEANER_HOURS_RETAINED = 48. The only operation we execute is upsert, we have a single writer, and compaction runs every hour.

Please share more info, like what the job is doing when this occurs - is it reading or writing? Our application job only performs write operations (upserts), as mentioned above. Per our discussion with AWS, they see the S3 GET API being called up to 700 times per second. From the logs we can see that Hudi internally issues these GET operations on the log files in the table partitions; most likely Hudi compaction is triggering those reads.

Have you run clustering for this table? We have not enabled clustering on the tables.

What do the writer configs look like? Given in the screenshots below.

[Screenshots: Hudi writer configuration]
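Since the screenshot contents are not reproduced here, the following is only a rough sketch of writer options matching the description above (upserts every 5 minutes, KEEP_LATEST_BY_HOURS cleaning with 48 hours retained, hourly compaction). The table type, record key / precombine / partition fields, and the compaction trigger are assumptions for illustration, not the actual config from the screenshots:

```python
# Hypothetical sketch only -- field names and table type are placeholders/assumptions.
hudi_options = {
    "hoodie.table.name": "my_table",                          # placeholder
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",    # assumed from .log files + compaction
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "id",          # placeholder
    "hoodie.datasource.write.precombine.field": "ts",         # placeholder
    "hoodie.datasource.write.partitionpath.field": "dt",      # placeholder
    "hoodie.cleaner.policy": "KEEP_LATEST_BY_HOURS",
    "hoodie.cleaner.hours.retained": "48",
    "hoodie.compact.inline": "true",                          # assumed inline compaction
    "hoodie.compact.inline.max.delta.commits": "12",          # ~hourly with 5-minute commits
}

# df is the incremental batch DataFrame being upserted every 5 minutes.
(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("s3://bucket/table"))
```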

Partition structure: s3://bucket/table/partition/ containing parquet and .log files

Note: We have an open issue about old log files not getting cleaned by the Hudi cleaner: https://github.com/apache/hudi/issues/7600

yihua commented 1 year ago

Hi @AdarshKadameriTR, to fully understand where these S3 requests / API calls come from, you should enable S3 request logging by setting log4j.logger.com.amazonaws.request=DEBUG in a log4j properties file and adding the following Spark configs:

--conf spark.driver.extraJavaOptions="-Dlog4j.configuration=file:/<path>/s3-debug.log4j.properties"   --conf spark.executor.extraJavaOptions="-Dlog4j.configuration=file:/<path>/s3-debug.log4j.properties"
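For reference, a minimal s3-debug.log4j.properties could look roughly like the sketch below (a standard Log4j 1.x console setup; the appender name and pattern are assumptions, and only the last line is the setting mentioned above):

```properties
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# AWS SDK request logging (the setting referenced above)
log4j.logger.com.amazonaws.request=DEBUG
```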

Then in the driver and executor logs you'll see request logs like:

DEBUG request: Sending Request: GET https://<bucket>.s3.us-east-2.amazonaws.com / Parameters: ({"list-type":["2"],"delimiter":["/"],"max-keys":["5000"],"prefix":["table/.hoodie/metadata/.hoodie/"],"fetch-owner":["false"]}Headers: ... 

This helps you understand which prefix / directory triggers the most S3 requests, and then we can dig deeper into why that's happening.
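For example, one quick way to aggregate those DEBUG lines by prefix after downloading a driver or executor log is a small helper like the sketch below (the log file path is a placeholder, and the parsing assumes the request-log format shown above):

```python
import re
from collections import Counter

# Count S3 requests per "prefix" parameter in a downloaded Spark driver/executor log.
prefix_re = re.compile(r'"prefix":\["([^"]*)"\]')

counts = Counter()
with open("stderr.log") as f:  # placeholder: path to the downloaded log file
    for line in f:
        if "Sending Request: GET" in line:
            m = prefix_re.search(line)
            counts[m.group(1) if m else "<no prefix>"] += 1

for prefix, n in counts.most_common(10):
    print(f"{n:8d}  {prefix}")
```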

yihua commented 1 year ago

503 errors mean that the S3 request throttling limit has been hit, causing backlogs or timeouts and making jobs fail more easily.

lucabem commented 1 year ago

Hi @yihua.

I'm facing the same error on EMR on EKS using Hudi 0.12.1. I'm seeing very slow S3 downloads in the first stages (the "check if empty" stage): downloading a 300 MB parquet file takes about an hour.

I have tried your s3-debug.log4j.properties, but it doesn't give me anything except these logs:

ERROR StatusLogger Reconfiguration failed: No configuration found for '18b4aac2' at 'null' in 'null'
ERROR StatusLogger Reconfiguration failed: No configuration found for 'Default' at 'null' in 'null'
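(Those StatusLogger errors are emitted by Log4j 2, which suggests this Spark runtime is configured with Log4j 2 rather than Log4j 1.x, so a Log4j 1 properties file passed via -Dlog4j.configuration would be ignored. A rough Log4j 2 equivalent, passed via -Dlog4j.configurationFile instead, might look like the sketch below; the appender name and pattern are assumptions.)

```properties
status = error
appender.console.type = Console
appender.console.name = console
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

rootLogger.level = info
rootLogger.appenderRef.stdout.ref = console

# Equivalent of log4j.logger.com.amazonaws.request=DEBUG
logger.awsrequest.name = com.amazonaws.request
logger.awsrequest.level = debug
```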

In my case, the EMR job finishes fine, but sometimes it shows me logs like:

23/01/16 10:56:05 INFO GlobalS3Executor: ReadTimeout File: landing/dms/full/my_system/my_table/LOAD00000003.parquet; Range: [62489073, 62491066]
Use default timeout configuration to retry for read timeout com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Read timed out

nsivabalan commented 1 year ago

We made some fixes in 0.12.3 and 0.13.0 to remove unnecessary calls to the file system (https://github.com/apache/hudi/pull/7561). Can you try one of those versions and let us know? It should bring down your S3 calls.
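For example, one possible way to try a newer Hudi build on an EMR 6.7 (Spark 3.2) cluster is to supply the Hudi Spark bundle explicitly. This is only a sketch (EMR also ships its own Hudi jars, so compatibility should be checked first), and the job file name is a placeholder:

```
spark-submit \
  --packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.12.3 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  your_hudi_job.py
```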