AdarshKadameriTR opened this issue 1 year ago
Hi @AdarshKadameriTR Thanks for raising this. Does the read timeout happen in a write job or a query? Could you ask AWS support to clarify what types of quota limits are reached?
I'm not aware of any hard quota limit on reading or writing files on S3. S3 charges more for a higher number of requests going to the buckets. There is rate limiting / throttling on the requests going to S3: for example, you can send 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in an S3 bucket. Even if the Spark job hits throttling, retries usually get around it, at the cost of longer job time.
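If the job is only hitting transient throttling, one knob to try (a sketch, not a definitive fix) is raising the EMRFS retry settings so the S3 client retries longer before failing; the property names below are what I believe EMR exposes via emrfs-site, so please verify them against the EMR docs for your release:
```
# Sketch: pass EMRFS retry settings to the Spark job (property names and values are assumptions to verify)
spark-submit \
  --conf spark.hadoop.fs.s3.maxRetries=30 \
  --conf spark.hadoop.fs.s3.sleepTimeSeconds=10 \
  ... your existing args ...
```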
Looping in AWS EMR folks. @umehrot2 @rahil-c have you folks seen such an issue for Hudi tables on S3?
Hi @yihua we discussed this with an AWS support person and found out that there were 503 errors for ‘REST.GET.BUCKET’, which refers to LIST requests made to list the bucket. They were seeing a very high number of LIST requests being made to the bucket, as high as 700 requests per second. But our application code is not making this API call; we suspect Hudi calls these APIs internally.
Is this still happening? Please share more info, like what the job is doing when this occurs - is it reading or writing? The logs would tell. It's likely due to a lot of small files. Have you run clustering for this table? What do the writer configs look like?
Hi @xushiyan,
We are incrementally upserting data into our Hudi table(s) every 5 minutes. We have set CLEANER_POLICY to KEEP_LATEST_BY_HOURS with CLEANER_HOURS_RETAINED = 48. The only operation we execute is upsert, we have a single writer, and compaction runs every hour.
pls share more info like what the job is doing when this occurs - is it reading or writing?: Our application job only does write operations using upserts, as mentioned above. Per our discussion with AWS, they see S3 GET API calls up to 700 times per second. From the logs we can see that Hudi is internally issuing these GET operations on the log files in the table partitions. Most likely Hudi compaction is issuing those read operations.
have you run clustering for this table? We have not enabled clustering on the tables.
what do the writer configs look like? Given in the screenshots below; a text sketch follows after the note.
Partition structure: s3://bucket/table/partition/ containing parquet and .log files
Note: We have an open issue on old log files not getting cleaned by the Hudi cleaner: https://github.com/apache/hudi/issues/7600
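In text form, the writer configs are roughly the following (a sketch; the exact values are in the screenshots, and the compaction trigger below is an assumption based on the hourly schedule and 5-minute commits):
```
hoodie.datasource.write.operation=upsert
hoodie.datasource.write.table.type=MERGE_ON_READ
hoodie.cleaner.policy=KEEP_LATEST_BY_HOURS
hoodie.cleaner.hours.retained=48
# Inline compaction roughly every hour: with 5-minute commits, ~12 delta commits per hour (assumed trigger value)
hoodie.compact.inline=true
hoodie.compact.inline.max.delta.commits=12
```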
Hi @AdarshKadameriTR to fully understand where these S3 requests / API calls come from, you should enable S3 request logging by setting log4j.logger.com.amazonaws.request=DEBUG
in a log4j properties file and adding the following Spark configs:
--conf spark.driver.extraJavaOptions="-Dlog4j.configuration=file:/<path>/s3-debug.log4j.properties" --conf spark.executor.extraJavaOptions="-Dlog4j.configuration=file:/<path>/s3-debug.log4j.properties"
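For reference, a minimal s3-debug.log4j.properties could look like the sketch below (assuming the log4j 1.x format that Spark 3.2 uses; the console appender and layout are just illustrative, keep whatever appenders you already have):
```
# Minimal sketch of s3-debug.log4j.properties (log4j 1.x)
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n

# Log every request the AWS SDK sends to S3
log4j.logger.com.amazonaws.request=DEBUG
```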
Then in the driver and executor logs you'll see request logs like:
DEBUG request: Sending Request: GET https://<bucket>.s3.us-east-2.amazonaws.com / Parameters: ({"list-type":["2"],"delimiter":["/"],"max-keys":["5000"],"prefix":["table/.hoodie/metadata/.hoodie/"],"fetch-owner":["false"]}Headers: ...
This helps you understand which prefix / directory triggers the most S3 requests, and then we can dig deeper into why that's happening.
503 errors mean that the throttling limit of S3 requests has been hit, causing backlog or timeouts and making the jobs prone to failure.
Hi @yihua.
I'm facing the same error on EMR on EKS using Hudi 0.12.1. I'm seeing very slow S3 downloads in the first stages (the check-if-empty stage), taking about 1 hour to download a 300 MB parquet file.
I have tried your s3-debug.log4j.properties but it doesn't give me anything except these logs:
ERROR StatusLogger Reconfiguration failed: No configuration found for '18b4aac2' at 'null' in 'null'
ERROR StatusLogger Reconfiguration failed: No configuration found for 'Default' at 'null' in 'null'
In my case, the EMR job ends fine, but sometimes it shows me logs like:
23/01/16 10:56:05 INFO GlobalS3Executor: ReadTimeout File: landing/dms/full/my_system/my_table/LOAD00000003.parquet; Range: [62489073, 62491066]
Use default timeout configuration to retry for read timeout com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Read timed out
We made some fixes in 0.12.3 and 0.13.0 to remove unnecessary calls to the file system: https://github.com/apache/hudi/pull/7561. Can you try them and let us know? They should bring down your S3 calls.
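One way to try the patched release without touching the jars bundled with EMR is to pull the matching Hudi bundle at submit time; the coordinates below are a sketch assuming Spark 3.2 / Scala 2.12, so adjust them to your Spark version:
```
# Sketch: run the job against Hudi 0.12.3 from Maven Central (coordinates assume Spark 3.2 / Scala 2.12)
spark-submit \
  --packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.12.3 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  ... your existing args ...
```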
We have Hudi tables in S3 buckets; each bucket holds 40+ terabytes of data and some hold 80+ TB. Recently our application has been failing with errors when reading from the S3 buckets. When we contacted AWS support, they informed us that we have reached quota limits on different occasions in the last 24 hours and multiple times over the last 7 days. These buckets contain only Hudi tables. How are we reaching quota limits, and which API quota has to be increased?
AWS Case ID 11532026531.
Environment Description
* EMR version: emr-6.7.0
* Hudi version: 0.11.1
* Spark version: Spark 3.2.1
* Hive version: Hive 3.1.3
* Storage (HDFS/S3/GCS..): S3
* Running on Docker? (yes/no): No
Stacktrace
```
2022-12-14T19:50:30.207+0000 [INFO] [1670994048838prod_correlation_id] [com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor] [GlobalS3Executor]: ReadTimeout File: xxxxxxxxxx/xxxxxxxxxx/5301b299-7abc-4230-8e23-ca7128074103-3_1133-1242-812796_20221214084918037.parquet; Range: [48449360, 48884521]
Use default timeout configuration to retry for read timeout {}
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Read timed out
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1216)
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1162)
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:811)
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:779)
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(Amazon...
.../xxxxxxxxxx/5301b299-7abc-4230-8e23-ca7128074103-3_1133-1242-812796_20221214084918037.parquet' for reading

2022-12-14T19:50:26.920+0000 [INFO] [1670994048838prod_correlation_id] [com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor] [GlobalS3Executor]: ReadTimeout File: xxxxxxxxxx/xxxxxxxxxx/a7c86b71-a3f9-43e5-a3ae-ca9eefbd6f78-2_846-952-645550_20221214074208046.parquet; Range: [136089626, 136526893]
Use default timeout configuration to retry for read timeout {}
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Read timed out
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1216)
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1162)
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:811)
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:779)
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:753)
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$...
```