apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[Bug] SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool #924

Closed www2388258980 closed 1 year ago

www2388258980 commented 1 year ago

Search before asking

Paimon version

paimon0.4

Compute Engine

flink1.16

Minimal reproduce step

Description
/bin/yarn-session.sh --detached \
-Dtaskmanager.memory.process.size=5000m \
-Dtaskmanager.memory.managed.size=0m \
-Dtaskmanager.memory.network.min=80m \
-Dtaskmanager.memory.network.max=80m \
-Dtaskmanager.numberOfTaskSlots=4

Flink on YARN

The file system is S3. We use Paimon to build a layered realtime data warehouse: for example, several ODS Paimon tables are read and written into one wide table with 'merge-engine' = 'partial-update'.
After running for a while (half an hour to over an hour), the error below occurs.
Other jobs (Flink CDC) inserting into s3://xxxxxxx/hadoop/warehouse/ods_medatc_fts.db/src_public_comments/schema/schema-0 keep running normally.
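The layered setup described above can be sketched in Flink SQL roughly as follows. This is a hypothetical example, not code from this issue: the table and column names are invented; only the 'merge-engine' = 'partial-update' option is the one in question.

```sql
-- Hypothetical wide table fed by several ODS tables via partial-update.
CREATE TABLE dwd_wide_table (
    id BIGINT,
    col_from_ods_a STRING,
    col_from_ods_b STRING,
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'merge-engine' = 'partial-update'
);

-- Each upstream job writes only the columns it owns (NULL elsewhere);
-- Paimon merges the partial rows by primary key.
INSERT INTO dwd_wide_table
SELECT id, a, CAST(NULL AS STRING) FROM ods_table_a;
```

With several such jobs reading and writing Paimon tables on S3, each open file and schema lookup consumes a connection from the shared S3A HTTP pool, which is the resource that times out in the stack trace below.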

java.io.UncheckedIOException: java.io.InterruptedIOException: getFileStatus on s3://xxxxxxx/hadoop/warehouse/ods_medatc_fts.db/src_public_comments/schema/schema-0: com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
    at org.apache.paimon.schema.SchemaManager.schema(SchemaManager.java:460)
    at org.apache.paimon.operation.KeyValueFileStoreRead.<init>(KeyValueFileStoreRead.java:88)
    at org.apache.paimon.KeyValueFileStore.newRead(KeyValueFileStore.java:84)
    at org.apache.paimon.table.ChangelogWithKeyFileStoreTable.newRead(ChangelogWithKeyFileStoreTable.java:193)
    at org.apache.paimon.table.source.ReadBuilderImpl.newRead(ReadBuilderImpl.java:81)
    at org.apache.paimon.flink.source.FlinkSource.createReader(FlinkSource.java:50)
    at org.apache.flink.streaming.api.operators.SourceOperator.initReader(SourceOperator.java:286)
    at org.apache.flink.streaming.runtime.tasks.SourceOperatorStreamTask.init(SourceOperatorStreamTask.java:94)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:692)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:669)
    at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:935)
    at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:904)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:728)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.InterruptedIOException: getFileStatus on s3://xxxxxxx/hadoop/warehouse/ods_medatc_fts.db/src_public_comments/schema/schema-0: com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
    at org.apache.hadoop.fs.s3a.S3AUtils.translateInterruptedException(S3AUtils.java:395)
    at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:201)
    at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:175)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3799)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3688)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.extractOrFetchSimpleFileStatus(S3AFileSystem.java:5401)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:1465)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:1441)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:976)
    at org.apache.paimon.s3.HadoopCompliantFileIO.newInputStream(HadoopCompliantFileIO.java:47)
    at org.apache.paimon.fs.PluginFileIO.lambda$newInputStream$0(PluginFileIO.java:47)
    at org.apache.paimon.fs.PluginFileIO.wrap(PluginFileIO.java:104)
    at org.apache.paimon.fs.PluginFileIO.newInputStream(PluginFileIO.java:47)
    at org.apache.paimon.fs.FileIO.readFileUtf8(FileIO.java:173)
    at org.apache.paimon.schema.SchemaManager.schema(SchemaManager.java:458)
    ... 14 more
Caused by: com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1219)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1165)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:814)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5456)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5403)
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1372)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$getObjectMetadata$10(S3AFileSystem.java:2545)
    at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:414)
    at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:377)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:2533)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:2513)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3776)
    ... 25 more
Caused by: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:316)
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:282)
    at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
    at com.amazonaws.http.conn.$Proxy46.get(Unknown Source)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
    at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1346)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1157)

What doesn't meet your expectations?

The job should not fail with this exception; please fix it.

Anything else?

No response

Are you willing to submit a PR?

JingsongLi commented 1 year ago

https://github.com/aws/aws-sdk-java/issues/1405 Too much parallelism may cause this problem

JingsongLi commented 1 year ago

See https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/performance.html

www2388258980 commented 1 year ago

https://repost.aws/zh-Hans/knowledge-center/emr-timeout-connection-wait

www2388258980 commented 1 year ago

aws/aws-sdk-java#1405 Too much parallelism may cause this problem

The job parallelism is 1, but a single TaskManager is running 3 jobs.

JingsongLi commented 1 year ago

We can try fs.s3a.connection.maximum=1000
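For reference, a sketch of how such an option can be set. The exact wiring depends on the deployment: the canonical place for S3A options is Hadoop's core-site.xml, and Flink's flink-conf.yaml can also carry fs.s3a.* keys that are forwarded to the bundled S3 filesystem. The values below are illustrative, not a recommendation for every workload.

```yaml
# flink-conf.yaml (illustrative values; tune to your parallelism and file count)
fs.s3a.connection.maximum: 1000
```

Raising the pool size only masks the symptom if connections are held because many small files are being opened; the retention tuning discussed below attacks the cause.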

www2388258980 commented 1 year ago

Too many small files can exhaust the S3 connection pool; the pool size can be increased via fs.s3a.connection.maximum. References:
[1] https://paimon.apache.org/docs/master/maintenance/expiring-snapshots/
[2] https://www.infoq.cn/article/dytkx8luglcu9a81f58q
[3] https://docs.aws.amazon.com/zh_cn/sdk-for-java/latest/developer-guide/best-practices.html
[4] https://zhuanlan.zhihu.com/p/559718865
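Following the snapshot-expiration reference above, retaining fewer snapshots reduces the number of files Paimon keeps alive, and with it the pressure on the S3 connection pool. A hedged sketch using Paimon's documented snapshot retention table options; the table name is taken from the path in the stack trace and the values are illustrative only:

```sql
-- Expire snapshots sooner so fewer data/manifest files stay reachable.
ALTER TABLE ods_medatc_fts.src_public_comments SET (
    'snapshot.time-retained' = '1 h',
    'snapshot.num-retained.min' = '10',
    'snapshot.num-retained.max' = '50'
);
```

Shorter retention trades away time-travel range for fewer retained files, so pick values that still cover your recovery and downstream-consumption windows.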

JingsongLi commented 1 year ago

https://github.com/apache/incubator-paimon/pull/1037 also fixed this