StarRocks / starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
https://starrocks.io
Apache License 2.0

FE Connection Timeout Error using starrocks-spark-connector-3.4_2.12:1.1.2 stream load #50948

Open alanssec opened 2 months ago

alanssec commented 2 months ago

Description:

We are experiencing connection timeout issues when using the starrocks-spark-connector-3.4_2.12:1.1.2 while attempting to load data from Spark DataFrames into StarRocks. The Spark session reads data in batches from a RabbitMQ queue and tries to write the data into StarRocks using the connector's stream load functionality as described in the StarRocks documentation. The issue arises after several hours or sometimes days of operation, causing the connection to StarRocks frontend services to fail.

Steps to reproduce the behavior (Required)

We are using a Spark session that reads data in batches from a RabbitMQ queue and tries to write the data into StarRocks using the StarRocks Spark connector stream load functionality.
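As a sketch, the write path can be shown roughly as below. This is a hypothetical illustration, not our exact job: the helper `starrocks_writer_options` and its config layout are made up for the example, and the RabbitMQ reading side is omitted; only the connector option names come from the real configuration further down.

```python
# Hypothetical sketch of the write path (names are illustrative, not the real job).
# The option keys mirror the "starrocks" config block shown below in this report.
def starrocks_writer_options(cfg: dict) -> dict:
    """Flatten a config block into the option map passed to df.write.options()."""
    required = (
        "starrocks.fe.http.url",
        "starrocks.fe.jdbc.url",
        "starrocks.table.identifier",
        "starrocks.user",
        "starrocks.password",
    )
    missing = [k for k in required if k not in cfg]
    if missing:
        raise ValueError(f"missing connector options: {missing}")
    # Spark options are strings; stringify numeric settings like retry counts.
    return {k: str(v) for k, v in cfg.items()}

# Inside the batch loop, df being a DataFrame built from one RabbitMQ batch:
#   (df.write.format("starrocks")
#        .options(**starrocks_writer_options(cfg))
#        .mode("append")
#        .save())
```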

Expected behavior (Required)

Data should be loaded into StarRocks without connection timeout issues when using the starrocks-spark-connector.

Real behavior (Required)

Connection to StarRocks frontend services fails after prolonged periods of operation, causing data loading to halt.

Connector Configuration:

"starrocks": {
    "starrocks.fe.http.url": "kube-starrocks-fe-service.starrocks.svc.cluster.local:8030",
    "starrocks.fe.jdbc.url": "jdbc:mysql://kube-starrocks-fe-service.starrocks.svc.cluster.local:9030",
    "starrocks.table.identifier": "db_stage.test_table_name",
    "starrocks.user": "*****",
    "starrocks.password": "*****",
    "starrocks.write.properties.format": "json",
    "starrocks.column.types": "column_name STRING",
    "starrocks.write.max.retries": 5,
    "starrocks.write.retry.interval.ms": 60000
}
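For context, these retry settings bound how long a transient FE outage can be absorbed: up to 5 retries spaced 60 s apart gives a worst-case retry window of about 5 minutes (a rough calculation, assuming retries are evenly spaced and each attempt fails fast).

```python
# Rough worst-case retry window implied by the settings above.
max_retries = 5          # starrocks.write.max.retries
retry_interval_ms = 60_000  # starrocks.write.retry.interval.ms
window_s = max_retries * retry_interval_ms / 1000
print(window_s)  # 300.0 seconds, i.e. about 5 minutes
```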

Error logs:

[2024-09-07, 22:10:47 EDT] {spark_submit.py:536} INFO - at java.base/java.lang.Thread.run(Thread.java:840)
[2024-09-07, 22:10:47 EDT] {spark_submit.py:536} INFO - Caused by: org.apache.http.conn.ConnectTimeoutException: Connect to kube-starrocks-fe-0.kube-starrocks-fe-search.starrocks.svc.cluster.local:8030 [kube-starrocks-fe-0.kube-starrocks-fe-search.starrocks.svc.cluster.local/192.168.23.7] failed: Connection timed out
[2024-09-07, 22:10:47 EDT] {spark_submit.py:536} INFO - at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:156)
[2024-09-07, 22:10:47 EDT] {spark_submit.py:536} INFO - at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
[2024-09-07, 22:10:47 EDT] {spark_submit.py:536} INFO - at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
[2024-09-07, 22:10:47 EDT] {spark_submit.py:536} INFO - at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
[2024-09-07, 22:10:47 EDT] {spark_submit.py:536} INFO - at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
[2024-09-07, 22:10:47 EDT] {spark_submit.py:536} INFO - at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
[2024-09-07, 22:10:47 EDT] {spark_submit.py:536} INFO - at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
[2024-09-07, 22:10:47 EDT] {spark_submit.py:536} INFO - at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
[2024-09-07, 22:10:47 EDT] {spark_submit.py:536} INFO - at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
[2024-09-07, 22:10:47 EDT] {spark_submit.py:536} INFO - at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108)
[2024-09-07, 22:10:47 EDT] {spark_submit.py:536} INFO - at com.starrocks.data.load.stream.StreamLoadUtils.isStarRocksSupportTransactionLoad(StreamLoadUtils.java:92)
[2024-09-07, 22:10:47 EDT] {spark_submit.py:536} INFO - ... 77 more
[2024-09-07, 22:10:47 EDT] {spark_submit.py:536} INFO - Caused by: java.net.ConnectException: Connection timed out
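The failing frame is the connector's transaction-load capability probe against the FE HTTP port. A standalone reachability check, run from the same pod as the Spark driver, can help separate network/DNS problems from connector problems. This is a hypothetical diagnostic using a plain TCP connect with an explicit timeout, not the connector's own probe (which issues HTTP requests):

```python
import socket

def probe_fe(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """TCP reachability check for the FE HTTP port.

    Hypothetical diagnostic only: a successful TCP connect does not prove the
    FE HTTP API is healthy, but a timeout here reproduces the same symptom as
    the ConnectTimeoutException in the stack trace above.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

Running this in a loop while the job is healthy and comparing timestamps with the first connector failure can show whether the FE port stops accepting connections before the connector starts failing.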

In fe.log, the error is:

ERROR (UNKNOWN kube-starrocks-fe-0.kube-starrocks-fe-search.starrocks.svc.cluster.local_9010_1713306784900(-1)|1)

Alternate Configuration Attempt Using FE Proxy:

We also tried using the following configuration through a proxy service, but the issue persists. The error here is different, resulting in a gateway error (504):

"starrocks": {
    "starrocks.fe.http.url": "http://kube-starrocks-fe-proxy-service.starrocks.svc.cluster.local:8080",
    "starrocks.fe.jdbc.url": "jdbc:mysql://kube-starrocks-fe-service.starrocks.svc.cluster.local:9030",
    "starrocks.table.identifier": "test_table_name",
    "starrocks.user": "*****",
    "starrocks.password": "*****",
    "starrocks.write.properties.format": "json",
    "starrocks.column.types": "column_name STRING"
}

Additional error log for proxy configuration:

[2024-09-10, 15:55:37 -03] {spark_submit.py:490} INFO - py4j.protocol.Py4JJavaError: An error occurred while calling o11003.save.
[2024-09-10, 15:55:37 -03] {spark_submit.py:490} INFO - : java.lang.RuntimeException: Failed to probe transaction load for http://kube-starrocks-fe-proxy-service.starrocks.svc.cluster.local:8080
...
Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('<' (code 60)): expected a valid value (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
...
<html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx/1.24.0</center>
</body>
</html>
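The JsonParseException here looks like a symptom rather than the cause: nginx gives up waiting for the FE and returns its HTML 504 page, which the connector then tries to parse as JSON, so the leading `<` triggers the parse error. The distinction can be reproduced in isolation with a small sketch (illustrative only; `parse_fe_response` is not a connector API):

```python
import json

def parse_fe_response(body: str) -> dict:
    """Illustrative: distinguish a proxy HTML error page from a real FE JSON reply."""
    stripped = body.lstrip()
    if stripped.startswith("<"):
        # An HTML body means the proxy (or some intermediary) answered,
        # not the FE itself -- e.g. nginx's 504 Gateway Time-out page.
        raise RuntimeError(f"proxy/HTML error page, not FE JSON: {stripped[:60]!r}")
    return json.loads(body)
```

So with the proxy in place the underlying failure is still the FE not answering in time; the proxy just changes how that failure surfaces.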

StarRocks Version (Required)

yandongxiao commented 2 months ago

Thank you for providing logs for each module. Could you also send the complete FE logs? If there are multiple FEs, please send the logs from each one.

alanssec commented 4 weeks ago

Attached are the requested fe.log and fe.gc.log files. Since the full log files are quite large, the attachment contains only the sections relevant to the test and error reproduction.

The full test was conducted in a non-production environment, but the error is the same as in the production environment. In non-production we have two FEs, and the log files were downloaded from both. As our process uses Airflow to connect with StarRocks, we used a test DAG to reproduce the issue. The full log of this DAG is included in the zip. logs.zip

Let me know if you need any additional information or further details.

Best regards, Alan

alanssec commented 1 week ago

Hi @yandongxiao,

I wanted to check if you've had a chance to review the information I provided. Please let me know if you need any additional details.

Thanks!

Best regards, Alan

kevincai commented 1 week ago

@alanssec Not sure if it is related to this #52516 issue; maybe you can wait for our release (v3.2.13/v3.3.6/v3.1.16) with the fix and try to reproduce with that?

alanssec commented 5 hours ago

As a test, we tried pointing directly to the IP of the FE leader, knowing that these IPs are ephemeral and we should always use the DNS name. However, we had trouble keeping the FE stable that way. We're continuing to debug in more detail using tcpdump. Additionally, we found a recurring error and are unsure whether it's related to the sporadic "Connection timed out" error. Here's the log:

[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG RequestAuthCache: Auth cache not set in the context
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG PoolingHttpClientConnectionManager: Connection request: [route: {}->http://kube-starrocks-fe-service.starrocks.svc.cluster.local:8030][total available: 0; route allocated: 0 of 2; total allocated: 0 of 20]
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG PoolingHttpClientConnectionManager: Connection leased: [id: 72][route: {}->http://kube-starrocks-fe-service.starrocks.svc.cluster.local:8030][total available: 0; route allocated: 1 of 2; total allocated: 1 of 20]
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG MainClientExec: Opening connection {}->http://kube-starrocks-fe-service.starrocks.svc.cluster.local:8030
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG DefaultHttpClientConnectionOperator: Connecting to kube-starrocks-fe-service.starrocks.svc.cluster.local/10.109.38.37:8030
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG DefaultHttpClientConnectionOperator: Connection established 192.168.10.128:42888<->10.109.38.37:8030
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG MainClientExec: Executing request POST /api/transaction/begin HTTP/1.1
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG MainClientExec: Proxy auth state: UNCHALLENGED
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG headers: http-outgoing-72 >> POST /api/transaction/begin HTTP/1.1
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG headers: http-outgoing-72 >> Authorization: Basic ************
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG headers: http-outgoing-72 >> Content-Length: 0
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG headers: http-outgoing-72 >> Host: kube-starrocks-fe-service.starrocks.svc.cluster.local:8030
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG headers: http-outgoing-72 >> Connection: Keep-Alive
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG headers: http-outgoing-72 >> User-Agent: Apache-HttpClient/4.5.14 (Java/17.0.9)
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG headers: http-outgoing-72 >> Accept-Encoding: gzip,deflate
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 >> "POST /api/transaction/begin HTTP/1.1[\r][\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 >> "Authorization: Basic ************[\r][\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 >> "Content-Length: 0[\r][\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 >> "Host: kube-starrocks-fe-service.starrocks.svc.cluster.local:8030[\r][\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 >> "Connection: Keep-Alive[\r][\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 >> "User-Agent: Apache-HttpClient/4.5.14 (Java/17.0.9)[\r][\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 >> "Accept-Encoding: gzip,deflate[\r][\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 >> "[\r][\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 << "HTTP/1.1 200 OK[\r][\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 << "content-length: 104[\r][\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 << "content-type: text/html[\r][\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 << "connection: keep-alive[\r][\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 << "[\r][\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 << "{[\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 << "  "Status": "FAILED",[\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 << "  "Message": "class com.starrocks.common.UserException: No database selected."[\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 << "}"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG headers: http-outgoing-72 << HTTP/1.1 200 OK
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG headers: http-outgoing-72 << content-length: 104
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG headers: http-outgoing-72 << content-type: text/html
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG headers: http-outgoing-72 << connection: keep-alive
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG MainClientExec: Connection can be kept alive indefinitely
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG PoolingHttpClientConnectionManager: Connection [id: 72][route: {}->http://kube-starrocks-fe-service.starrocks.svc.cluster.local:8030] can be kept alive indefinitely
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG DefaultManagedHttpClientConnection: http-outgoing-72: set socket timeout to 0

Under these conditions, the process continues making insertions correctly until at some point it fails with the "Connection timed out" error.
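For what it's worth, the wire log above shows the POST to /api/transaction/begin carrying no `db`, `table`, or `label` headers, which is consistent with the "No database selected" reply; whether the connector intends that first call as a deliberate probe is not confirmed here. As a hypothetical standalone comparison, a transaction-begin request with the headers the Stream Load transaction interface documents would be built like this (illustrative helper, placeholder credentials):

```python
import base64
import urllib.request

def begin_txn_request(fe_http: str, db: str, table: str, label: str,
                      user: str, password: str) -> urllib.request.Request:
    """Build a transaction-begin request for manual testing.

    Illustrative only: header names (db/table/label) follow the documented
    Stream Load transaction interface, not the connector's internal code.
    """
    auth = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{fe_http}/api/transaction/begin",
        method="POST",
        headers={
            "Authorization": f"Basic {auth}",
            "db": db,            # absent in the logged request above
            "table": table,      # absent in the logged request above
            "label": label,
            "Expect": "100-continue",
        },
    )
```

Issuing such a request manually (e.g. via `urllib.request.urlopen`) against the FE service and comparing the reply with the logged one could confirm whether the missing headers fully explain the "No database selected" message.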