alanssec opened 2 months ago
Thank you for providing logs for each module. Could you also send the complete FE logs? If there are multiple FEs, please send the logs from each one.
Attached are the requested fe.log and fe.gc.log files. Since the full log files are quite large, the attachment contains only the sections relevant to the test and error reproduction.
The full test was conducted in a non-production environment, but the error remains the same as in production. In non-production we have two FEs, and the log files were downloaded from both. As our process uses Airflow to connect with StarRocks, we used a test DAG to reproduce the issue. The full log of this DAG is included in the zip: logs.zip
Let me know if you need any additional information or further details.
Best regards, Alan
Hi @yandongxiao,
I wanted to check if you've had a chance to review the information I provided. Please let me know if you need any additional details.
Thanks!
Best regards, Alan
@alanssec Not sure if it is related to issue #52516. Maybe you can wait for our upcoming releases (v3.2.13/v3.3.6/v3.1.16), which include the fix, and try to reproduce this?
We tried pointing directly to the IP of the FE leader as a test, knowing that these IPs are ephemeral and we should always use the DNS. However, we encountered issues keeping the FE stable. We're continuing to debug in more detail using tcpdump. Additionally, we found a recurring error and we're unsure if it's related to the sporadic "Connection timed out" error. Here's the log:
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG RequestAuthCache: Auth cache not set in the context
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG PoolingHttpClientConnectionManager: Connection request: [route: {}->http://kube-starrocks-fe-service.starrocks.svc.cluster.local:8030][total available: 0; route allocated: 0 of 2; total allocated: 0 of 20]
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG PoolingHttpClientConnectionManager: Connection leased: [id: 72][route: {}->http://kube-starrocks-fe-service.starrocks.svc.cluster.local:8030][total available: 0; route allocated: 1 of 2; total allocated: 1 of 20]
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG MainClientExec: Opening connection {}->http://kube-starrocks-fe-service.starrocks.svc.cluster.local:8030
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG DefaultHttpClientConnectionOperator: Connecting to kube-starrocks-fe-service.starrocks.svc.cluster.local/10.109.38.37:8030
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG DefaultHttpClientConnectionOperator: Connection established 192.168.10.128:42888<->10.109.38.37:8030
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG MainClientExec: Executing request POST /api/transaction/begin HTTP/1.1
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG MainClientExec: Proxy auth state: UNCHALLENGED
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG headers: http-outgoing-72 >> POST /api/transaction/begin HTTP/1.1
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG headers: http-outgoing-72 >> Authorization: Basic ************
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG headers: http-outgoing-72 >> Content-Length: 0
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG headers: http-outgoing-72 >> Host: kube-starrocks-fe-service.starrocks.svc.cluster.local:8030
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG headers: http-outgoing-72 >> Connection: Keep-Alive
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG headers: http-outgoing-72 >> User-Agent: Apache-HttpClient/4.5.14 (Java/17.0.9)
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG headers: http-outgoing-72 >> Accept-Encoding: gzip,deflate
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 >> "POST /api/transaction/begin HTTP/1.1[\r][\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 >> "Authorization: Basic ************[\r][\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 >> "Content-Length: 0[\r][\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 >> "Host: kube-starrocks-fe-service.starrocks.svc.cluster.local:8030[\r][\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 >> "Connection: Keep-Alive[\r][\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 >> "User-Agent: Apache-HttpClient/4.5.14 (Java/17.0.9)[\r][\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 >> "Accept-Encoding: gzip,deflate[\r][\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 >> "[\r][\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 << "HTTP/1.1 200 OK[\r][\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 << "content-length: 104[\r][\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 << "content-type: text/html[\r][\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 << "connection: keep-alive[\r][\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 << "[\r][\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 << "{[\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 << " "Status": "FAILED",[\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 << " "Message": "class com.starrocks.common.UserException: No database selected."[\n]"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG wire: http-outgoing-72 << "}"
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG headers: http-outgoing-72 << HTTP/1.1 200 OK
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG headers: http-outgoing-72 << content-length: 104
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG headers: http-outgoing-72 << content-type: text/html
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG headers: http-outgoing-72 << connection: keep-alive
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG MainClientExec: Connection can be kept alive indefinitely
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG PoolingHttpClientConnectionManager: Connection [id: 72][route: {}->http://kube-starrocks-fe-service.starrocks.svc.cluster.local:8030] can be kept alive indefinitely
[2024-11-22, 11:58:08 UTC] {spark_submit.py:490} INFO - 24/11/22 11:58:08 DEBUG DefaultManagedHttpClientConnection: http-outgoing-72: set socket timeout to 0
Under these conditions, the process continues making insertions correctly until at some point it fails with the "Connection timed out" error.
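For reference, the `POST /api/transaction/begin` request in the trace above carries no `db` header, which matches the "No database selected" response. A minimal stdlib sketch of how that call can be issued with the database and table supplied as headers, per the StarRocks transaction stream load documentation; the FE URL is taken from the log above, while names like `demo_db` and the helper functions are placeholders of our own:

```python
import base64
import json
import urllib.request

# FE HTTP endpoint as it appears in the debug trace above.
FE_URL = "http://kube-starrocks-fe-service.starrocks.svc.cluster.local:8030"

def build_begin_request(db: str, table: str, label: str,
                        user: str, password: str) -> urllib.request.Request:
    # The transaction interface takes the target database and table as HTTP
    # headers; the failing request in the log sends neither, hence the
    # "No database selected" error from the FE.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{FE_URL}/api/transaction/begin",
        method="POST",
        headers={
            "db": db,
            "table": table,
            "label": label,
            "Authorization": f"Basic {token}",
        },
    )

def begin_transaction(req: urllib.request.Request) -> dict:
    # Issue the request and decode the JSON status body the FE returns.
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())
```

This is only a sketch of the header shape; the connector builds these requests internally, so a missing `db` header there would point at how the connector (or a proxy in front of the FE) is invoked.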
Title:
FE Connection Timeout Error using starrocks-spark-connector-3.4_2.12:1.1.2 stream load
Description:
We are experiencing connection timeout issues when using the
starrocks-spark-connector-3.4_2.12:1.1.2
while attempting to load data from Spark DataFrames into StarRocks. The Spark session reads data in batches from a RabbitMQ queue and writes it into StarRocks using the connector's stream load functionality, as described in the StarRocks documentation. The issue arises after several hours, or sometimes days, of operation, causing connections to the StarRocks frontend services to fail.
Steps to reproduce the behavior (Required)
We are using a Spark session that reads data in batches from a RabbitMQ queue and tries to write the data into StarRocks using the StarRocks Spark connector stream load functionality.
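For context, a sketch of the connector configuration used by such a job. The option keys follow the StarRocks Spark connector documentation; the host names, credentials, and `demo_db.demo_table` are placeholders, not our real values:

```python
# Connector options for writing a DataFrame into StarRocks via stream load.
# Keys are as documented for the StarRocks Spark connector 1.1.x.
starrocks_options = {
    # FE HTTP port (stream load) and FE MySQL-protocol port (metadata).
    "starrocks.fe.http.url": "kube-starrocks-fe-service.starrocks.svc.cluster.local:8030",
    "starrocks.fe.jdbc.url": "jdbc:mysql://kube-starrocks-fe-service.starrocks.svc.cluster.local:9030",
    # Target table as database.table (placeholder names).
    "starrocks.table.identifier": "demo_db.demo_table",
    "starrocks.user": "root",
    "starrocks.password": "",
}

# The batch from RabbitMQ is then written roughly as:
# df.write.format("starrocks").options(**starrocks_options).mode("append").save()
```
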
Expected behavior (Required)
Data should be loaded into StarRocks without connection timeout issues when using the starrocks-spark-connector.
Real behavior (Required)
Connection to StarRocks frontend services fails after prolonged periods of operation, causing data loading to halt.
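As a quick way to narrow down whether the failure is at the network layer (for example, a stale pod IP behind the Kubernetes service DNS) or inside the FE itself, a small stdlib probe can be run from the Spark/Airflow workers when the timeout occurs. The function names here are our own, and the service host name is the one from the logs above:

```python
import socket

def check_fe_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the FE endpoint succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers ConnectionRefusedError and socket.timeout
        return False

def resolve_service(hostname: str) -> list[str]:
    """Resolve a (Kubernetes service) DNS name to its current IPv4 addresses."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return sorted({info[4][0] for info in infos})

# Example (host name from the logs above):
# resolve_service("kube-starrocks-fe-service.starrocks.svc.cluster.local")
# check_fe_reachable("kube-starrocks-fe-service.starrocks.svc.cluster.local", 8030)
```

If the service name still resolves but the connect itself times out, that points at the FE (or the pod network) rather than DNS, which would match what we saw when pinning the FE leader IP directly.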
Connector Configuration:
Error logs:
In fe.log, the error is:
Alternate Configuration Attempt Using FE Proxy:
We also tried using the following configuration through a proxy service, but the issue persists. The error here is different, resulting in a gateway error (504):
Additional error log for proxy configuration:
StarRocks Version (Required)
3.1.11