apache / doris

Apache Doris is an easy-to-use, high performance and unified analytics database.
https://doris.apache.org
Apache License 2.0
12.39k stars 3.22k forks source link

[Bug] GetConnectionTimeoutException when Querying External Table #33380

Closed IanMeta closed 1 week ago

IanMeta commented 6 months ago

Search before asking

Version

Doris Version: 2.1.1 (recently upgraded from 2.0.0) Java Version: 1.8.0_402

What's Wrong?

When querying JDBC external tables in 2.1.1, we reached this error after a while:

org.jkiss.dbeaver.model.sql.DBSQLException: SQL Error [1105] [HY000]: errCode = 2, detailMessage = (xx.xxx.xxx.xxx)[CANCELLED]UdfRuntimeException: Initialize datasource failed: 
CAUSED BY: GetConnectionTimeoutException: wait millis 5002, active 10, maxActive 10, creating 0
    at org.jkiss.dbeaver.model.impl.jdbc.exec.JDBCStatementImpl.executeStatement(JDBCStatementImpl.java:133)
    at org.jkiss.dbeaver.ui.editors.sql.execute.SQLQueryJob.executeStatement(SQLQueryJob.java:583)
    at org.jkiss.dbeaver.ui.editors.sql.execute.SQLQueryJob.lambda$1(SQLQueryJob.java:492)
    at org.jkiss.dbeaver.model.exec.DBExecUtils.tryExecuteRecover(DBExecUtils.java:190)
    at org.jkiss.dbeaver.ui.editors.sql.execute.SQLQueryJob.executeSingleQuery(SQLQueryJob.java:499)
    at org.jkiss.dbeaver.ui.editors.sql.execute.SQLQueryJob.extractData(SQLQueryJob.java:947)
    at org.jkiss.dbeaver.ui.editors.sql.SQLEditor$QueryResultsContainer.readData(SQLEditor.java:4057)
    at org.jkiss.dbeaver.ui.controls.resultset.ResultSetJobDataRead.lambda$0(ResultSetJobDataRead.java:123)
    at org.jkiss.dbeaver.model.exec.DBExecUtils.tryExecuteRecover(DBExecUtils.java:190)
    at org.jkiss.dbeaver.ui.controls.resultset.ResultSetJobDataRead.run(ResultSetJobDataRead.java:121)
    at org.jkiss.dbeaver.ui.controls.resultset.ResultSetViewer$ResultSetDataPumpJob.run(ResultSetViewer.java:5164)
    at org.jkiss.dbeaver.model.runtime.AbstractJob.run(AbstractJob.java:105)
    at org.eclipse.core.internal.jobs.Worker.run(Worker.java:63)
Caused by: java.sql.SQLException: errCode = 2, detailMessage = (xx.xxx.xxx.xxx)[CANCELLED]UdfRuntimeException: Initialize datasource failed: 
CAUSED BY: GetConnectionTimeoutException: wait millis 5002, active 10, maxActive 10, creating 0
    at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:129)
    at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)
    at com.mysql.cj.jdbc.StatementImpl.executeInternal(StatementImpl.java:763)
    at com.mysql.cj.jdbc.StatementImpl.execute(StatementImpl.java:648)
    at org.jkiss.dbeaver.model.impl.jdbc.exec.JDBCStatementImpl.execute(JDBCStatementImpl.java:330)
    at org.jkiss.dbeaver.model.impl.jdbc.exec.JDBCStatementImpl.executeStatement(JDBCStatementImpl.java:131)
    ... 12 more

What You Expected?

successfully querying the external table.

How to Reproduce?

  1. In Doris 2.1.1, create a JDBC connection resource and add a valid external table
  2. Open multiple connections to Doris and query the external table.

Anything Else?

Here are the following things we tried:

  1. Increasing connection_pool_max_size, helped to increase the time before we hit this error again, but even increasing the max pool size to 10000, we still eventually hit this error. We suspect that there is a leak and the connection pool management.
  2. Decreasing connection_pool_max_life_time and jdbc_connection_pool_cache_clear_time_sec also seemed to help increase the number of queries before we hit the error again, but also do not solve the issue ultimately.
  3. Interestingly, when we update connection_pool_max_size to any value (could be smaller than the previously set value), Doris seems to clear all the useless connections from before and we can immediately query the external table again.

We have not tried: Installing Doris 2.1.1 from fresh to reproduce this error, the bug may be a result of unexpected behaviors from the version upgrade from 2.0.0

Are you willing to submit PR?

Code of Conduct

zy-kkk commented 6 months ago

I would like to know the following situations

  1. How many queries for external catalog can you have at most at the same time?
  2. Are you using the alter catalog statement to adjust connection_pool related parameters?
  3. After adjusting connection_pool_max_size to 10000 Is the error reported a new CAUSED BY: GetConnectionTimeoutException: wait millis 5002, active 10000, maxActive 10000, creating 0, or is the error still CAUSED BY: GetConnectionTimeoutException: wait millis 5002, active 10, maxActive 10, creating 0. Pay attention to the difference between 10 and 10000. the difference
  4. Are you in a multi-FE environment? When an error occurs, have you paid attention to whether the connected FE is the master FE?
  5. Have you tried re-creating a catalog and specifying connection_pool_max_size as a larger value, such as 100, in the properties of the created catalog? Then run it for a period of time and observe whether there are the following errors: CAUSED BY: GetConnectionTimeoutException: wait millis 5002, active 100, maxActive 100, creating 0
IanMeta commented 6 months ago

I would like to know the following situations

  1. How many queries for external catalog can you have at most at the same time?
  2. Are you using the alter catalog statement to adjust connection_pool related parameters?
  3. After adjusting connection_pool_max_size to 10000 Is the error reported a new CAUSED BY: GetConnectionTimeoutException: wait millis 5002, active 10000, maxActive 10000, creating 0, or is the error still CAUSED BY: GetConnectionTimeoutException: wait millis 5002, active 10, maxActive 10, creating 0. Pay attention to the difference between 10 and 10000. the difference
  4. Are you in a multi-FE environment? When an error occurs, have you paid attention to whether the connected FE is the master FE?
  5. Have you tried re-creating a catalog and specifying connection_pool_max_size as a larger value, such as 100, in the properties of the created catalog? Then run it for a period of time and observe whether there are the following errors: CAUSED BY: GetConnectionTimeoutException: wait millis 5002, active 100, maxActive 100, creating 0
  1. Majority of queries in our production requirements require joins from external tables. In most cases we have no more than 50 queries at the same time.
  2. Yes. Specifically, we use the statement ALTER RESOURCE abc PROPERTIES("connection_pool_max_size"="1000"). After updating the value to a different one, we were immediately able to query the external table under this resource regardless to the size we set it to.
  3. The former. The new error would be CAUSED BY: GetConnectionTimeoutException: wait millis 5002, active 10000, maxActive 10000.
  4. Yes, currently we have 3 FEs, all followers. We observed the same error regardless if we were connected to the master node. We also see that the IP reported from the error is always from the BE nodes. Furthermore, we hit the same error when we only had 1 FE.
  5. Yes. We observed that dropping and re-adding the resource and table has the same effect as using the alter statements.
zy-kkk commented 6 months ago

I would like to know the following situations

  1. How many queries for external catalog can you have at most at the same time?
  2. Are you using the alter catalog statement to adjust connection_pool related parameters?
  3. After adjusting connection_pool_max_size to 10000 Is the error reported a new CAUSED BY: GetConnectionTimeoutException: wait millis 5002, active 10000, maxActive 10000, creating 0, or is the error still CAUSED BY: GetConnectionTimeoutException: wait millis 5002, active 10, maxActive 10, creating 0. Pay attention to the difference between 10 and 10000. the difference
  4. Are you in a multi-FE environment? When an error occurs, have you paid attention to whether the connected FE is the master FE?
  5. Have you tried re-creating a catalog and specifying connection_pool_max_size as a larger value, such as 100, in the properties of the created catalog? Then run it for a period of time and observe whether there are the following errors: CAUSED BY: GetConnectionTimeoutException: wait millis 5002, active 100, maxActive 100, creating 0
  1. Majority of queries in our production requirements require joins from external tables. In most cases we have no more than 50 queries at the same time.
  2. Yes. Specifically, we use the statement ALTER RESOURCE abc PROPERTIES("connection_pool_max_size"="1000"). After updating the value to a different one, we were immediately able to query the external table under this resource regardless to the size we set it to.
  3. The former. The new error would be CAUSED BY: GetConnectionTimeoutException: wait millis 5002, active 10000, maxActive 10000.
  4. Yes, currently we have 3 FEs, all followers. We observed the same error regardless if we were connected to the master node. We also see that the IP reported from the error is always from the BE nodes. Furthermore, we hit the same error when we only had 1 FE.
  5. Yes. We observed that dropping and re-adding the resource and table has the same effect as using the alter statements.

This seems a bit tricky. I also tested it using the method you mentioned, but I did not reproduce the problem. Is there any way I can contact you so that we can synchronize in time?

IanMeta commented 5 months ago

I would like to know the following situations

  1. How many queries for external catalog can you have at most at the same time?
  2. Are you using the alter catalog statement to adjust connection_pool related parameters?
  3. After adjusting connection_pool_max_size to 10000 Is the error reported a new CAUSED BY: GetConnectionTimeoutException: wait millis 5002, active 10000, maxActive 10000, creating 0, or is the error still CAUSED BY: GetConnectionTimeoutException: wait millis 5002, active 10, maxActive 10, creating 0. Pay attention to the difference between 10 and 10000. the difference
  4. Are you in a multi-FE environment? When an error occurs, have you paid attention to whether the connected FE is the master FE?
  5. Have you tried re-creating a catalog and specifying connection_pool_max_size as a larger value, such as 100, in the properties of the created catalog? Then run it for a period of time and observe whether there are the following errors: CAUSED BY: GetConnectionTimeoutException: wait millis 5002, active 100, maxActive 100, creating 0
  1. Majority of queries in our production requirements require joins from external tables. In most cases we have no more than 50 queries at the same time.
  2. Yes. Specifically, we use the statement ALTER RESOURCE abc PROPERTIES("connection_pool_max_size"="1000"). After updating the value to a different one, we were immediately able to query the external table under this resource regardless to the size we set it to.
  3. The former. The new error would be CAUSED BY: GetConnectionTimeoutException: wait millis 5002, active 10000, maxActive 10000.
  4. Yes, currently we have 3 FEs, all followers. We observed the same error regardless if we were connected to the master node. We also see that the IP reported from the error is always from the BE nodes. Furthermore, we hit the same error when we only had 1 FE.
  5. Yes. We observed that dropping and re-adding the resource and table has the same effect as using the alter statements.

This seems a bit tricky. I also tested it using the method you mentioned, but I did not reproduce the problem. Is there any way I can contact you so that we can synchronize in time?

I was kind-of able to reproduce it using a fresh 2.1.1 by spamming with multiple connections, but I had to use different queries, otherwise, it seems to only use 1 connection. The thing after I hit this error, the state of the "active connection count" seemed to be stuck even when i closed the previous connections.

Anyways, we can discuss on 微信 if you'd like, thanks! id: feer4847

ziyanTOP commented 5 months ago

I've encountered the same problem and can reproduce it through a demo project, please add me on wechat, id: ziyanTOP @zy-kkk

lvchaoin commented 4 months ago

same problem.

paul1989889 commented 3 months ago

same problem.

zy-kkk commented 1 week ago

Version 2.1.6 fixes all known connection issues. Please upgrade and test. This issue has been unanswered for a long time. I will close it first. If there are still problems later, please feel free to open it.

Raven888888 commented 1 week ago

@zy-kkk Does 2.1.6 resolve https://github.com/apache/doris/issues/34168?

zy-kkk commented 1 week ago

@zy-kkk Does 2.1.6 resolve #34168?

Are you experiencing memory leaks due to using Jdbc Catalog? If so, a fix will be released in 2.1.7

Raven888888 commented 1 week ago

@zy-kkk Does 2.1.6 resolve #34168?

Are you experiencing memory leaks due to using Jdbc Catalog? If so, a fix will be released in 2.1.7

Yes, but we are using JDBC external table, not Catalog. I think the underlying mechanism is the same.

Please let me know which commit fixes this memory leak in 2.1.7, thanks!