[Bug][Master] Failed to obtain JDBC Connection; nested exception is com.alibaba.druid.pool.DataSourceClosedException: dataSource already closed

lishiyucn commented 3 years ago

dolphinscheduler1.3.6

[INFO] 2021-05-16 12:26:48.939 org.apache.zookeeper.ClientCnxn:[522] - EventThread shut down for session: 0x1788791943ba030
[INFO] 2021-05-16 12:26:48.940 org.apache.zookeeper.ZooKeeper:[693] - Session: 0x1788791943ba030 closed
[INFO] 2021-05-16 12:26:48.940 org.quartz.core.QuartzScheduler:[666] - Scheduler DolphinScheduler_$_bigdata1620904843731 shutting down.
[INFO] 2021-05-16 12:26:48.940 org.quartz.core.QuartzScheduler:[585] - Scheduler DolphinScheduler_$_bigdata1620904843731 paused.
[INFO] 2021-05-16 12:27:02.260 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[254] - master node : /dolphinscheduler/nodes/master/10.2.12.3:5678 added.
[INFO] 2021-05-16 12:27:02.262 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[215] - worker group node : /dolphinscheduler/nodes/worker/default/10.2.12.3:1234 added.
[INFO] 2021-05-16 12:27:02.263 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[218] - currentNodes : []
[INFO] 2021-05-16 12:27:02.263 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[258] - master node : /dolphinscheduler/nodes/master/10.2.12.3:5678 down.
[INFO] 2021-05-16 12:27:02.319 com.alibaba.druid.pool.DruidDataSource:[1928] - {dataSource-1} closed
[INFO] 2021-05-16 12:27:02.319 org.apache.dolphinscheduler.server.master.zk.ZKMasterClient:[360] - failover process instance id: 42 host:10.2.12.3:5678
[INFO] 2021-05-16 12:27:02.320 org.quartz.core.QuartzScheduler:[740] - Scheduler DolphinScheduler_$_bigdata1620904843731 shutdown complete.
[INFO] 2021-05-16 12:27:02.321 org.apache.dolphinscheduler.service.quartz.QuartzExecutors:[171] - Quartz service stopped, and halt all tasks
[INFO] 2021-05-16 12:27:02.321 org.apache.dolphinscheduler.server.master.MasterServer:[182] - Quartz service stopped
[ERROR] 2021-05-16 12:27:02.321 org.apache.dolphinscheduler.server.master.zk.ZKMasterClient:[167] - MASTER server failover failed.
[ERROR] 2021-05-16 12:27:02.322 org.apache.dolphinscheduler.server.master.zk.ZKMasterClient:[168] - failover exception
org.springframework.transaction.CannotCreateTransactionException: Could not open JDBC Connection for transaction; nested exception is com.alibaba.druid.pool.DataSourceClosedException: dataSource already closed at Sun May 16 12:27:02 CST 2021
        at org.springframework.jdbc.datasource.DataSourceTransactionManager.doBegin(DataSourceTransactionManager.java:304)
        at org.springframework.transaction.support.AbstractPlatformTransactionManager.getTransaction(AbstractPlatformTransactionManager.java:378)
        at org.springframework.transaction.interceptor.TransactionAspectSupport.createTransactionIfNecessary(TransactionAspectSupport.java:476)
        at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:289)
        at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:98)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
        at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:691)
        at org.apache.dolphinscheduler.service.process.ProcessService$$EnhancerBySpringCGLIB$$f6ae31de.processNeedFailoverProcessInstances(<generated>)
        at org.apache.dolphinscheduler.server.master.zk.ZKMasterClient.failoverMaster(ZKMasterClient.java:365)
        at org.apache.dolphinscheduler.server.master.zk.ZKMasterClient.failoverServerWhenDown(ZKMasterClient.java:183)
        at org.apache.dolphinscheduler.server.master.zk.ZKMasterClient.removeZKNodePath(ZKMasterClient.java:164)
        at org.apache.dolphinscheduler.server.master.zk.ZKMasterClient.handleMasterEvent(ZKMasterClient.java:222)
        at org.apache.dolphinscheduler.server.master.zk.ZKMasterClient.dataChanged(ZKMasterClient.java:129)
        at org.apache.dolphinscheduler.service.zk.ZookeeperCachedOperator.lambda$registerListener$0(ZookeeperCachedOperator.java:48)
        at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:760)
        at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:754)
        at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:754)
        at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
        at org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
        at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
        at org.apache.curator.framework.recipes.cache.TreeCache.callListeners(TreeCache.java:753)
        at org.apache.curator.framework.recipes.cache.TreeCache.access$1900(TreeCache.java:75)
        at org.apache.curator.framework.recipes.cache.TreeCache$4.run(TreeCache.java:865)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: com.alibaba.druid.pool.DataSourceClosedException: dataSource already closed at Sun May 16 12:27:02 CST 2021
        at com.alibaba.druid.pool.DruidDataSource.getConnectionInternal(DruidDataSource.java:1429)
        at com.alibaba.druid.pool.DruidDataSource.getConnectionDirect(DruidDataSource.java:1326)
        at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:1306)
        at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:1296)
        at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:109)
        at org.springframework.jdbc.datasource.DataSourceTransactionManager.doBegin(DataSourceTransactionManager.java:261)
        ... 26 common frames omitted
[ERROR] 2021-05-16 12:27:02.324 org.apache.dolphinscheduler.service.zk.AbstractZKClient:[299] - lock release failed
java.lang.IllegalStateException: Expected state [STARTED] was [STOPPED]
        at org.apache.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:823)
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkState(CuratorFrameworkImpl.java:432)
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.delete(CuratorFrameworkImpl.java:452)
        at org.apache.curator.framework.recipes.locks.LockInternals.deleteOurPath(LockInternals.java:347)

github-actions[bot] commented 3 years ago

Hi:

Thank you for your feedback, we have received your issue, Please wait patiently for a reply.
In order for us to understand your request as soon as possible, please provide detailed information、version or pictures.
If you haven't received a reply for a long time, you can subscribe to the developer's email，Mail subscription steps reference https://dolphinscheduler.apache.org/zh-cn/community/development/subscribe.html ,Then write the issue URL in the email content and send question to dev@dolphinscheduler.apache.org.

ruanwenjun commented 3 years ago

@lishiyucn Hi, you'd better format the log, it's hard to find the reason now. Does this exception occur in the master or worker module? Or how can this problem be reproduced? I think the possible reason is that you only start one node, and before the node receives the NODE_REMOVED event to send alert, it has closed the datasource. There are currently two places that will handle the NODE_REMOVE event. You can check if there is print some close in the log before the exception.

lishiyucn commented 3 years ago

@ruanwenjun updated log

log has print some close in the log before the exception: [INFO] 2021-05-16 12:27:02.319 com.alibaba.druid.pool.DruidDataSource:[1928] - {dataSource-1} closed
The environment is single node dolphinscheduler service; How should we solve this problem？

ruanwenjun commented 3 years ago

@lishiyucn This is an existing problem, I will try to send a pr to solve this problem by code. The impact of this issue may be that if you only have one instance and the alarm will send failed if the instance is lost.

lishiyucn commented 3 years ago

@ruanwenjun Can I solve this problem by backing back the dolphinscheduler version ? Now is dolphin1.3.6 ,I need dolphin version greater than dolphin1.3.5 for adapt to flink1.11

ruanwenjun commented 3 years ago

@lishiyucn Sorry, it may not be resolved by fallback the version. But I think this should not be a serious problem, the influence of this issue is limited.

lishiyucn commented 3 years ago

The dolphin master progress is survival

ruanwenjun commented 3 years ago

@lishiyucn I have submitted a pr #5497, you can help me review. And furthermore, you need to find out why the instance crashed. I can help you check, but you need to provide more log information. I guess it may be due to the fluctuation of the zookeeper connection, because currently, the zookeeper connection reconnect will cause instance crashed. #5211

lishiyucn commented 3 years ago

I fixed the code of https://github.com/ruanwenjun/incubator-dolphinscheduler/commit/4290a96bc7785fe7cde092a1c6cad68dbac40748 but still get the error "start process failed:master does not exist"

the master log is : `[INFO] 2021-07-19 13:15:11.779 org.apache.curator.framework.state.ConnectionStateManager:[251] - State change: RECONNECTED [INFO] 2021-07-19 13:15:11.779 org.apache.dolphinscheduler.service.zk.ZookeeperOperator:[85] - reconnected to zookeeper [INFO] 2021-07-19 13:16:12.006 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[225] - worker group node : /dolphinscheduler/nodes/worker/default/10.2.12.3:1234 down. [ERROR] 2021-07-19 13:16:12.010 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[234] - WorkerGroupListener capture data change and get data failed org.mybatis.spring.MyBatisSystemException: nested exception is org.apache.ibatis.exceptions.PersistenceException:

Error updating database. Cause: org.springframework.jdbc.CannotGetJdbcConnectionException: Failed to obtain JDBC Connection; nested exception is com.alibaba.druid.pool.DataSourceClosedException: dataSource already closed at Mon Jul 19 12:44:35 CST 2021

The error may exist in org/apache/dolphinscheduler/dao/mapper/AlertMapper.java (best guess)

The error may involve org.apache.dolphinscheduler.dao.mapper.AlertMapper.insert

The error occurred while executing an update

Cause: org.springframework.jdbc.CannotGetJdbcConnectionException: Failed to obtain JDBC Connection; nested exception is com.alibaba.druid.pool.DataSourceClosedException: dataSource already closed at Mon Jul 19 12:44:35 CST 2021

    at org.mybatis.spring.MyBatisExceptionTranslator.translateExceptionIfPossible(MyBatisExceptionTranslator.java:78)
    at org.mybatis.spring.SqlSessionTemplate$SqlSessionInterceptor.invoke(SqlSessionTemplate.java:440)
    at com.sun.proxy.$Proxy84.insert(Unknown Source)
    at org.mybatis.spring.SqlSessionTemplate.insert(SqlSessionTemplate.java:271)
    at com.baomidou.mybatisplus.core.override.MybatisMapperMethod.execute(MybatisMapperMethod.java:58)
    at com.baomidou.mybatisplus.core.override.MybatisMapperProxy.invoke(MybatisMapperProxy.java:61)
    at com.sun.proxy.$Proxy110.insert(Unknown Source)
    at org.apache.dolphinscheduler.dao.AlertDao.saveTaskTimeoutAlert(AlertDao.java:135)
    at org.apache.dolphinscheduler.dao.AlertDao.sendServerStopedAlert(AlertDao.java:102)
    at org.apache.dolphinscheduler.server.master.registry.ServerNodeManager$WorkerGroupNodeListener.dataChanged(ServerNodeManager.java:229)
    at org.apache.dolphinscheduler.service.zk.AbstractListener.childEvent(AbstractListener.java:41)
    at org.apache.dolphinscheduler.service.zk.ZookeeperCachedOperator.lambda$treeCacheStart$0(ZookeeperCachedOperator.java:70)
    at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:760)
    at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:754)
    at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
    at org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
    at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
    at org.apache.curator.framework.recipes.cache.TreeCache.callListeners(TreeCache.java:753)
    at org.apache.curator.framework.recipes.cache.TreeCache.access$1900(TreeCache.java:75)
    at org.apache.curator.framework.recipes.cache.TreeCache$4.run(TreeCache.java:865)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Caused by: org.apache.ibatis.exceptions.PersistenceException:

Error updating database. Cause: org.springframework.jdbc.CannotGetJdbcConnectionException: Failed to obtain JDBC Connection; nested exception is com.alibaba.druid.pool.DataSourceClosedException: dataSource already closed at Mon Jul 19 12:44:35 CST 2021

The error may exist in org/apache/dolphinscheduler/dao/mapper/AlertMapper.java (best guess)

The error may involve org.apache.dolphinscheduler.dao.mapper.AlertMapper.insert

The error occurred while executing an update

Cause: org.springframework.jdbc.CannotGetJdbcConnectionException: Failed to obtain JDBC Connection; nested exception is com.alibaba.druid.pool.DataSourceClosedException: dataSource already closed at Mon Jul 19 12:44:35 CST 2021

    at org.apache.ibatis.exceptions.ExceptionFactory.wrapException(ExceptionFactory.java:30)
    at org.apache.ibatis.session.defaults.DefaultSqlSession.update(DefaultSqlSession.java:199)
    at org.apache.ibatis.session.defaults.DefaultSqlSession.insert(DefaultSqlSession.java:184)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.mybatis.spring.SqlSessionTemplate$SqlSessionInterceptor.invoke(SqlSessionTemplate.java:426)
    ... 23 common frames omitted

Caused by: org.springframework.jdbc.CannotGetJdbcConnectionException: Failed to obtain JDBC Connection; nested exception is com.alibaba.druid.pool.DataSourceClosedException: dataSource already closed at Mon Jul 19 12:44:35 CST 2021 at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:82) at org.mybatis.spring.transaction.SpringManagedTransaction.openConnection(SpringManagedTransaction.java:80) at org.mybatis.spring.transaction.SpringManagedTransaction.getConnection(SpringManagedTransaction.java:67) at org.apache.ibatis.executor.BaseExecutor.getConnection(BaseExecutor.java:336) at com.baomidou.mybatisplus.core.executor.MybatisSimpleExecutor.prepareStatement(MybatisSimpleExecutor.java:93) at com.baomidou.mybatisplus.core.executor.MybatisSimpleExecutor.doUpdate(MybatisSimpleExecutor.java:53) at org.apache.ibatis.executor.BaseExecutor.update(BaseExecutor.java:117) at org.apache.ibatis.session.defaults.DefaultSqlSession.update(DefaultSqlSession.java:197) ... 29 common frames omitted Caused by: com.alibaba.druid.pool.DataSourceClosedException: dataSource already closed at Mon Jul 19 12:44:35 CST 2021 at com.alibaba.druid.pool.DruidDataSource.getConnectionInternal(DruidDataSource.java:1429) at com.alibaba.druid.pool.DruidDataSource.getConnectionDirect(DruidDataSource.java:1326) at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:1306) at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:1296) at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:109) at org.springframework.jdbc.datasource.DataSourceUtils.fetchConnection(DataSourceUtils.java:158) at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:116) at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:79) ... 36 common frames omitted `

the dolphin worker err is :\

[INFO] 2021-07-19 12:46:32.055 org.apache.zookeeper.ClientCnxn:[1162] - Socket error occurred: bigdata/10.2.12.3:2181: Connection refused [INFO] 2021-07-19 12:46:32.076 org.apache.zookeeper.ClientCnxn:[1025] - Opening socket connection to server bigdata/10.2.12.3:2181. Will not attempt to authenticate using SASL (unknown error) [INFO] 2021-07-19 12:46:32.076 org.apache.zookeeper.ClientCnxn:[1162] - Socket error occurred: bigdata/10.2.12.3:2181: Connection refused [INFO] 2021-07-19 12:46:32.215 org.apache.zookeeper.ClientCnxn:[1025] - Opening socket connection to server bigdata/10.2.12.3:2181. Will not attempt to authenticate using SASL (unknown error) [INFO] 2021-07-19 12:46:32.215 org.apache.zookeeper.ClientCnxn:[1162] - Socket error occurred: bigdata/10.2.12.3:2181: Connection refused [INFO] 2021-07-19 12:46:32.937 org.apache.zookeeper.ClientCnxn:[1025] - Opening socket connection to server bigdata/10.2.12.3:2181. Will not attempt to authenticate using SASL (unknown error) [INFO] 2021-07-19 12:46:32.938 org.apache.zookeeper.ClientCnxn:[1162] - Socket error occurred: bigdata/10.2.12.3:2181: Connection refused [INFO] 2021-07-19 12:46:33.155 org.apache.zookeeper.ClientCnxn:[1025] - Opening socket connection to server bigdata/10.2.12.3:2181. Will not attempt to authenticate using SASL (unknown error) [INFO] 2021-07-19 12:46:33.156 org.apache.zookeeper.ClientCnxn:[1162] - Socket error occurred: bigdata/10.2.12.3:2181: Connection refused [INFO] 2021-07-19 12:46:33.177 org.apache.zookeeper.ClientCnxn:[1025] - Opening socket connection to server bigdata/10.2.12.3:2181. Will not attempt to authenticate using SASL (unknown error) [INFO] 2021-07-19 12:46:33.177 org.apache.zookeeper.ClientCnxn:[1162] - Socket error occurred: bigdata/10.2.12.3:2181: Connection refused [INFO] 2021-07-19 12:46:33.316 org.apache.zookeeper.ClientCnxn:[1025] - Opening socket connection to server bigdata/10.2.12.3:2181. Will not attempt to authenticate using SASL (unknown error) [INFO] 2021-07-19 12:46:33.316 org.apache.zookeeper.ClientCnxn:[1162] - Socket error occurred: bigdata/10.2.12.3:2181: Connection refused [INFO] 2021-07-19 12:46:33.614 org.apache.dolphinscheduler.remote.NettyRemotingClient:[403] - netty client closed [INFO] 2021-07-19 12:46:33.614 org.apache.dolphinscheduler.service.log.LogClientService:[59] - logger client closed

lishiyucn commented 3 years ago

@chengshiwen @ruanwenjun Please have attention,Thanks!``

ruanwenjun commented 3 years ago

@lishiyucn Yes, when you reconnected, the master is still down, you can get some detail in #5210

lishiyucn commented 3 years ago

I get the #5210 git pull https://github.com/apache/dolphinscheduler/pull/5211/files https://github.com/apache/dolphinscheduler/pull/5879/files

If I fix the code, would the problem be solved? @ruanwenjun

CalvinKirs commented 3 years ago

closed. release on 1.3.7

wanggx commented 2 years ago

I also meet the same erroe message , and the datasource is closed when the server is judged dead。 the code close the datasource in DruidConnectionProvider.java , my version is 1.3.6。

apache / dolphinscheduler