Closed lishiyucn closed 3 years ago
Hi:
@lishiyucn Hi, please format the log; as posted, it is hard to find the cause. Does this exception occur in the master or the worker module? And how can the problem be reproduced?
I think the likely cause is that you started only one node, and the node closed its datasource before it received the NODE_REMOVED event and tried to send the alert. There are currently two places that handle the NODE_REMOVED event. Check whether the log prints a close message before the exception.
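The ordering problem described above can be sketched in plain Java. This is only an illustration, not the DolphinScheduler code: `FakeDataSource` and `NodeRemovedHandler` are hypothetical names, and the guard shown (skip the alert when the datasource is already closed) is one possible approach, not necessarily what the actual fix does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the real datasource; it only tracks open/closed state.
class FakeDataSource {
    private final AtomicBoolean closed = new AtomicBoolean(false);
    final List<String> alerts = new ArrayList<>();

    void close() {
        closed.set(true);
    }

    boolean isClosed() {
        return closed.get();
    }

    // Mimics the failure in the stack trace: writing after close throws.
    void insertAlert(String message) {
        if (closed.get()) {
            throw new IllegalStateException("dataSource already closed");
        }
        alerts.add(message);
    }
}

// Hypothetical listener for a NODE_REMOVED event.
class NodeRemovedHandler {
    private final FakeDataSource dataSource;

    NodeRemovedHandler(FakeDataSource dataSource) {
        this.dataSource = dataSource;
    }

    /** Returns true if the alert was stored, false if it was skipped. */
    boolean onNodeRemoved(String path) {
        if (dataSource.isClosed()) {
            System.out.println("skip alert, datasource closed: " + path);
            return false;
        }
        try {
            dataSource.insertAlert("server down: " + path);
            return true;
        } catch (IllegalStateException e) {
            // The datasource was closed between the check and the write.
            return false;
        }
    }
}
```

With this guard, an event that arrives after `close()` is logged and dropped instead of surfacing as an exception in the listener thread.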
@ruanwenjun updated log
@lishiyucn This is a known problem; I will try to send a PR to fix it in code. The impact is that if you have only one instance and that instance is lost, the alert will fail to send.
@ruanwenjun Can I solve this problem by rolling back the DolphinScheduler version? I am on 1.3.6 now, and I need a version greater than 1.3.5 to work with Flink 1.11.
@lishiyucn Sorry, rolling back the version may not resolve it. But I don't think this is a serious problem; its impact is limited.
The DolphinScheduler master process is still alive.
@lishiyucn I have submitted PR #5497; you can help review it. Furthermore, you need to find out why the instance crashed. I can help you check, but you need to provide more log information. My guess is that it is due to fluctuation of the ZooKeeper connection, because currently a ZooKeeper reconnect causes the instance to crash; see #5211.
I applied the fix from https://github.com/ruanwenjun/incubator-dolphinscheduler/commit/4290a96bc7785fe7cde092a1c6cad68dbac40748 but still get the error "start process failed:master does not exist".
The master log is:
`[INFO] 2021-07-19 13:15:11.779 org.apache.curator.framework.state.ConnectionStateManager:[251] - State change: RECONNECTED
[INFO] 2021-07-19 13:15:11.779 org.apache.dolphinscheduler.service.zk.ZookeeperOperator:[85] - reconnected to zookeeper
[INFO] 2021-07-19 13:16:12.006 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[225] - worker group node : /dolphinscheduler/nodes/worker/default/10.2.12.3:1234 down.
[ERROR] 2021-07-19 13:16:12.010 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[234] - WorkerGroupListener capture data change and get data failed
org.mybatis.spring.MyBatisSystemException: nested exception is org.apache.ibatis.exceptions.PersistenceException:
at org.mybatis.spring.MyBatisExceptionTranslator.translateExceptionIfPossible(MyBatisExceptionTranslator.java:78)
at org.mybatis.spring.SqlSessionTemplate$SqlSessionInterceptor.invoke(SqlSessionTemplate.java:440)
at com.sun.proxy.$Proxy84.insert(Unknown Source)
at org.mybatis.spring.SqlSessionTemplate.insert(SqlSessionTemplate.java:271)
at com.baomidou.mybatisplus.core.override.MybatisMapperMethod.execute(MybatisMapperMethod.java:58)
at com.baomidou.mybatisplus.core.override.MybatisMapperProxy.invoke(MybatisMapperProxy.java:61)
at com.sun.proxy.$Proxy110.insert(Unknown Source)
at org.apache.dolphinscheduler.dao.AlertDao.saveTaskTimeoutAlert(AlertDao.java:135)
at org.apache.dolphinscheduler.dao.AlertDao.sendServerStopedAlert(AlertDao.java:102)
at org.apache.dolphinscheduler.server.master.registry.ServerNodeManager$WorkerGroupNodeListener.dataChanged(ServerNodeManager.java:229)
at org.apache.dolphinscheduler.service.zk.AbstractListener.childEvent(AbstractListener.java:41)
at org.apache.dolphinscheduler.service.zk.ZookeeperCachedOperator.lambda$treeCacheStart$0(ZookeeperCachedOperator.java:70)
at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:760)
at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:754)
at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
at org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
at org.apache.curator.framework.recipes.cache.TreeCache.callListeners(TreeCache.java:753)
at org.apache.curator.framework.recipes.cache.TreeCache.access$1900(TreeCache.java:75)
at org.apache.curator.framework.recipes.cache.TreeCache$4.run(TreeCache.java:865)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.ibatis.exceptions.PersistenceException:
at org.apache.ibatis.exceptions.ExceptionFactory.wrapException(ExceptionFactory.java:30)
at org.apache.ibatis.session.defaults.DefaultSqlSession.update(DefaultSqlSession.java:199)
at org.apache.ibatis.session.defaults.DefaultSqlSession.insert(DefaultSqlSession.java:184)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.mybatis.spring.SqlSessionTemplate$SqlSessionInterceptor.invoke(SqlSessionTemplate.java:426)
... 23 common frames omitted
Caused by: org.springframework.jdbc.CannotGetJdbcConnectionException: Failed to obtain JDBC Connection; nested exception is com.alibaba.druid.pool.DataSourceClosedException: dataSource already closed at Mon Jul 19 12:44:35 CST 2021
at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:82)
at org.mybatis.spring.transaction.SpringManagedTransaction.openConnection(SpringManagedTransaction.java:80)
at org.mybatis.spring.transaction.SpringManagedTransaction.getConnection(SpringManagedTransaction.java:67)
at org.apache.ibatis.executor.BaseExecutor.getConnection(BaseExecutor.java:336)
at com.baomidou.mybatisplus.core.executor.MybatisSimpleExecutor.prepareStatement(MybatisSimpleExecutor.java:93)
at com.baomidou.mybatisplus.core.executor.MybatisSimpleExecutor.doUpdate(MybatisSimpleExecutor.java:53)
at org.apache.ibatis.executor.BaseExecutor.update(BaseExecutor.java:117)
at org.apache.ibatis.session.defaults.DefaultSqlSession.update(DefaultSqlSession.java:197)
... 29 common frames omitted
Caused by: com.alibaba.druid.pool.DataSourceClosedException: dataSource already closed at Mon Jul 19 12:44:35 CST 2021
at com.alibaba.druid.pool.DruidDataSource.getConnectionInternal(DruidDataSource.java:1429)
at com.alibaba.druid.pool.DruidDataSource.getConnectionDirect(DruidDataSource.java:1326)
at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:1306)
at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:1296)
at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:109)
at org.springframework.jdbc.datasource.DataSourceUtils.fetchConnection(DataSourceUtils.java:158)
at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:116)
at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:79)
... 36 common frames omitted `
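One detail worth reading off the master log above: the exception says the datasource was closed at 12:44:35, while the failing alert insert was logged at 13:16:12.010. A small sketch of that arithmetic follows, assuming both timestamps are in the same time zone (the log lines do not state one, so this is an assumption); `ClosedGap` is a name made up for the example.

```java
import java.time.Duration;
import java.time.LocalDateTime;

class ClosedGap {
    // Both timestamps are copied from the master log; same time zone is assumed.
    static Duration gap() {
        // "dataSource already closed at Mon Jul 19 12:44:35 CST 2021"
        LocalDateTime closedAt = LocalDateTime.of(2021, 7, 19, 12, 44, 35);
        // "[ERROR] 2021-07-19 13:16:12.010 ... get data failed" (10 ms = 10,000,000 ns)
        LocalDateTime errorAt = LocalDateTime.of(2021, 7, 19, 13, 16, 12, 10_000_000);
        return Duration.between(closedAt, errorAt);
    }
}
```

Under that assumption the datasource had already been closed for roughly half an hour when the worker-down event was handled, which matches the explanation that the close happened before the event.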
The DolphinScheduler worker error log is:
[INFO] 2021-07-19 12:46:32.055 org.apache.zookeeper.ClientCnxn:[1162] - Socket error occurred: bigdata/10.2.12.3:2181: Connection refused
[INFO] 2021-07-19 12:46:32.076 org.apache.zookeeper.ClientCnxn:[1025] - Opening socket connection to server bigdata/10.2.12.3:2181. Will not attempt to authenticate using SASL (unknown error)
[INFO] 2021-07-19 12:46:32.076 org.apache.zookeeper.ClientCnxn:[1162] - Socket error occurred: bigdata/10.2.12.3:2181: Connection refused
[INFO] 2021-07-19 12:46:32.215 org.apache.zookeeper.ClientCnxn:[1025] - Opening socket connection to server bigdata/10.2.12.3:2181. Will not attempt to authenticate using SASL (unknown error)
[INFO] 2021-07-19 12:46:32.215 org.apache.zookeeper.ClientCnxn:[1162] - Socket error occurred: bigdata/10.2.12.3:2181: Connection refused
[INFO] 2021-07-19 12:46:32.937 org.apache.zookeeper.ClientCnxn:[1025] - Opening socket connection to server bigdata/10.2.12.3:2181. Will not attempt to authenticate using SASL (unknown error)
[INFO] 2021-07-19 12:46:32.938 org.apache.zookeeper.ClientCnxn:[1162] - Socket error occurred: bigdata/10.2.12.3:2181: Connection refused
[INFO] 2021-07-19 12:46:33.155 org.apache.zookeeper.ClientCnxn:[1025] - Opening socket connection to server bigdata/10.2.12.3:2181. Will not attempt to authenticate using SASL (unknown error)
[INFO] 2021-07-19 12:46:33.156 org.apache.zookeeper.ClientCnxn:[1162] - Socket error occurred: bigdata/10.2.12.3:2181: Connection refused
[INFO] 2021-07-19 12:46:33.177 org.apache.zookeeper.ClientCnxn:[1025] - Opening socket connection to server bigdata/10.2.12.3:2181. Will not attempt to authenticate using SASL (unknown error)
[INFO] 2021-07-19 12:46:33.177 org.apache.zookeeper.ClientCnxn:[1162] - Socket error occurred: bigdata/10.2.12.3:2181: Connection refused
[INFO] 2021-07-19 12:46:33.316 org.apache.zookeeper.ClientCnxn:[1025] - Opening socket connection to server bigdata/10.2.12.3:2181. Will not attempt to authenticate using SASL (unknown error)
[INFO] 2021-07-19 12:46:33.316 org.apache.zookeeper.ClientCnxn:[1162] - Socket error occurred: bigdata/10.2.12.3:2181: Connection refused
[INFO] 2021-07-19 12:46:33.614 org.apache.dolphinscheduler.remote.NettyRemotingClient:[403] - netty client closed
[INFO] 2021-07-19 12:46:33.614 org.apache.dolphinscheduler.service.log.LogClientService:[59] - logger client closed
@chengshiwen @ruanwenjun Please take a look. Thanks!
@lishiyucn Yes, after the reconnect the master is still down. You can find more details in #5210.
I looked at #5210 and pulled the changes from https://github.com/apache/dolphinscheduler/pull/5211/files and https://github.com/apache/dolphinscheduler/pull/5879/files.
If I fix the code, would the problem be solved? @ruanwenjun
Closed. Released in 1.3.7.
I also hit the same error message: the datasource is closed when the server is judged dead. The code closes the datasource in DruidConnectionProvider.java. My version is 1.3.6.
dolphinscheduler1.3.6