alibaba / otter

阿里巴巴分布式数据库同步系统(解决中美异地机房)
Apache License 2.0
8.07k stars 2.49k forks source link

batch延迟抖动 #558

Open deep011 opened 6 years ago

deep011 commented 6 years ago

我们线上出现个问题,pipeline的delay时不时会超过60秒,现在定位到是selectTask类中,processTermin()函数在等待termin信号时,等不到termin信号,而重试了30次(sleep了30秒),结束后,又sleep了30秒导致的,如下图: image

startProcessTermin()函数中 image

通过开启dumpEvent日志观察,相关的processId已经load成功了,但termin信号没有创建

现在怀疑可能跟如下代码有关,需要帮忙确认下 @agapple image

deep011 commented 6 years ago

日志中没有发现其他异常,compareReply看上去,只会删除progress中比zk中所有processId都小的,好像也不是这里引起的问题

deep011 commented 6 years ago

相关日志:

2018-07-08 17:43:23.780 [Arbitrate-Async-Watcher-11] WARN  c.a.o.shared.arbitrate.impl.setl.monitor.MainstemMonitor - handleDataChange pipeline 41, nid 14
2018-07-08 17:47:28.538 [Otter-Arbitrate-Executor-3] WARN  c.a.o.s.a.impl.setl.rpc.monitor.AbstractProcessListener - process is not in order, please check processId:170540066
2018-07-08 17:47:28.538 [Otter-Arbitrate-Executor-3] WARN  c.a.o.s.a.impl.setl.rpc.monitor.AbstractProcessListener - ## SelectProcessListener dup reply id [170540066]
2018-07-08 17:59:53.780 [Arbitrate-Async-Watcher-8] WARN  c.a.o.shared.arbitrate.impl.setl.monitor.MainstemMonitor - handleDataChange pipeline 41, nid 14
2018-07-08 18:17:23.780 [Arbitrate-Async-Watcher-7] WARN  c.a.o.shared.arbitrate.impl.setl.monitor.MainstemMonitor - handleDataChange pipeline 41, nid 14
2018-07-08 18:29:18.198 [pipelineId = 41,taskName = ProcessTermin] WARN  com.alibaba.otter.node.etl.select.SelectTask - wait for 30 seconds to process termin, batchId = 205074, processId = 170574226, termin =  TerminEventData[type=NORMAL,code=setl,desc=,currNid=17,processId=170574227,startTime=1531045727731,endTime=1531045727749,firstTime=1531045727000,batchId=205075,number=1,size=<null>,exts=<null>,pipelineId=41]
2018-07-08 18:29:18.198 [pipelineId = 41,taskName = ProcessTermin] WARN  com.alibaba.otter.node.etl.select.SelectTask - [41] ProcessTermin has an error! retry...
com.alibaba.otter.node.etl.select.exceptions.SelectException: unmatched terminId, SelectTask batchId = 205074 processId = 170574226 and Termin Event: TerminEventData[type=NORMAL,code=setl,desc=,currNid=17,processId=170574227,startTime=1531045727731,endTime=1531045727749,firstTime=1531045727000,batchId=205075,number=1,size=<null>,exts=<null>,pipelineId=41]
        at com.alibaba.otter.node.etl.select.SelectTask.processTermin(SelectTask.java:470) ~[node.etl-0.7.1-pdd.jar:na]
        at com.alibaba.otter.node.etl.select.SelectTask.access$3600(SelectTask.java:94) ~[node.etl-0.7.1-pdd.jar:na]
        at com.alibaba.otter.node.etl.select.SelectTask$3.run(SelectTask.java:406) ~[node.etl-0.7.1-pdd.jar:na]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_102]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_102]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_102]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_102]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_102]
2018-07-08 18:29:18.198 [pipelineId = 41,taskName = ProcessTermin] WARN  com.alibaba.otter.node.etl.select.SelectTask - notifyRollback happened, pipelineId: 41
2018-07-08 18:29:18.198 [pipelineId = 41,taskName = ProcessTermin] WARN  com.alibaba.otter.node.etl.select.SelectTask - Pipeline 41 sleep 30 seconds to wait for the termin ready.
2018-07-08 18:29:48.244 [pipelineId = 41,taskName = ProcessSelect] WARN  com.alibaba.otter.node.etl.select.SelectTask - batchTermin put into the batchBuffer(len:1) used too long time(57146), pipelineId: 41, beginTimeBatchBufferPut: 1531045731098, processId: 170574276
2018-07-08 18:29:48.262 [pipelineId = 41,taskName = ProcessTermin] WARN  com.alibaba.otter.node.etl.select.SelectTask - [41] ProcessTermin has an error! retry...
com.alibaba.otter.canal.server.exception.CanalServerException: ack error , clientId:41 batchId:205125 is not exist , please check
2018-07-08 18:29:48.262 [pipelineId = 41,taskName = ProcessTermin] WARN  com.alibaba.otter.node.etl.select.SelectTask - notifyRollback happened, pipelineId: 41
2018-07-08 18:29:48.262 [pipelineId = 41,taskName = ProcessTermin] WARN  com.alibaba.otter.node.etl.select.SelectTask - Pipeline 41 sleep 30 seconds to wait for the termin ready.
2018-07-08 18:29:48.275 [pipelineId = 41,taskName = ProcessSelect] WARN  com.alibaba.otter.node.etl.select.SelectTask - notifyRollback happened, pipelineId: 41
2018-07-08 18:35:53.780 [Arbitrate-Async-Watcher-17] WARN  c.a.o.shared.arbitrate.impl.setl.monitor.MainstemMonitor - handleDataChange pipeline 41, nid 14 

row_select日志:

* Batch Id: [205072] ,total : [5] , normal : [3] , filter :[2] , Time : 2018-07-08 18:28:47:447
* Start : [mysql-bin.000278:160110814:1531045727000(2018-07-08 18:28:47)]
* End : [mysql-bin.000278:160111628:1531045727000(2018-07-08 18:28:47)]
****************************************************

****************************************************
* Batch Id: [205073] ,total : [3] , normal : [1] , filter :[2] , Time : 2018-07-08 18:28:47:609
* Start : [mysql-bin.000278:160111699:1531045727000(2018-07-08 18:28:47)]
* End : [mysql-bin.000278:160112022:1531045727000(2018-07-08 18:28:47)]
****************************************************

****************************************************
* Batch Id: [205074] ,total : [4] , normal : [2] , filter :[2] , Time : 2018-07-08 18:28:47:680
* Start : [mysql-bin.000278:160112093:1531045727000(2018-07-08 18:28:47)]
* End : [mysql-bin.000278:160112561:1531045727000(2018-07-08 18:28:47)]
****************************************************

****************************************************
* Batch Id: [205075] ,total : [3] , normal : [1] , filter :[2] , Time : 2018-07-08 18:28:47:730
* Start : [mysql-bin.000278:160112632:1531045727000(2018-07-08 18:28:47)]
* End : [mysql-bin.000278:160112954:1531045727000(2018-07-08 18:28:47)]
****************************************************

****************************************************
* Batch Id: [205076] ,total : [4] , normal : [2] , filter :[2] , Time : 2018-07-08 18:28:47:747
* Start : [mysql-bin.000278:160113025:1531045727000(2018-07-08 18:28:47)]
* End : [mysql-bin.000278:160113496:1531045727000(2018-07-08 18:28:47)]
**************************************************** 

row_load日志

 ****************************************************
* status : successed  , time : 2018-07-08 18:28:47:254 *
* Identity : Identity[channelId=15,pipelineId=41,processId=170574222] *
* total Data : [2] , success Data : [2] , failed Data : [0] , Interrupt : [false]
****************************************************

****************************************************
* status : successed  , time : 2018-07-08 18:28:47:416 *
* Identity : Identity[channelId=15,pipelineId=41,processId=170574223] *
* total Data : [2] , success Data : [2] , failed Data : [0] , Interrupt : [false]
****************************************************

****************************************************
* status : successed  , time : 2018-07-08 18:28:47:468 *
* Identity : Identity[channelId=15,pipelineId=41,processId=170574224] *
* total Data : [3] , success Data : [3] , failed Data : [0] , Interrupt : [false]
****************************************************

****************************************************
* status : successed  , time : 2018-07-08 18:28:47:628 *
* Identity : Identity[channelId=15,pipelineId=41,processId=170574225] *
* total Data : [1] , success Data : [1] , failed Data : [0] , Interrupt : [false]
****************************************************

****************************************************
* status : successed  , time : 2018-07-08 18:28:47:699 *
* Identity : Identity[channelId=15,pipelineId=41,processId=170574226] *
* total Data : [2] , success Data : [2] , failed Data : [0] , Interrupt : [false]
****************************************************

****************************************************
* status : successed  , time : 2018-07-08 18:28:47:749 *
* Identity : Identity[channelId=15,pipelineId=41,processId=170574227] *
* total Data : [1] , success Data : [1] , failed Data : [0] , Interrupt : [false]
****************************************************

****************************************************
* status : successed  , time : 2018-07-08 18:28:47:783 *
* Identity : Identity[channelId=15,pipelineId=41,processId=170574228] *
* total Data : [2] , success Data : [2] , failed Data : [0] , Interrupt : [false]
**************************************************** 
deep011 commented 6 years ago

另外一个很奇怪的现象,每次出现这个问题,SelectTask中的batchBuffer都是满的,新的batchTermin放入batchBuffer都要等将近60秒,我加了日志,打印出来是:

2018-07-08 18:29:48.244 [pipelineId = 41,taskName = ProcessSelect] WARN  com.alibaba.otter.node.etl.select.SelectTask - batchTermin put into the batchBuffer(len:1) used too long time(57146), pipelineId: 41, beginTimeBatchBufferPut: 1531045731098, processId: 170574276

可见,加入batchBuffer等了57秒。 按道理来说,batchBuffer不应该会被塞满的,因为现存的processId的数量不会超过pipeline的并行度(我们配置的10),而batchBuffer的容量代码里写死了50。

deep011 commented 6 years ago

另外,我们日志中频繁出现这样的告警:process is not in order, please check processId:989878 otter的代码如下图: image 感觉这里会有些并发问题。

agapple commented 6 years ago

你们是自己改过代码 ?

agapple commented 6 years ago

batchBuffer会加入一下空select数据,快速ack,有可能大小是超过process数量

deep011 commented 6 years ago

代码没有改过的 嗯,确实空的batch会放入batchBuffer。

deep011 commented 6 years ago

又打了一点日志,发现select阶段等待的termin信号在load阶段没有创建

image 上图是select机器的日志,显示 processId = 179019702 的termin信号等待超时

image 上图是load机器的日志,对每个termin的创建,我都打了日志,发现 processId = 179019702 的termin信号确实没有创建

load阶段,我加日志的地方在NormalTerminProcess.java文件中processDelete()函数中,如下图 image

wuqiu-ai commented 5 years ago

又打了一点日志,发现select阶段等待的termin信号在load阶段没有创建

image 上图是select机器的日志,显示 processId = 179019702 的termin信号等待超时

image 上图是load机器的日志,对每个termin的创建,我都打了日志,发现 processId = 179019702 的termin信号确实没有创建

load阶段,我加日志的地方在NormalTerminProcess.java文件中processDelete()函数中,如下图 image

这个最终是怎么解决的呢?我的现象环境也出现了这个问题

Fernandoize commented 3 years ago

这个问题有进展吗?我也遇到了类似的问题