apache / dubbo

The Java implementation of Apache Dubbo, an RPC and microservice framework.
https://dubbo.apache.org/
Apache License 2.0

Dubbo invocation times out, yet the provider's historical stats show very short processing times and no timeout log matching the timed-out request's context can be found #1784

Closed Jaskey closed 4 years ago

Jaskey commented 6 years ago

Service A calls service B with a 1-second timeout. For some requests, the call timed out twice and was retried twice; the third attempt succeeded and took only 4 ms.

The client logs something like:

2018-05-11 00:00:14.741 [T_Common_DispatchExecutor_43] WARN  com.alibaba.dubbo.rpc.cluster.support.FailoverClusterInvoker -  [DUBBO] Although retry the method push in the service com.oppo.push.open.basic.api.service.BroadcastService was successful by the provider 10.12.
26.124:9000, but there have been failed providers [10.12.26.137:9000] (1/6) from the registry 10.12.26.154:2181 on the consumer 10.12.26.102 using the dubbo version 2.5.3. Last error is: Invoke remote method timeout. method: push, provider: dubbo://10.12.26.137:9000/com
.oppo.push.open.basic.api.service.BroadcastService?accesslog=false&anyhost=true&application=push-open-platform-gateway-server&check=false&default.delay=-1&default.service.filter=-exception&default.timeout=1000&delay=-1&dubbo=2.5.3&interface=com.oppo.push.open.basic.api.
service.BroadcastService&loadbalance=random&logger=slf4j&methods=revokePushMessage,revokeSmsMessage,push&pid=31597&push.retries=2&revision=3.0.2-SNAPSHOT&side=consumer&timestamp=1525937532890&transporter=netty, cause: Waiting server-side response timeout by scan timer. 
start time: 2018-05-11 00:00:13.710, end time: 2018-05-11 00:00:14.738, client elapsed: 0 ms, server elapsed: 1028 ms, timeout: 1000 ms, request: Request [id=2395465, version=2.0.0, twoway=true, event=false, broken=false, data=RpcInvocation [methodName=push, parameterTy
pes=[class com.oppo.push.open.basic.api.domain.PushTask], arguments=[PushTask[taskId=5af46c8da609e97b6dc17380, messageId=5af4655dad99497f2345ccda, messageCategory=3, messageType=1, taskStatus=5, target=PushTarget{id='5af46c8da609e97b6dc17381', type=REGISTRATION_ID, valu
e='6187cd138ad2dbceee7f044c2409d4f1;ab5721987d21dcf5e863b58e165f144a;827f4ff3b5d31cce9e441b045eb985c9;e17854797bb4f0e7b14178cbde596c88;CN_647982d13860dde6876cdf0e37165c76;93624a57d555510bcf27cda47d5a06f9;eb3ebc3a53b12ef304de95b3e1852c60;5b752dc47cbca9721239885a062814c2;
c69bd502aee453f0b78517bc9cb4548d;cc35de8b5edbff28dbaf55e4f60ad20a;CN_1239d29db4f5db54cc1bef51df3f3acc;CN_f6fd0448d1c6281da6ffa50df4d26f38;CN_2c2dc12e3c567ad81a00ca9cdca564e3;CN_d92b53c07de3e70883552295ece900f1;CN_b1cedf2537fd52e91fa29b5fb3c4eb47;525fb9de1d18e21c15479a0a
7f59b163;CN_debc0d6366f27dea1a45862020687f1a;d5bbb9bae7102ad0d69b3aeb138c4ff4;aac05d1a20d073343917c68798f9be2f;817898207275b2dbeefb6e9398f06b50;203e1b8c0de742605ecb166b6a92d9ce;8c6a4f0925ba3a1d92e3640d00507ade;fc3e47ab8ca527d964fae84ab5db00d7;d9c2ecf8051093bf7eb7818bbe5
5ec8f;c7894071806e5a6540a5380b1763e91b;b56b7a3e247258034876b9a8522b0dab;8b17b0c449bffa08407a4401f33b988f;789ca011db0844d01932fb8da2553a03;735e6aae28a3f2d5752e65f277bea9c3;e2cfca07e576a0285ab8f50dbaa3a947;8428d377795b069d9dfe033e880ed627;0939eab473c0311b5dd5567f5d191487;
ec2ff892fbc28fdf0008d532a2c48560;e3df5c83de7f985e73d4e58d4d15225d;CN_c5af7007e71c23771b29eb0713cde7e2;ef91ece8b7e44a13fb31015f8f20b25b;CN_57b78f69e35ce1f8c1fcb27e36758fca;CN_d58cb83cdd32f108f84ff6eae662712f;CN_5564f22ea67f3391fc0f37b23d68ade0;e2b842e3aea81c7e7784fae318e
bfb8a;2315554dca29eac7100d823f262c2b6f;18c0676bc2300630bad171d72353de0b;116c0728d4d8bed6aa80ba78a77620c1;11c37c408a7c88cf80e7bca695228c05;643413c1adad5e72df94594017797078;CN_c8dedeeeacdc24e9194386929e5367a7;CN_77a3fc33b094c592bad79d1aa95761af;4b9ceaaac2fbca88c01c3976c4a
c7849;0a1c506ef41acca7bf4c01a758ece6cb;66c0c30821c399e06bb63cfd630b2016;2a5efb3ee633035f5d3700dc744b97d2;07348c108d514773761a828499877ce4;4532d9d99307e6b3634fdfc06aecbb09;6aae440b59761d2ed4e425206f276dc7', effectiveValue='null', unsubscribeValue='null', deviceId='null'}
, appId=nJNKY574F2C4TWrAqYobsa3X, creator=null, messageTitle=null, createTime=Fri May 11 00:00:13 CST 2018, updateTime=Fri May 11 00:00:13 CST 2018], finishTime=null], extFields={}]], attachments={path=com.oppo.push.open.basic.api.service.BroadcastService, interface=com
.oppo.push.open.basic.api.service.BroadcastService, timeout=1000, version=0.0.0}]], channel: /10.12.26.102:49081 -> /10.12.26.137:9000, dubbo version: 2.5.3, current host: 10.12.26.102
com.alibaba.dubbo.rpc.RpcException: Invoke remote method timeout. method: push, provider: dubbo://10.12.26.137:9000/com.oppo.push.open.basic.api.service.BroadcastService?accesslog=false&anyhost=true&application=push-open-platform-gateway-server&check=false&default.delay
=-1&default.service.filter=-exception&default.timeout=1000&delay=-1&dubbo=2.5.3&interface=com.oppo.push.open.basic.api.service.BroadcastService&loadbalance=random&logger=slf4j&methods=revokePushMessage,revokeSmsMessage,push&pid=31597&push.retries=2&revision=3.0.2-SNAPSH
OT&side=consumer&timestamp=1525937532890&transporter=netty, cause: Waiting server-side response timeout by scan timer. start time: 2018-05-11 00:00:13.710, end time: 2018-05-11 00:00:14.738, client elapsed: 0 ms, server elapsed: 1028 ms, timeout: 1000 ms, request: Reque
st [id=2395465, version=2.0.0, twoway=true, event=false, broken=false, data=RpcInvocation [methodName=push, parameterTypes=[class com.oppo.push.open.basic.api.domain.PushTask], arguments=[PushTask[taskId=5af46c8da609e97b6dc17380, messageId=5af4655dad99497f2345ccda, mess
ageCategory=3, messageType=1, taskStatus=5, target=PushTarget{id='5af46c8da609e97b6dc17381', type=REGISTRATION_ID, value='6187cd138ad2dbceee7f044c2409d4f1;ab5721987d21dcf5e863b58e165f144a;827f4ff3b5d31cce9e441b045eb985c9;e17854797bb4f0e7b14178cbde596c88;CN_647982d13860d
de6876cdf0e37165c76;93624a57d555510bcf27cda47d5a06f9;eb3ebc3a53b12ef304de95b3e1852c60;5b752dc47cbca9721239885a062814c2;c69bd502aee453f0b78517bc9cb4548d;cc35de8b5edbff28dbaf55e4f60ad20a;CN_1239d29db4f5db54cc1bef51df3f3acc;CN_f6fd0448d1c6281da6ffa50df4d26f38;CN_2c2dc12e3c
567ad81a00ca9cdca564e3;CN_d92b53c07de3e70883552295ece900f1;CN_b1cedf2537fd52e91fa29b5fb3c4eb47;525fb9de1d18e21c15479a0a7f59b163;CN_debc0d6366f27dea1a45862020687f1a;d5bbb9bae7102ad0d69b3aeb138c4ff4;aac05d1a20d073343917c68798f9be2f;817898207275b2dbeefb6e9398f06b50;203e1b8
c0de742605ecb166b6a92d9ce;8c6a4f0925ba3a1d92e3640d00507ade;fc3e47ab8ca527d964fae84ab5db00d7;d9c2ecf8051093bf7eb7818bbe55ec8f;c7894071806e5a6540a5380b1763e91b;b56b7a3e247258034876b9a8522b0dab;8b17b0c449bffa08407a4401f33b988f;789ca011db0844d01932fb8da2553a03;735e6aae28a3f
2d5752e65f277bea9c3;e2cfca07e576a0285ab8f50dbaa3a947;8428d377795b069d9dfe033e880ed627;0939eab473c0311b5dd5567f5d191487;ec2ff892fbc28fdf0008d532a2c48560;e3df5c83de7f985e73d4e58d4d15225d;CN_c5af7007e71c23771b29eb0713cde7e2;ef91ece8b7e44a13fb31015f8f20b25b;CN_57b78f69e35ce
1f8c1fcb27e36758fca;CN_d58cb83cdd32f108f84ff6eae662712f;CN_5564f22ea67f3391fc0f37b23d68ade0;e2b842e3aea81c7e7784fae318ebfb8a;2315554dca29eac7100d823f262c2b6f;18c0676bc2300630bad171d72353de0b;116c0728d4d8bed6aa80ba78a77620c1;11c37c408a7c88cf80e7bca695228c05;643413c1adad5
e72df94594017797078;CN_c8dedeeeacdc24e9194386929e5367a7;CN_77a3fc33b094c592bad79d1aa95761af;4b9ceaaac2fbca88c01c3976c4ac7849;0a1c506ef41acca7bf4c01a758ece6cb;66c0c30821c399e06bb63cfd630b2016;2a5efb3ee633035f5d3700dc744b97d2;07348c108d514773761a828499877ce4;4532d9d99307e
6b3634fdfc06aecbb09;6aae440b59761d2ed4e425206f276dc7', effectiveValue='null', unsubscribeValue='null', deviceId='null'}, appId=nJNKY574F2C4TWrAqYobsa3X, creator=null, messageTitle=null, createTime=Fri May 11 00:00:13 CST 2018, updateTime=Fri May 11 00:00:13 CST 2018], f
inishTime=null], extFields={}]], attachments={path=com.oppo.push.open.basic.api.service.BroadcastService, interface=com.oppo.push.open.basic.api.service.BroadcastService, timeout=1000, version=0.0.0}]], channel: /10.12.26.102:49081 -> /10.12.26.137:9000
        at com.alibaba.dubbo.rpc.protocol.dubbo.DubboInvoker.doInvoke(DubboInvoker.java:99)
        at com.alibaba.dubbo.rpc.protocol.AbstractInvoker.invoke(AbstractInvoker.java:144)
        at com.alibaba.dubbo.rpc.listener.ListenerInvokerWrapper.invoke(ListenerInvokerWrapper.java:74)
        at com.alibaba.dubbo.rpc.protocol.dubbo.filter.FutureFilter.invoke(FutureFilter.java:53)
        at com.alibaba.dubbo.rpc.protocol.ProtocolFilterWrapper$1.invoke(ProtocolFilterWrapper.java:91)
        at com.alibaba.dubbo.monitor.support.MonitorFilter.invoke(MonitorFilter.java:75)
        at com.alibaba.dubbo.rpc.protocol.ProtocolFilterWrapper$1.invoke(ProtocolFilterWrapper.java:91)
        at com.oppo.trace.dubbo.DubboConsumerTraceFilter.invoke(DubboConsumerTraceFilter.java:37)
        at com.alibaba.dubbo.rpc.protocol.ProtocolFilterWrapper$1.invoke(ProtocolFilterWrapper.java:91)
        at com.alibaba.dubbo.rpc.filter.ConsumerContextFilter.invoke(ConsumerContextFilter.java:48)
        at com.alibaba.dubbo.rpc.protocol.ProtocolFilterWrapper$1.invoke(ProtocolFilterWrapper.java:91)
        at com.alibaba.dubbo.rpc.protocol.InvokerWrapper.invoke(InvokerWrapper.java:53)
        at com.alibaba.dubbo.rpc.cluster.support.FailoverClusterInvoker.doInvoke(FailoverClusterInvoker.java:77)
        at com.alibaba.dubbo.rpc.cluster.support.AbstractClusterInvoker.invoke(AbstractClusterInvoker.java:227)
        at com.alibaba.dubbo.rpc.cluster.support.wrapper.MockClusterInvoker.invoke(MockClusterInvoker.java:72)
        at com.alibaba.dubbo.rpc.proxy.InvokerInvocationHandler.invoke(InvokerInvocationHandler.java:52)
        at com.alibaba.dubbo.common.bytecode.proxy0.push(proxy0.java)
        at com.oppo.push.open.platform.gateway.action.notification.NotificationBroadcastAction.doExecute(NotificationBroadcastAction.java:105)
        at com.oppo.push.open.platform.gateway.action.AbstractAction.execute(AbstractAction.java:51)
        at com.oppo.push.httpframework.Action.ActionCallable$1.exec(ActionCallable.java:39)
        at com.oppo.push.httpframework.Action.ActionCallable$1.exec(ActionCallable.java:36)
        at com.oppo.push.httpframework.common.TraceTemplate.rootExec(TraceTemplate.java:24)
        at com.oppo.push.httpframework.Action.ActionCallable.run(ActionCallable.java:36)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: com.alibaba.dubbo.remoting.TimeoutException: Waiting server-side response timeout by scan timer. start time: 2018-05-11 00:00:13.710, end time: 2018-05-11 00:00:14.738, client elapsed: 0 ms, server elapsed: 1028 ms, timeout: 1000 ms, request: Request [id=2395
465, version=2.0.0, twoway=true, event=false, broken=false, data=RpcInvocation [methodName=push, parameterTypes=[class com.oppo.push.open.basic.api.domain.PushTask], arguments=[PushTask[taskId=5af46c8da609e97b6dc17380, messageId=5af4655dad99497f2345ccda, messageCategory
=3, messageType=1, taskStatus=5, target=PushTarget{id='5af46c8da609e97b6dc17381', type=REGISTRATION_ID, value='6187cd138ad2dbceee7f044c2409d4f1;ab5721987d21dcf5e863b58e165f144a;827f4ff3b5d31cce9e441b045eb985c9;e17854797bb4f0e7b14178cbde596c88;CN_647982d13860dde6876cdf0e
37165c76;93624a57d555510bcf27cda47d5a06f9;eb3ebc3a53b12ef304de95b3e1852c60;5b752dc47cbca9721239885a062814c2;c69bd502aee453f0b78517bc9cb4548d;cc35de8b5edbff28dbaf55e4f60ad20a;CN_1239d29db4f5db54cc1bef51df3f3acc;CN_f6fd0448d1c6281da6ffa50df4d26f38;CN_2c2dc12e3c567ad81a00c
a9cdca564e3;CN_d92b53c07de3e70883552295ece900f1;CN_b1cedf2537fd52e91fa29b5fb3c4eb47;525fb9de1d18e21c15479a0a7f59b163;CN_debc0d6366f27dea1a45862020687f1a;d5bbb9bae7102ad0d69b3aeb138c4ff4;aac05d1a20d073343917c68798f9be2f;817898207275b2dbeefb6e9398f06b50;203e1b8c0de742605e
cb166b6a92d9ce;8c6a4f0925ba3a1d92e3640d00507ade;fc3e47ab8ca527d964fae84ab5db00d7;d9c2ecf8051093bf7eb7818bbe55ec8f;c7894071806e5a6540a5380b1763e91b;b56b7a3e247258034876b9a8522b0dab;8b17b0c449bffa08407a4401f33b988f;789ca011db0844d01932fb8da2553a03;735e6aae28a3f2d5752e65f2
77bea9c3;e2cfca07e576a0285ab8f50dbaa3a947;8428d377795b069d9dfe033e880ed627;0939eab473c0311b5dd5567f5d191487;ec2ff892fbc28fdf0008d532a2c48560;e3df5c83de7f985e73d4e58d4d15225d;CN_c5af7007e71c23771b29eb0713cde7e2;ef91ece8b7e44a13fb31015f8f20b25b;CN_57b78f69e35ce1f8c1fcb27e
36758fca;CN_d58cb83cdd32f108f84ff6eae662712f;CN_5564f22ea67f3391fc0f37b23d68ade0;e2b842e3aea81c7e7784fae318ebfb8a;2315554dca29eac7100d823f262c2b6f;18c0676bc2300630bad171d72353de0b;116c0728d4d8bed6aa80ba78a77620c1;11c37c408a7c88cf80e7bca695228c05;643413c1adad5e72df945940
17797078;CN_c8dedeeeacdc24e9194386929e5367a7;CN_77a3fc33b094c592bad79d1aa95761af;4b9ceaaac2fbca88c01c3976c4ac7849;0a1c506ef41acca7bf4c01a758ece6cb;66c0c30821c399e06bb63cfd630b2016;2a5efb3ee633035f5d3700dc744b97d2;07348c108d514773761a828499877ce4;4532d9d99307e6b3634fdfc0
6aecbb09;6aae440b59761d2ed4e425206f276dc7', effectiveValue='null', unsubscribeValue='null', deviceId='null'}, appId=nJNKY574F2C4TWrAqYobsa3X, creator=null, messageTitle=null, createTime=Fri May 11 00:00:13 CST 2018, updateTime=Fri May 11 00:00:13 CST 2018], finishTime=n
ull], extFields={}]], attachments={path=com.oppo.push.open.basic.api.service.BroadcastService, interface=com.oppo.push.open.basic.api.service.BroadcastService, timeout=1000, version=0.0.0}]], channel: /10.12.26.102:49081 -> /10.12.26.137:9000
        at com.alibaba.dubbo.remoting.exchange.support.DefaultFuture.returnFromResponse(DefaultFuture.java:188)
        at com.alibaba.dubbo.remoting.exchange.support.DefaultFuture.get(DefaultFuture.java:110)
        at com.alibaba.dubbo.remoting.exchange.support.DefaultFuture.get(DefaultFuture.java:84)
        at com.alibaba.dubbo.rpc.protocol.dubbo.DubboInvoker.doInvoke(DubboInvoker.java:96)
        ... 27 common frames omitted

Two things are strange here:

  1. On the provider that reportedly timed out, no matching timeout WARN log can be found.
  2. According to our own server-side stats, this method has never taken more than 100 ms.

Here is my analysis:

My guess is that this is related to queueing before the request enters the provider's business thread pool. But we did not configure a queue length, so by default there should be no queueing at all, and when the thread pool is exhausted the exception thrown should be a rejection, not a timeout.

Even if the request had queued, the provider has no timeout log for this context, which makes it look as if it never received the request at all.

Could the packet simply have been lost on the network? If transmission itself failed, would the exception really be a timeout? The request is an object that serializes to a few KB of JSON.

Any explanation would be appreciated.
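The reasoning above (no queue configured, so pool exhaustion should surface as an immediate rejection rather than a timeout) can be sketched with a plain JDK thread pool; this is an illustration of that behavior only, not Dubbo's actual thread-pool code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class RejectionDemo {
    // Returns true if the second task is rejected outright rather than queued.
    static boolean secondTaskRejected() {
        // A single-thread pool backed by a SynchronousQueue: no queuing at all.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new SynchronousQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        // Occupy the only worker thread.
        pool.execute(() -> {
            try { release.await(); } catch (InterruptedException ignored) { }
        });
        boolean rejected = false;
        try {
            pool.execute(() -> { });  // no idle worker and no queue slot
        } catch (RejectedExecutionException e) {
            rejected = true;          // fails fast: an error, not a timeout
        }
        release.countDown();
        pool.shutdown();
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println("second task rejected: " + secondTaskRejected());
    }
}
```

The caller gets a RejectedExecutionException immediately, which is why exhaustion alone does not explain a client that waits the full timeout.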

feelwing1314 commented 6 years ago

Regarding "no matching timeout WARN log on the provider": 1. Is it possible the log level for dubbo's com.alibaba.dubbo.rpc.filter package was set to ERROR? 2. The provider-side TimeoutFilter uses the timeout configured on the provider side, not the consumer side, and it defaults to Integer.MAX_VALUE; the 1 s you set on the consumer has no effect there.

Jaskey commented 6 years ago

@feelwing1314 The timeout in question was configured on the provider side, and other timeouts do produce WARN logs, so it cannot be a log-level problem.

li-keli commented 6 years ago

I am seeing a similar problem here with dubbo version 2.6.0: calling a method on some service returns "Invoke remote method timeout." almost instantly (within microseconds). After a few retries, sometimes the call succeeds normally and sometimes it times out again.

The provider module already has a timeout configured.

eoc2015 commented 6 years ago

I ran into the same problem. If anyone has an answer, I would be very grateful.

Jaskey commented 6 years ago

Begging for an authoritative explanation.

lovepoem commented 6 years ago

Could the packet simply have been lost on the network? If transmission itself failed, would the exception really be a timeout?

Network problems can lead to timeouts. Even when the server side is healthy, the consumer can still observe a timeout.

Jaskey commented 6 years ago

@lovepoem Judging from the log "client elapsed: 0 ms, server elapsed: 1028 ms, timeout: 1000 ms", the problem is definitely not in the client-to-server direction; it must be on the server-to-client leg. But the provider's actual processing time is very short, so if this really is a network problem, it is on the return path. We have never seen anything similar on the outbound path, though, so the odds are it is network-related.

Could it be related to encoding/decoding?

li-keli commented 6 years ago

We also saw similar timeouts here in the past: neither side's business code showed any error, and only a few milliseconds elapsed between issuing the remote call and receiving the exception, yet a timeout exception was thrown.

After we configured a timeout on both the provider and the consumer, the problem went away.

drluorose commented 6 years ago

I ran into the same problem: com.alibaba.dubbo.rpc.RpcException: Invoke remote method timeout. method: takeAppAd, provider: dubbo://192.168.2.13:20805/com.douyu.wsd.adx.exchange.rpc.api.virtualaccount.AdLuaRService?actives=0&alive=60000&anyhost=true&application=adx-gateway&async=false&check=false&connections=4&corethreads=0&dispatcher=message&dubbo=2.6.0&generic=false&group=pro&interface=com.douyu.wsd.adx.exchange.rpc.api.virtualaccount.AdLuaRService&iothreads=33&keep.alive=true&lazy=false&loadbalance=roundrobin&logger=slf4j&methods=takeWebAd,isWebLibertyUser,takeAppAd,isAppLibertyUser&monitor=dubbo%3A%2F%2F10.11.0.55%3A2181%2Fcom.alibaba.dubbo.registry.RegistryService%3Fapplication%3Dadx-gateway%26backup%3D10.11.0.22%3A2181%2C10.11.0.26%3A2181%26dubbo%3D2.6.0%26logger%3Dslf4j%26organization%3Dwsd%26owner%3Dwsd-java%26pid%3D1210993%26protocol%3Dregistry%26refer%3Ddubbo%253D2.6.0%2526interface%253Dcom.alibaba.dubbo.monitor.MonitorService%2526pid%253D1210993%2526timestamp%253D1532628089853%26registry%3Dzookeeper%26timestamp%3D1532628089837&optimizer=com.douyu.adx.exchange.dubbo.internal.SerializationOptimizerImpl&organization=wsd&owner=wsd-java&pid=1210993&protocol=dubbo&queues=0&register.ip=192.168.2.9&remote.timestamp=1532629098461&retries=1&revision=0.0.1-SNAPSHOT&sent=false&serialization=kryo&service.filter=exceptionFilter&side=consumer&threadname=AdxRServiceConsumer&threads=2147483647&timeout=900&timestamp=1532628089789&version=1.0.0, cause: Waiting server-side response timeout by scan timer. 
start time: 2018-07-27 17:56:13.067, end time: 2018-07-27 17:56:13.984, client elapsed: 0 ms, server elapsed: 917 ms, timeout: 900 ms, request: Request [id=82601558, version=2.0.0, twoway=true, event=false, broken=false, data=RpcInvocation [methodName=takeAppAd, parameterTypes=[class com.douyu.wsd.framework.common.domain.StdRequest, interface java.util.Map], arguments=[StdRequest{tid='d1cd7b19dfa1453b94c8eb92d98659a3', ts=1532685373066, data={"app":"{\"aname\":\"斗鱼直播\",\"pname\":\"air.tv.douyu.android\"}","mdid":"phone","uid":"128636735","posid":"800039","client_sys":"android","cate1":"15","cate2":"270","imei":"868060039348334","chanid":"32","device":"{\"addid\":\"69a88d50cbc41a75\",\"devtype\":\"0\",\"h\":\"2160\",\"idfa\":\"\",\"imei\":\"868060039348334\",\"mac\":\"02:00:00:00:00:00\",\"mfrs\":\"OPPO\",\"model\":\"OPPO-R11st\",\"nt\":\"1\",\"op\":\"1\",\"os\":\"Android\",\"osv\":\"7.1.1\",\"ua\":\"Douyu_Android\",\"w\":\"1080\"}","roomid":"100","token":"128636735_12_c5f2a7e1f5f81815_1_69874863"}}, {content-length=599, cookie=acf_uid=128636735; acf_did=eddb5774190e008787999c0870805111, auth=dd56178541ba50125972f066cad47bd1, channel=32, x-via=1.1 PSzjhzsdyd6mw47:1 (Cdn Cache Server V2.0), x-forwarded-for=218.205.75.47, x-real-ip=218.205.75.47, requestTime=1532685373066, user-device=ZWRkYjU3NzQxOTBlMDA4Nzg3OTk5YzA4NzA4MDUxMTF8djQuNi4w, clientIp=39.182.109.142, host=rtbapi.douyucdn.cn, connection=close, content-type=application/x-www-form-urlencoded, time=1532685372, cdn-src-ip=39.182.109.142, aid=android1, accept-encoding=gzip, user-agent=android/4.6.0 (android 7.1.1; ; OPPO+R11st)}], attachments={path=com.douyu.wsd.adx.exchange.rpc.api.virtualaccount.AdLuaRService, interface=com.douyu.wsd.adx.exchange.rpc.api.virtualaccount.AdLuaRService, version=1.0.0, timeout=900, 
sw3=104.22761.15326853730678968|0|104|104|-50|#com.douyu.wsd.adx.exchange.rpc.api.virtualaccount.AdLuaRService.takeAppAd(StdRequest,Map)|#com.douyu.wsd.adx.exchange.rpc.api.virtualaccount.AdLuaRService.takeAppAd(StdRequest,Map)|104.22761.15326853730678969, group=pro}]], channel: /192.168.2.9:41504 -> /192.168.2.13:20805

        at com.alibaba.dubbo.rpc.protocol.dubbo.DubboInvoker.doInvoke(DubboInvoker.java:98)
        at com.alibaba.dubbo.rpc.protocol.AbstractInvoker.invoke(AbstractInvoker.java:142)
        at com.alibaba.boot.dubbo.listener.ConsumerInvokeStaticsFilter.invoke(ConsumerInvokeStaticsFilter.java:32)
        at com.alibaba.dubbo.rpc.protocol.ProtocolFilterWrapper$1.invoke(ProtocolFilterWrapper.java:68)
        at com.alibaba.dubbo.monitor.support.MonitorFilter.invoke$original$tDp0wZPU(MonitorFilter.java:64)

drluorose commented 6 years ago

The longest server-side response time is 345 ms.

drluorose commented 6 years ago

Configuration:

dubboProtocol:
  name: dubbo
  port: 20805
  dispatcher: message
  serialization: kryo
  keepAlive: true
  optimizer: com.douyu.adx.exchange.dubbo.internal.SerializationOptimizerImpl
  threads: 1000
  queues: 2000

Service:

@Service(
    interfaceClass = AdLuaRService.class,
    protocol = PROTOCOL_DUBBO,
    retries = 0,
    parameters = {"threadname", "AdLuaRService"},
    filter = "exceptionFilter",
    connections = 10,
    timeout = 800
)

drluorose commented 6 years ago

If this is a network problem, how can it be proven?

Jaskey commented 6 years ago

@lovepoem @chickenlj

It seems many users are facing the same problem; could you provide some advice?

kimmking commented 6 years ago

The provider's business method takes PT = 5 ms to execute, but the total consumer-side call time is CT = CLT (consumer-local Dubbo work: stub creation plus two rounds of encode/decode) + NRT1 (network request time) + NRT2 (network response time) + PLT (provider-local Dubbo work: skeleton creation plus two rounds of encode/decode) + PT.

In this situation I suggest setting a longer timeout on the consumer, e.g. Provider = 1 s, Consumer = 5 s. Then if the provider times out, the consumer gets the exception without waiting the full 5 s; and if the provider does not time out, the consumer has enough headroom to receive the result, avoiding the case where the provider is fine but the consumer times out.
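The suggestion above can be written down in Dubbo's XML configuration. A minimal sketch, assuming a hypothetical com.example.DemoService (timeout values are in milliseconds):

```xml
<!-- provider side: fail fast after 1 s -->
<dubbo:service interface="com.example.DemoService" ref="demoService" timeout="1000"/>

<!-- consumer side: leave headroom so a provider-side timeout surfaces first -->
<dubbo:reference id="demoService" interface="com.example.DemoService" timeout="5000"/>
```

Note that an explicit consumer-side setting takes precedence over the provider's default, which is why giving the consumer generous headroom here is deliberate.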

Jaskey commented 6 years ago

@kimmking But if the network is fine, then in this strange scenario the time must be lost in CLT (stub plus encode/decode)? Under what circumstances could that cost several seconds?

drluorose commented 6 years ago

[chart omitted: consumer vs. provider latency] This is the corresponding chart: under high concurrency the consumer-side time fluctuates heavily while the provider-side time stays flat. What can we look at to optimize this?

fingthinking commented 6 years ago

Doesn't this mostly happen when new instances come online during a deployment? It also often shows up when the response is very small: if tcpNoDelay = true is not set, you can see exactly this behavior.
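For context, TCP_NODELAY disables Nagle's algorithm, which otherwise delays small writes so they can be coalesced; that matches the "very small response" symptom above. A minimal JDK-socket sketch of the option itself (whether and how a given Dubbo/Netty version exposes it is version-dependent):

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class TcpNoDelayDemo {
    // Opens a loopback connection and disables Nagle's algorithm on it.
    static boolean enableNoDelay() throws IOException {
        try (ServerSocket server = new ServerSocket(0);  // ephemeral port
             Socket client = new Socket("127.0.0.1", server.getLocalPort())) {
            // With Nagle enabled, a small payload can sit in the send buffer
            // waiting for an ACK; TCP_NODELAY flushes it immediately.
            client.setTcpNoDelay(true);
            return client.getTcpNoDelay();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("TCP_NODELAY enabled: " + enableNoDelay());
    }
}
```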

Jaskey commented 6 years ago

@fingthinking Did setting that fix the problem for you?

jasonjoo2010 commented 6 years ago

@drluorose In your case you can set the connections attribute on the consumer side to give the heavily loaded service one or more dedicated long-lived connections, so it does not compete with other services and get throttled during bursts.

Also worth checking when load rises: the consumer host's load average and its cpu.switches/irq/softirq/idle figures.

fingthinking commented 6 years ago

@fingthinking Did setting that fix the problem for you?

After adding it, things improved. I would also suggest tuning the netty3 backlog parameter: netty3 uses the JDK default of 50, while netty4 by default reads the /proc/sys/net/core/somaxconn setting.
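The accept backlog bounds how many completed connections may wait for accept(); beyond it, new connections are refused or dropped, which the consumer perceives as latency or timeouts. A minimal JDK sketch of the parameter (1024 is an arbitrary example value; on Linux the kernel still caps the request at net.core.somaxconn):

```java
import java.io.IOException;
import java.net.ServerSocket;

public class BacklogDemo {
    // Binds to an ephemeral port with an explicit accept backlog.
    static boolean bindWithBacklog(int backlog) throws IOException {
        // The JDK default backlog is 50 when this argument is omitted.
        try (ServerSocket server = new ServerSocket(0, backlog)) {
            return server.isBound();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("bound: " + bindWithBacklog(1024));
    }
}
```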

cvictory commented 6 years ago

@Jaskey is it ok now?

FS1360472174 commented 6 years ago

We hit this situation: the server processed the request quickly, but the response reached the client several seconds later; packet captures confirmed this.

yanlun0323 commented 6 years ago

Dubbo invocation timeout in our production environment: Caused by: com.alibaba.dubbo.remoting.TimeoutException: Waiting server-side response timeout by scan timer. start time: 2018-10-18 12:23:28.116, end time: 2018-10-18 12:33:28.130, client elapsed: 0 ms, server elapsed: 600014 ms, timeout: 600000 ms, request: Request; is this related to the cached thread pool being set to 200?

cvictory commented 6 years ago

@FS1360472174 @Jaskey How can I reproduce this issue on my local machine? Can you upload a demo?

chenjianboKeyboard commented 6 years ago

We have seen this too. We also suspected all kinds of IO issues at first, but in the end it was traced to thread pool exhaustion: requests could not even enter the Dubbo thread pool, the Dubbo framework printed no exception, and of course no request/response parameters were logged either; all the client saw was a timeout.

gangxing commented 6 years ago

We hit this situation: the server processed the request quickly, but the response reached the client several seconds later; packet captures confirmed this. @FS1360472174 Did you solve it? I recently hit the same problem.

WinstonGao commented 6 years ago

Looks like a lot of people have run into the same problem.

iorilchen commented 5 years ago

Sometimes the provider receives the message late, sometimes the consumer receives the result late; either way it ends in a timeout.

byronzoz commented 5 years ago

Trace it with a distributed tracing system and look at where the RPC time is spent: the network transport layer or the business-processing layer.

bert82503 commented 5 years ago

If this is a network problem, how can it be proven?

Integrating a distributed tracing system lets you quickly determine whether the network is at fault; for sporadic problems, tcpdump alone relies on luck. As for queues: 2000, time may be spent waiting in the queue, so I do not recommend configuring it. connections = 10 makes each consumer application create at least 10 cached thread pools. We hit this in production recently: one provider application had 5 services configured with connections=50, so for a single provider instance each consumer creates at least 50*5 = 250 cached thread pools (one per service per connection; see DubboProtocol#refer).
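The multiplication described above can be made explicit; a trivial sketch of the arithmetic only (one client pool per service per connection), not of Dubbo's actual DubboProtocol#refer internals:

```java
public class PoolCountDemo {
    // One client executor per (service, connection) pair, as described above.
    static int clientPools(int services, int connectionsPerService) {
        return services * connectionsPerService;
    }

    public static void main(String[] args) {
        // 5 services, each with connections=50, against one provider instance.
        System.out.println(clientPools(5, 50) + " cached thread pools per consumer");
    }
}
```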

bert82503 commented 5 years ago

We hit this situation: the server processed the request quickly, but the response reached the client several seconds later; packet captures confirmed this.

In our production environment (Kubernetes), the tracing system showed several seconds spent on the network, and packet captures confirmed it really was a network issue. From there it was handed over to ops. @gangxing

bert82503 commented 5 years ago

If the timeouts happen frequently and you have no tracing system to help, you can use greys or arthas to debug consumer-side latency online, tracing step by step to find the most expensive stage; for high-concurrency systems, first evaluate the tool's own impact on performance.

bert82503 commented 5 years ago

We also hit sporadic timeouts in production. The root cause turned out to be that the provider's Netty IO threads kept throwing ClassNotFoundException, because the exposed API had not been kept compatible :joy:. After fixing that, everything returned to normal.

gosenme commented 5 years ago

@edwardlee03 How do you fix that?

gosenme commented 5 years ago

I am seeing a similar problem here: the provider timeout is set to 2 s, and monitoring call latency on the consumer side we occasionally see timeouts (a dozen or so requests per day) at irregular times. APM shows that at the moments of timeout there are requests going to zookeeper, and those happen to exceed 2 s as well.

bert82503 commented 5 years ago

@edwardlee03 How do you fix that?

@gosenme From the ClassNotFoundException you can tell which class is involved and which jar it depends on; then upgrade the provider's or consumer's dependency on that jar so the exception no longer occurs. The Dubbo community has since added a Cache<ClassName> for class-loading failures, avoiding repeated, concurrent, synchronized loading of the same class.

indasoo commented 5 years ago

@Jaskey Is the built-in monitoring service enabled on the provider? Which registry client are you using, and is a password configured? If the password contains an @ character, you can run into problems like this.

huangxuzhi commented 5 years ago

If the timeouts happen frequently and you have no tracing system to help, you can use greys or arthas to debug consumer-side latency online, tracing step by step to find the most expensive stage; for high-concurrency systems, first evaluate the tool's own impact on performance.

Could you explain concretely how to trace it? I recently hit this problem in production too. @edwardlee03

MrWangLong commented 5 years ago

FiberHome used this very issue as an interview question on me; I have no words...

bert82503 commented 5 years ago

If the timeouts happen frequently and you have no tracing system to help, you can use greys or arthas to debug consumer-side latency online, tracing step by step to find the most expensive stage; for high-concurrency systems, first evaluate the tool's own impact on performance.

Could you explain concretely how to trace it? I recently hit this problem in production too. @edwardlee03

Start with the tools' official docs (the Greys "Java online problem diagnosis" guide and the Arthas documentation); the quick-start sections cover installation and startup and will get you going fast. Failing that, a web search turns up plenty of tutorials; remember, the search engine is your teacher and friend.

bert82503 commented 5 years ago

Dubbo invocation timeout in our production environment: Caused by: com.alibaba.dubbo.remoting.TimeoutException: Waiting server-side response timeout by scan timer. start time: 2018-10-18 12:23:28.116, end time: 2018-10-18 12:33:28.130, client elapsed: 0 ms, server elapsed: 600014 ms, timeout: 600000 ms, request: Request; is this related to the cached thread pool being set to 200?

The exception message says the wait for the server-side response timed out, i.e. the provider was slow; the hint is quite explicit.

bert82503 commented 5 years ago

We have seen this too. We also suspected all kinds of IO issues at first, but in the end it was traced to thread pool exhaustion: requests could not even enter the Dubbo thread pool, the Dubbo framework printed no exception, and of course no request/response parameters were logged either; all the client saw was a timeout.

When the provider's thread pool is full, a log line is printed as well; exceptions generally produce logs, and the Dubbo framework's exception logging is decent. Perhaps you simply did not find the relevant log.

HungFoolishry commented 5 years ago

Besides network problems, could there be other causes? Our company also sees these short-lived spikes, and the server side reports no delay at all.

PKFresher commented 4 years ago

Is there a solution to this yet? We recently hit it too: client-side timeouts while the server is completely healthy, occurring sporadically. It is very unlikely to be the network: we load-tested the client, and after stopping the test the server remained healthy but the client kept timing out, then recovered by itself a few hours later. We now see a few thousand of these exceptions per day, at a low rate.

bert82503 commented 4 years ago

@PKFresher Do you have distributed tracing integrated with dubbo? Turn on the dubbo access log, look at which stage the time is being spent in, then drill down layer by layer.

indasoo commented 4 years ago

@PKFresher Do you have distributed tracing integrated with dubbo? Turn on the dubbo access log, look at which stage the time is being spent in, then drill down layer by layer.

I would suggest just using SkyWalking.

sunshineboyayao commented 4 years ago

We have seen this too. We also suspected all kinds of IO issues at first, but in the end it was traced to thread pool exhaustion: requests could not even enter the Dubbo thread pool, the Dubbo framework printed no exception, and of course no request/response parameters were logged either; all the client saw was a timeout.

"In the end it was traced to thread pool exhaustion": did you find a Thread pool is EXHAUSTED! error in the server-side logs?

shihuncl commented 3 years ago

Has anyone actually solved this? Our production environment also times out occasionally; we have no tracing in place, so I hope someone who has hit and solved this can chime in. We do see dubbo thread-pool-full errors on our side, but these timeouts are not caused by a full thread pool.

shihuncl commented 3 years ago

We tracked our production problem down to a scheduled job that runs every 5 minutes and manipulates large objects, which periodically forces that service node into GC. Requests arriving at that moment have to wait for the GC to finish, so the dubbo call times out.
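A GC pause like the one described can be confirmed without a tracing system by sampling the JVM's collector beans (or by enabling GC logs); a minimal sketch using the standard java.lang.management API:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStatsDemo {
    // Cumulative time (ms) spent in all collectors since JVM start; sampling
    // this periodically reveals spikes that line up with timed-out requests.
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();  // -1 if unsupported
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.gc();  // merely a hint; a collection is not guaranteed
        System.out.println("cumulative GC time (ms): " + totalGcTimeMillis());
    }
}
```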

bert82503 commented 3 years ago

Besides network problems, could there be other causes? Our company also sees these short-lived spikes, and the server side reports no delay at all.

Possible causes:

  1. The request sat queued in the worker queue, which consumed most of the time.
  2. An API jar upgrade made a class incompatible and unloadable; repeatedly loading the same missing class causes blocking (see the ClassLoader source). Newer dubbo versions have fixed this.
  3. A long GC pause.