codingteam / horta-hell

Configurable XMPP bot written in Scala based on the Akka framework.
MIT License

No response from server #414

Closed ForNeVeR closed 7 years ago

ForNeVeR commented 7 years ago

These days, horta very often generates messages like this:

ERROR r.o.c.h.p.jabber.JabberProtocol - Cannot join room codingteam@conference.jabber.ru, retrying in 5 seconds
org.jivesoftware.smack.XMPPException: No response from server.
        at org.jivesoftware.smackx.muc.MultiUserChat.join(MultiUserChat.java:498) ~[horta-hell.jar:0.17.1]
        at org.jivesoftware.smackx.muc.MultiUserChat.join(MultiUserChat.java:392) ~[horta-hell.jar:0.17.1]
        at ru.org.codingteam.horta.protocol.jabber.JabberProtocol$$anonfun$receive$1$$anonfun$2.apply(JabberProtocol.scala:69) ~[horta-hell.jar:0.17.1]
        at ru.org.codingteam.horta.protocol.jabber.JabberProtocol$$anonfun$receive$1$$anonfun$2.apply(JabberProtocol.scala:63) ~[horta-hell.jar:0.17.1]
        at scala.util.Try$.apply(Try.scala:192) ~[horta-hell.jar:0.17.1]
        at ru.org.codingteam.horta.protocol.jabber.JabberProtocol$$anonfun$receive$1.applyOrElse(JabberProtocol.scala:63) ~[horta-hell.jar:0.17.1]
        at akka.actor.Actor$class.aroundReceive(Actor.scala:465) ~[horta-hell.jar:0.17.1]
        at ru.org.codingteam.horta.protocol.jabber.JabberProtocol.aroundReceive(JabberProtocol.scala:19) ~[horta-hell.jar:0.17.1]
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) [horta-hell.jar:0.17.1]
        at akka.actor.ActorCell.invoke(ActorCell.scala:487) [horta-hell.jar:0.17.1]
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) [horta-hell.jar:0.17.1]
        at akka.dispatch.Mailbox.run(Mailbox.scala:220) [horta-hell.jar:0.17.1]
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) [horta-hell.jar:0.17.1]
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [horta-hell.jar:0.17.1]
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [horta-hell.jar:0.17.1]
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [horta-hell.jar:0.17.1]
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [horta-hell.jar:0.17.1]
12:16:58.216 [CodingteamSystem-akka.actor.default-dispatcher-4] INFO  r.o.c.h.p.jabber.MucMessageHandler - Room codingteam@conference.jabber.ru stopped

After that, it drops the room and tries to reconnect, without success. I suspect there's something on the server side: http://stackoverflow.com/questions/13290499/could-not-connect-to-the-xmpp-server-via-smack-no-response-from-server

Need to investigate.

ForNeVeR commented 7 years ago

It seems that we get a timeout from the join method of the Smack room.

ForNeVeR commented 7 years ago

Something is still fucky.

ForNeVeR commented 7 years ago

Well, currently I suspect it's a concurrency issue. Probably Akka sometimes schedules various XMPP actions on different threads, and the non-thread-safe Smack library does something wrong there.

ForNeVeR commented 7 years ago

One possible solution is to pin the JabberProtocol actor to a dedicated thread, in the same manner as I've done in my branch for #408.
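For reference, the usual way to pin an actor to its own thread in Akka is a `PinnedDispatcher` configured in `application.conf`; a minimal sketch (the dispatcher name here is hypothetical, not taken from the actual branch):

```hocon
# application.conf: a dispatcher that gives one actor a dedicated thread
jabber-pinned-dispatcher {
  type = PinnedDispatcher
  executor = "thread-pool-executor"
}
```

The actor would then be created with `Props[JabberProtocol].withDispatcher("jabber-pinned-dispatcher")`, so that all Smack calls made from inside it run on a single thread.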

ForNeVeR commented 7 years ago

Tried that, and it didn't help. Need to investigate further.

ForNeVeR commented 7 years ago

Alright, #408 is probably the way out of these troubles, but first of all we need to identify the problem. I'll report back after that.

rexim commented 7 years ago

@ForNeVeR as far as I can understand, the environment is a Linux Docker container on Windows, right? Have you tried other environments? Is this problem reproducible everywhere?

ForNeVeR commented 7 years ago

I have reproduced the problem both on my Windows machine and inside a Docker container running on Linux under Hyper-V on the same machine. Currently I think the problem is universal (but it may be triggered more often when entering multiple crowded rooms at once).

It may be related to our XMPP server (horta uses its own account on codingteam.org.ru); I haven't tried it with other servers yet, and will check that as well.

ForNeVeR commented 7 years ago

Currently I think that MTU problems on my host machine were involved (the MTU was too big, which caused trouble for my desktop XMPP client as well; some other services were affected too), and these problems are/were applicable to my Docker instance as well. It seems that I've tuned my MTU, and now I can run horta on my Windows machine without problems.

But dockerized horta still doesn't work for some reason. It starts, but reconnects every 2 minutes with the same message ("no response from server").

The proper MTU value for my host network is 1468. I've confirmed that the MTU value inside the horta container isn't right by running the following from the host (Windows) machine:

$ docker exec horta-hell  sh -c "ip a | grep mtu"
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1
3: gre0@NONE: <NOARP> mtu 1476 qdisc noop state DOWN group default qlen 1
4: gretap0@NONE: <BROADCAST,MULTICAST> mtu 1462 qdisc noop state DOWN group default qlen 1000
5: ip_vti0@NONE: <NOARP> mtu 1428 qdisc noop state DOWN group default qlen 1
6: ip6_vti0@NONE: <NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1
7: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1
8: ip6tnl0@NONE: <NOARP> mtu 1452 qdisc noop state DOWN group default qlen 1
9: ip6gre0@NONE: <NOARP> mtu 1448 qdisc noop state DOWN group default qlen 1
35: eth0@if36: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default

(I believe that the last line shows the MTU value used.)

I've tuned the MTU on the Docker daemon (by adding "mtu": 1468 to my Docker config) and can confirm that it was applied successfully:
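For reference, on Linux the daemon-wide MTU setting goes into /etc/docker/daemon.json (Docker for Windows exposes the same JSON in its settings UI); a minimal sketch:

```json
{
  "mtu": 1468
}
```

Note that this only affects the daemon's default bridge network, so existing containers may need to be recreated after restarting the daemon for the new MTU to take effect.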

$ docker exec horta-hell  sh -c "ip a | grep mtu"
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1
3: gre0@NONE: <NOARP> mtu 1476 qdisc noop state DOWN group default qlen 1
4: gretap0@NONE: <BROADCAST,MULTICAST> mtu 1462 qdisc noop state DOWN group default qlen 1000
5: ip_vti0@NONE: <NOARP> mtu 1428 qdisc noop state DOWN group default qlen 1
6: ip6_vti0@NONE: <NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1
7: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1
8: ip6tnl0@NONE: <NOARP> mtu 1452 qdisc noop state DOWN group default qlen 1
9: ip6gre0@NONE: <NOARP> mtu 1448 qdisc noop state DOWN group default qlen 1
27: eth0@if28: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1468 qdisc noqueue state UP group default

Despite that, dockerized horta still reconnects every 2 minutes.

@Minoru, @rexim, anybody: do you have any insight into how networking / MTU works in Docker, and in Linux / computer networks generally? Maybe I need to change it on the host Linux VM as well? Maybe MTU has nothing to do with the problem at all?

ForNeVeR commented 7 years ago

Update: to make sure, I've checked which interface the container uses to connect to ctor:

$ # codingteam.org.ru IP address is 198.211.121.192
$ docker exec horta-hell sh -c "ip route get 198.211.121.192"
198.211.121.192 via 172.17.0.1 dev eth0  src 172.17.0.2
    cache

I think that means I got the interface right, and it really should use the proper MTU value.

Minoru commented 7 years ago

As far as I understand, packets larger than a machine's MTU get fragmented rather than simply dropped. Try downloading and uploading something largish, like tens of kilobytes; if it works fine, I think we can drop the MTU hypothesis (but I don't know much about this beyond what I've overheard from smarter people).

Where did you get the idea about MTU in the first place?

ForNeVeR commented 7 years ago

It was just a random guess to answer the question "what the hell is going wrong with my XMPP connection", and it worked out!

Most of the time I wasn't able to connect to XMPP, and I also wasn't able to paste big pieces of text into an SSH session, so I'm now convinced that I actually had MTU problems (I also tried some tricks with ping to find the maximal working MTU, and it was 1468 instead of the default 1500). After those problems were fixed, my own XMPP connection problems went away as well. That leads me to the conclusion that horta's problems are caused by the same issue (because my network setup is definitely shitty and the default setting of 1500 doesn't work well).

Regarding your statement that "packets larger than machine's MTU get fragmented": I believe that's not generally right. There's a "don't fragment" flag in the IP header, but I don't know how many applications use it (or whether that's the application's responsibility at all). Probably my network setup (I have two visible routers in a chain, and god knows how many on the provider side) sometimes (?) does something weird with packets exceeding the MTU.

ForNeVeR commented 7 years ago

Try downloading and uploading something largish, like tens of kilobytes

Well, I've tried the following:

$ docker exec horta-hell sh -c "ping -s 1470 -c 1 codingteam.org.ru"
PING codingteam.org.ru (198.211.121.192): 1470 data bytes
1478 bytes from 198.211.121.192: icmp_seq=0 ttl=37 time=0.340 ms
--- codingteam.org.ru ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.340/0.340/0.340/0.000 ms
$ docker exec horta-hell sh -c "ping -s 1480 -c 1 codingteam.org.ru"
PING codingteam.org.ru (198.211.121.192): 1480 data bytes
--- codingteam.org.ru ping statistics ---
1 packets transmitted, 0 packets received, 100% packet loss

It seems that ping payloads bigger than 1470 bytes can't pass through my network even without me touching the "don't fragment" flag (maybe Debian's ping sets this flag by default; I don't know).
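The arithmetic behind those ping sizes, assuming a plain 20-byte IPv4 header and an 8-byte ICMP echo header (so `ping -s N` emits an IP packet of N + 28 bytes):

```python
IP_HEADER = 20    # IPv4 header without options, bytes
ICMP_HEADER = 8   # ICMP echo header, bytes

def packet_size(ping_payload: int) -> int:
    """Size of the IP packet produced by `ping -s <ping_payload>`."""
    return ping_payload + IP_HEADER + ICMP_HEADER

def max_ping_payload(mtu: int) -> int:
    """Largest `ping -s` payload that fits a link with the given MTU."""
    return mtu - IP_HEADER - ICMP_HEADER

print(packet_size(1470))       # 1498
print(max_ping_payload(1500))  # 1472
print(max_ping_payload(1468))  # 1440
```

By this arithmetic, the passing 1470-byte payload corresponds to a 1498-byte packet, which is somewhat above the 1468 figure measured on the host; I may be missing some tunnel or encapsulation overhead somewhere in between.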

ForNeVeR commented 7 years ago

Alright, current status report:

I will be watching.

ForNeVeR commented 7 years ago

Nope, doesn't work. Will investigate.

ForNeVeR commented 7 years ago

Current hypothesis is that I failed everything.

Now testing Horta with the new socket factory disabled.