There is an exception when this happens:
10:02:24.725 [bootique-http-86] WARN org.eclipse.jetty.io.ManagedSelector:510 - Fatal select() failure {}[]
java.io.IOException: Connection reset by peer
at java.base/sun.nio.ch.FileDispatcherImpl.close0(Native Method)
at java.base/sun.nio.ch.SocketDispatcher.close(SocketDispatcher.java:55)
at java.base/sun.nio.ch.SocketChannelImpl.kill(SocketChannelImpl.java:907)
at java.base/sun.nio.ch.SelectorImpl.processDeregisterQueue(SelectorImpl.java:267)
at java.base/sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:138)
at java.base/sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:124)
at java.base/sun.nio.ch.SelectorImpl.select(SelectorImpl.java:141)
at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.select(ManagedSelector.java:472)
at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:409)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produceTask(EatWhatYouKill.java:360)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:184)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:367)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:782)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:914)
at java.base/java.lang.Thread.run(Thread.java:834)
It looks like Java couldn't close the file descriptor properly.
Does any request work? Just asking in terms of reproducing the problem: e.g., if I set up Jetty and hit it with simple wget or curl requests, will this trigger the problem in your experience, or haven't you tried that?
Some observations:
Because it takes a while to make the problem happen, we've been doing this:
# for i in {1..100}; do for j in {1..20}; do curl -L https://ourserver.com/something >/dev/null 2>&1& done && wait; done
That tends to make it happen within about 15 seconds. It doesn't seem to matter whether the request causes Jetty to return a file from disk or just something generated in memory.
We can make it happen easily enough, but we haven't tested:
I've reproduced this with Jetty 9.4.24 on my FreeBSD 11.3 box.
The problem reproduces with both openjdk11 and openjdk12, but not with openjdk13.
Thanks so much for the time you have spent on this. We were just about to try and create an app to reproduce the problem.
I'm guessing this is likely FreeBSD-specific; otherwise it would have received more attention before now.
That was my thought too. That said, I'm a little suspicious that this commit might be involved in why this works on openjdk13:
https://github.com/battleblow/openjdk-jdk13u/commit/d30ce315cba51199a1c79aaa986cceb51e41b7a0
I'm going to try and backport that commit and see if it makes any difference.
It did not (make a difference). Seemed worth a shot though. I'll dig more deeply into what is going on.
I didn't get as much time on this yesterday as I would have liked.
I did try the same test with Tomcat, and found that it didn't exhibit the problem; i.e., I'm running it with openjdk11 without seeing the same behaviour.
I considered doing a bisection search for the offending commit, but am not going to do this initially since the FreeBSD bits weren't added to the openjdk13 tree until relatively late in its release cycle.
At this point I'll do some more review of the commits against the likely parts of the source tree. If nothing obvious stands out, I'll do some major backporting to attempt to get a working version of openjdk11, then trim things down until I find the change that resolves the problem.
Is it worth trying to get the Jetty people involved in this ticket, since it only seems to happen there?
I would say definitely, if it happens on other operating systems. At the very least they might be able to give some clue as to which piece of code in Jetty to look at, so we could tell what part of the JVM to focus on.
I think Jetty lead devs are still here in Sydney, in the next suburb to me. I'll not stand on their doorstep, but I'll see what I can do.
Do you have a minimum piece of code to reproduce the problem? Something you could share in a gist?
I've been using the test case you provided -- start Jetty and run curl in a loop.
I have created a gist here https://gist.github.com/angeloudy/fbc09619fd6c54544fd631b6f040cf1e#file-steps_to_reproduce-txt with the debug logs and all the steps I did to reproduce the problem.
It looks like we have the answer to what is happening here, basically summarised at https://mail.openjdk.java.net/pipermail/nio-dev/2020-February/007055.html: on FreeBSD, close(2) on a socket can fail with ECONNRESET when the peer has reset the connection, and the JDK turns that failure into the IOException that kills the selector.
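For anyone who wants to see the underlying OS behaviour outside the JVM, here is a small standalone C sketch I put together (my own illustration, untested and timing-sensitive, so treat it as approximate): the client aborts the connection with an RST while the server still has unread data, and the server's subsequent close() can then fail with ECONNRESET on FreeBSD even though the descriptor is released.

    /* econnreset_close.c -- illustrative only; error checking omitted. */
    #include <errno.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        struct sockaddr_in sin;
        socklen_t len = sizeof(sin);

        /* Listener on an ephemeral loopback port. */
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        bind(lfd, (struct sockaddr *)&sin, sizeof(sin));
        getsockname(lfd, (struct sockaddr *)&sin, &len);
        listen(lfd, 1);

        /* Connect, then send a byte the server never reads. */
        int cfd = socket(AF_INET, SOCK_STREAM, 0);
        connect(cfd, (struct sockaddr *)&sin, sizeof(sin));
        int sfd = accept(lfd, NULL, NULL);
        write(cfd, "x", 1);

        /* SO_LINGER with l_linger == 0 makes close() send an RST. */
        struct linger lg = { 1, 0 };
        setsockopt(cfd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
        close(cfd);
        usleep(100000);   /* give the RST time to arrive */

        /* With the RST received and unread data still pending, close()
           can fail with ECONNRESET here (the fd is released anyway). */
        if (close(sfd) < 0)
            printf("close() failed: %s\n", strerror(errno));
        else
            printf("close() succeeded\n");
        return 0;
    }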
@battleblow I wonder if you want to introduce yourself and your port to the Oracle team there. Perhaps a first step toward elevating FreeBSD to an official port, if that's a useful thing.
Someone needs to port some changes from JDK 13 back to JDK 11; I'm not sure Oracle considers it a priority. Unfortunately it is outside my skill set, never having worked on any part of the JDK itself. Do we just try/catch the exception, or try and port most of this: http://hg.openjdk.java.net/jdk/jdk/rev/193a8f1a4f3b
There is code I found in a patch for FreeBSD Java 6 which looks like this:
diff -uNpr ../orig/jdk/src/solaris/native/sun/nio/ch/Net.c ./jdk/src/solaris/native/sun/nio/ch/Net.c
--- ../orig/jdk/src/solaris/native/sun/nio/ch/Net.c	2009-04-24 03:34:35.000000000 -0400
+++ ./jdk/src/solaris/native/sun/nio/ch/Net.c	2009-05-14 23:41:34.000000000 -0400
@@ -140,8 +140,31 @@ Java_sun_nio_ch_Net_localPort(JNIEnv *env
     SOCKADDR sa;
     int sa_len = SOCKADDR_LEN;
     if (getsockname(fdval(env, fdo), (struct sockaddr *)&sa, &sa_len) < 0) {
+#ifdef _ALLBSD_SOURCE
+        /*
+         * XXXBSD:
+         * ECONNRESET is specific to the BSDs. We can not return an error,
+         * as the calling Java code will raise a java.lang.Error given the
+         * expectation that getsockname() will never fail. According to the
+         * Single UNIX Specification, it shouldn't fail. As such, we just
+         * fill in generic Linux-compatible values.
+         */
+        if (errno == ECONNRESET) {
+            struct sockaddr_in *sin;
+            sin = (struct sockaddr_in *) &sa;
+            bzero(sin, sizeof(*sin));
+            sin->sin_len = sizeof(struct sockaddr_in);
+            sin->sin_family = AF_INET;
+            sin->sin_port = htonl(0);
+            sin->sin_addr.s_addr = INADDR_ANY;
+        } else {
+            handleSocketError(env, errno);
+            return -1;
+        }
+#else /* _ALLBSD_SOURCE */
         handleSocketError(env, errno);
         return -1;
+#endif /* _ALLBSD_SOURCE */
https://lists.freebsd.org/pipermail/freebsd-bugs/2009-August/036381.html
This looks like the problem has been found and fixed once before.
The fix you reference for openjdk6 is still in the openjdk11 source. See
In general, backporting a large patch like that adds significantly to our merge conflict burden, so I wouldn't advocate for that if there is a smaller change we can make to fix the problem.
I'm wondering if the right solution for openjdk11 is to just not throw an exception here if errno is ECONNRESET. I'll give that a build and see if that solves the problem, at least locally.
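For the record, the sort of change I have in mind looks roughly like this. It's a sketch against the close path in src/java.base/unix/native/libnio/ch/FileDispatcherImpl.c (the close0 frame in the stack trace above); treat the exact placement and guard as approximate, not the actual patch.

    /* Sketch only -- not the attached patch itself. */
    static void closeFileDescriptor(JNIEnv *env, int fd)
    {
        if (fd != -1) {
            int result = close(fd);
    #ifdef _ALLBSD_SOURCE
            /* On the BSDs, close(2) can fail with ECONNRESET when the
             * peer has reset the connection. The descriptor is still
             * released, so treat this as a successful close instead of
             * raising an IOException that propagates into the selector
             * as a fatal select() failure. */
            if (result < 0 && errno == ECONNRESET)
                result = 0;
    #endif
            if (result < 0)
                JNU_ThrowIOExceptionWithLastError(env, "Close failed");
        }
    }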
Please try the attached patch and see if that solves the problem for you.
@battleblow I tried the patch you provided, and the problem is not happening any more. Can you merge the patch into the FreeBSD port?
That is my plan. I want to get some review first though.
We have it on our test servers now and will put it into production next week if we don't hit any issues.
I've pushed this change into the repo and the FreeBSD port.
Please let me know what happens in production so we can close this out.
We've been running this in production for a week now and everything is good.
Environment: FreeBSD 12.1, java/openjdk11.0.5+10.1 from ports, jetty-9.4.24.v20191120 (although the problem certainly exists in 9.4.21 and probably other versions).
Inbound connections to a Jetty server running HTTP/2 with either TLS 1.2 or 1.3 are dropped at random (about one in 100 at first). Each dropped connection remains in the CLOSED state according to sockstat and is never released.
As the number of sockets builds up, more and more connections are dropped until the server stops responding entirely.
We've reproduced this in our own application and also noted it happening in Apache Solr 8.2.0.
Reverting to openjdk8 or upgrading to openjdk13 fixes the problem. Using -Djdk.net.usePlainSocketImpl with openjdk13 doesn't cause the problem to happen, so the work done in https://bugs.openjdk.java.net/browse/JDK-8218559 may not be relevant to this issue.