Network sockets not properly closed

ari commented 5 years ago

FreeBSD 12.1 java/openjdk11.0.5+10.1 from ports jetty-9.4.24.v20191120 (although the problem certainly exists in 9.4.21 and probably other versions)

Inbound connections to a jetty server running HTTP/2 with either TLS 1.2 or 1.3 result in dropped connections randomly (about one in 100 at first). The connection remains in CLOSED state according to sockstat and is never released.

As the number of sockets builds up, more and more connections are dropped until the server stops responding entirely.

We've reproduced this in our own application and also noted it happening in Apache Solr 8.2.0.

Reverting to openjdk8 or upgrading to openjdk13 fixes the problem. Using -Djdk.net.usePlainSocketImpl with openjdk13 doesn't cause the problem to happen, so the work done in https://bugs.openjdk.java.net/browse/JDK-8218559 may not be relevant to this issue.

angeloudy commented 5 years ago

There is an exception when this happens

10:02:24.725 [bootique-http-86] WARN  org.eclipse.jetty.io.ManagedSelector:510 - Fatal select() failure {}[]
java.io.IOException: Connection reset by peer
        at java.base/sun.nio.ch.FileDispatcherImpl.close0(Native Method)
        at java.base/sun.nio.ch.SocketDispatcher.close(SocketDispatcher.java:55)
        at java.base/sun.nio.ch.SocketChannelImpl.kill(SocketChannelImpl.java:907)
        at java.base/sun.nio.ch.SelectorImpl.processDeregisterQueue(SelectorImpl.java:267)
        at java.base/sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:138)
        at java.base/sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:124)
        at java.base/sun.nio.ch.SelectorImpl.select(SelectorImpl.java:141)
        at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.select(ManagedSelector.java:472)
        at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:409)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produceTask(EatWhatYouKill.java:360)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:184)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
        at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:367)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:782)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:914)
        at java.base/java.lang.Thread.run(Thread.java:834)

It looks like java couldn't close the file descriptor properly.

battleblow commented 5 years ago

Does any request work? Just asking in terms of reproducing the problem. E.g., if I set up Jetty and hit it with simple wget or curl requests will this trigger the problem in your experience or haven't you tried that?

battleblow commented 5 years ago

Some observations:

It is worth noting that src/java.base/unix/native/libnio/ch/FileDispatcherImpl.c is identical in both the 11 and 13 trees.
The implementation of close0() is essentially to just call close(2).
A failure of close(2) can be due to ECONNRESET, which is the error above.

ari commented 5 years ago

Because it takes a while to make the problem happen we've been doing this:

# for i in {1..100}; do for j in {1..20}; do curl -L https://ourserver.com/something >/dev/null 2>&1& done && wait; done

That tends to make it happen within about 15 seconds. It doesn't seem to matter whether the request is causing jetty to return a file from disk or just something generated in memory.

We can make it happen easily enough, but we haven't tested:

http 1
no SSL/TLS
other operating systems beyond FreeBSD 12.1 and 11.2
web servers other than jetty (such as tomcat)
other than noting Solr experienced a similar problem for us, we haven't tried creating a reproducible snippet of code. We just breathed a sigh of relief when Java13 made our problems go away. We would prefer to return to Java11 for a long term supported platform.

battleblow commented 5 years ago

I've reproduced this with jetty 9.4.24 on my 11.3 box.

The problem reproduces with both openjdk11 and openjdk12, but works correctly with openjdk13.

ari commented 5 years ago

Thanks so much for the time you have spent on this. We were just about to try and create an app to reproduce the problem.

I'm guessing this is likely FreeBSD specific otherwise it would have received more attention before now.

battleblow commented 5 years ago

That was my thought too. That said, I'm a little suspicious that this commit might be involved in why this works on openjdk13

https://github.com/battleblow/openjdk-jdk13u/commit/d30ce315cba51199a1c79aaa986cceb51e41b7a0

battleblow commented 5 years ago

I'm going to try and back port that commit and see if it makes any difference

battleblow commented 5 years ago

It did not (make a difference). Seemed worth a shot though. I'll dig more deeply into what is going on.

battleblow commented 5 years ago

I didn't get as much time on this yesterday as I would like.

I did try the same test with Tomcat, and found that it didn't exhibit the problem. I.e., I'm running it with openjdk11 without seeing the same behaviour.

I considered doing a bisection search for the offending commit, but am not going to do this initially since the FreeBSD bits weren't added to the openjdk13 tree until relatively late in its release cycle.

At this point I'll do some more review of the commits against the likely parts of the source tree. If nothing obvious stands out I'll do some major backporting to attempt to get a working version of openjdk11 and try and trim things down till I find the change that resolves the problem.

ari commented 4 years ago

Is it worth trying to get the jetty people involved in this ticket, since it only seems to happen there?

battleblow commented 4 years ago

I would say definitely if it happens on other operating systems. They might at the very least be able to give some clue as to what piece of code in Jetty to look at it so we could tell what part of the JVM to focus on.

ari commented 4 years ago

I think Jetty lead devs are still here in Sydney, in the next suburb to me. I'll not stand on their doorstep, but I'll see what I can do.

Do you have a minimum piece of code to reproduce the problem? Something you could share in a gist?

battleblow commented 4 years ago

I've been using the test case you provided -- start Jetty and run curl in a loop.

angeloudy commented 4 years ago

I have created a gist here https://gist.github.com/angeloudy/fbc09619fd6c54544fd631b6f040cf1e#file-steps_to_reproduce-txt with the debug logs and all the steps I did to reproduce the problem.

ari commented 4 years ago

It looks like we have the answer to what is happening here, basically summarised https://mail.openjdk.java.net/pipermail/nio-dev/2020-February/007055.html

@battleblow I wonder if you want to introduce yourself and your port to the Oracle team there. Perhaps a first step toward elevating FreeBSD to an official port, if that's a useful thing.

Someone needs to port some changes from JDK13 back to JDK11. Not sure if Oracle consider it a priority. Unfortunately it is outside my skill set, never having worked on any part of the JDK itself. Do we just try/catch the exception, or try and port most of this http://hg.openjdk.java.net/jdk/jdk/rev/193a8f1a4f3b

angeloudy commented 4 years ago

There is code I found in a patch for FreeBSD Java 6 which looks like this:

Xdiff -uNpr ../orig/jdk/src/solaris/native/sun/nio/ch/Net.c ./jdk/src/solaris/native/sun/nio/ch/Net.c X--- ../orig/jdk/src/solaris/native/sun/nio/ch/Net.c 2009-04-24 03:34:35.000000000 -0400 X+++ ./jdk/src/solaris/native/sun/nio/ch/Net.c 2009-05-14 23:41:34.000000000 -0400 X@@ -140,8 +140,31 @@ Java_sun_nio_ch_Net_localPort(JNIEnv en X SOCKADDR sa; X int sa_len = SOCKADDR_LEN; X if (getsockname(fdval(env, fdo), (struct sockaddr )&sa, &sa_len) < 0) { X+#ifdef _ALLBSD_SOURCE X+ / X+ XXXBSD: X+ ECONNRESET is specific to the BSDs. We can not return an error, X+ as the calling Java code with raise a java.lang.Error given the expectation X+ that getsockname() will never fail. According to the Single UNIX Specification, X+ it shouldn't fail. As such, we just fill in generic Linux-compatible values. X+ / X+ if (errno == ECONNRESET) { X+ struct sockaddr_in sin; X+ sin = (struct sockaddr_in ) &sa; X+ bzero(sin, sizeof(sin)); X+ sin->sin_len = sizeof(struct sockaddr_in); X+ sin->sin_family = AF_INET; X+ sin->sin_port = htonl(0); X+ sin->sin_addr.s_addr = INADDR_ANY; X+ } else { X+ handleSocketError(env, errno); X+ return -1; X+ } X+#else / _ALLBSD_SOURCE / X handleSocketError(env, errno); X return -1; X+#endif / _ALLBSD_SOURCE /

https://lists.freebsd.org/pipermail/freebsd-bugs/2009-August/036381.html

This looks like the problem has been found and fixed once before.

battleblow commented 4 years ago

The fix you reference for openjdk6 is still in the openjdk11 source. See

https://github.com/battleblow/openjdk-jdk11u/blob/bsd-port/src/java.base/unix/native/libnio/ch/Net.c#L340

battleblow commented 4 years ago

In general, backporting a large patch like that adds significantly to our merge conflict version, so I wouldn't advocate for that if there is a smaller change we can make to fix the problem.

battleblow commented 4 years ago

I'm wondering if the right solution for openjdk11 is to just not throw an exception here if errno is ECONNRESET

https://github.com/battleblow/openjdk-jdk11u/blob/bsd-port/src/java.base/unix/native/libnio/ch/FileDispatcherImpl.c#L291

I'll give that a build and see if that solves the problem at least locally

battleblow commented 4 years ago

Please try the attached patch and see if that solves the problem for you.

close.diff.txt

angeloudy commented 4 years ago

@battleblow I tried the patch you provided, and the problem is not happening any more. Can you merge the patch into freebsd port?

battleblow commented 4 years ago

That is my plan. I want to get some review first though.

ari commented 4 years ago

We have it on our test servers now and will put it into production next week if we don't hit any issues.

battleblow commented 4 years ago

I've pushed this change into the repo and the FreeBSD port.

Please let me know what happens in production so we can close this out

ari commented 4 years ago

We've been running this in production for a week now and everything is good.

battleblow / openjdk-jdk11u

Network sockets not properly closed #94