OpenJ9 hangs on high load #18103

Open anuragagarwal561994 opened 1 year ago

anuragagarwal561994 commented 1 year ago

We have a Tomcat server running on OpenJ9, and at pod startup a lot of requests start coming in.

Since no cache or connections exist initially, latency is higher and there is some contention at startup; the connection queue and accept queue start filling up.

Sometimes everything stabilises after a few seconds; otherwise the JVM hangs. By "hangs" I mean that even after the system should have stabilised, the JVM has become so slow that any new request to the application takes more than 3-5 minutes to complete.

We initially suspected an issue with the thread pools or with Tomcat itself, so we tried a different port and then a separate server running in parallel to Tomcat inside the same JVM process, but everything was stuck.

So we tried taking a thread dump using jstack, and that was stuck as well for close to 5-10 minutes. Other commands like jmap are also not working and take a long time.

Then we tried using kill -3 to take a javacore dump; that works instantly and yields a thread dump and other useful information, but we investigated it and didn't see anything wrong with the application itself.
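
As an aside, kill -3 sends SIGQUIT, which OpenJ9's default dump agents handle by writing a javacore. The agents configured in a given JVM can be listed with the standard -Xdump:what option:

java -Xdump:what -version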

We tried taking multiple dumps a minute apart, and also tried leaving the pod in this state for more than 2-3 hours, but it never recovers.

We suspected either the OS or the JVM to be the issue here, so we tried the HotSpot JVM, and the issue never happens there even after multiple attempts. Even the initial warmup is much faster, so we suspect an issue with the OpenJ9 JVM.

We are using the latest Java 17.0.8 with OpenJ9 0.40.0.

We don't know which version actually introduced the issue, since we regularly update and the problem has existed for a long time; we have only just started investigating it.

The oldest version we tried is 17.0.4; I have to check again which OpenJ9 version that one ships.

pshipton commented 1 year ago

Please collect verbose GC logs and provide them (https://eclipse.dev/openj9/docs/vgclog/), along with a couple of javacores taken 1 minute apart.

anuragagarwal561994 commented 1 year ago

What is the best way to share the core files privately, since this is an enterprise application?

pshipton commented 1 year ago

Use any file sharing service you trust and password protect the files, then send me the details via email.

anuragagarwal561994 commented 4 months ago

Hey, sorry, I was not able to share the files since this is a production application, but now, almost a year later, we have observed something.

For a quick resolution we shifted to HotSpot and saw no issues, which made us believe it might not be an application bug but possibly a JVM bug.

We have been scaling gRPC at our organisation, gradually moving different services from HTTP to gRPC, and around that time things slowly started getting worse day by day.

But we were not able to identify this earlier.

Now that we are moving one of our heavy services, which runs on OpenJ9, to gRPC, we are seeing the same issue again in that service. Note that we have a feature flag that controls whether requests to this service are sent over HTTP or gRPC.

When gRPC is disabled there are no hangs, but when we scale the feature to, say, 50-50, a few pods become unresponsive; scaling it back to 0 eradicates the issue. Do you know by chance what might be happening, something in the JIT maybe?

pshipton commented 4 months ago

Not offhand; we'd need to collect some diagnostic files and take a look.

anuragagarwal561994 commented 4 months ago

Let me try collecting the data again. Since it has been some time since I used OpenJ9, can you guide me on what needs to be collected for debugging, and which commands to use?

pshipton commented 4 months ago

You can turn on verbose GC logging with: -verbose:gc -Xverbosegclog:gclog,5,1000. See https://eclipse.dev/openj9/docs/vgclog/ and https://eclipse.dev/openj9/docs/xverbosegclog/
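
As an illustration, a full launch line with those options might look like the following (app.jar is just a placeholder for the real application):

java -verbose:gc -Xverbosegclog:gclog,5,1000 -jar app.jar

With the filename gclog, this rotates through 5 log files, each holding output from up to 1000 GC cycles.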

Also, when a hang is occurring, send kill -QUIT <pid> to the process to generate a javacore*.txt file. Wait about a minute and do it again, then wait another minute and do it once more. Collect the 3 javacore files and the gclog.001 through gclog.005 files. Zip them up and either attach them to this public issue or send them to me privately.
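
Scripted, the collection step looks roughly like this (the pgrep pattern and file names are illustrative; javacore files land in the process working directory by default):

PID=$(pgrep -f tomcat)
for i in 1 2 3; do kill -QUIT "$PID"; sleep 60; done
zip dumps.zip javacore*.txt gclog.00*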

anuragagarwal561994 commented 3 months ago

@pshipton I have sent you the mail with the attached dump files.

anuragagarwal561994 commented 2 months ago

@pshipton did you happen to find anything on this?

pshipton commented 2 months ago

The javacores show

1LKDEADLOCK    Deadlock detected !!!
NULL           ---------------------
NULL
2LKDEADLOCKTHR  Thread "pool-20-thread-121" (0x0000000003EB5200)
3LKDEADLOCKWTR    is waiting for:
4LKDEADLOCKOBJ      java/util/concurrent/locks/ReentrantLock$NonfairSync@0x00000004A24171F8
3LKDEADLOCKOWN    which is owned by:
2LKDEADLOCKTHR  Thread "pool-3-thread-1" (0x0000000001320A00)
3LKDEADLOCKWTR    which is waiting for:
4LKDEADLOCKMON      sys_mon_t:0x000078DC6870B7C8 infl_mon_t: 0x000078DC6870B848:
4LKDEADLOCKOBJ      java/util/concurrent/ConcurrentHashMap$Node@0x0000000526F17390
3LKDEADLOCKOWN    which is owned by:
2LKDEADLOCKTHR  Thread "pool-20-thread-121" (0x0000000003EB5200)
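
Reading that report: "pool-20-thread-121" holds the monitor on a ConcurrentHashMap bin (the Node lock taken internally by methods like compute/computeIfAbsent) while waiting for a ReentrantLock, and "pool-3-thread-1" holds that ReentrantLock while waiting for the same bin. A minimal, hypothetical Java sketch of this lock-ordering pattern (the names are illustrative, not from the application, and whether it actually deadlocks depends on timing):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

public class DeadlockSketch {
    static final ReentrantLock lock = new ReentrantLock();
    static final ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();

    public static void main(String[] args) {
        // Thread A: locks the map bin inside compute(), then tries to take the ReentrantLock.
        new Thread(() -> map.compute("key", (k, v) -> {
            lock.lock();                    // blocks forever if thread B already holds the lock
            try { return "a"; } finally { lock.unlock(); }
        }), "pool-20-thread-121").start();

        // Thread B: takes the ReentrantLock, then touches the same bin, blocking on its Node monitor.
        new Thread(() -> {
            lock.lock();
            try {
                map.compute("key", (k, v) -> "b");  // blocks forever if thread A holds the bin
            } finally { lock.unlock(); }
        }, "pool-3-thread-1").start();
    }
}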

anuragagarwal561994 commented 2 months ago

@pshipton that was in the first set of dump files I sent; those were incorrect, as there was a potential deadlock in the code. In the next mail I sent a drive link where the above deadlock is removed and the actual issue is highlighted instead.

manqingl commented 2 months ago

I suggest opening a Salesforce case here: https://w3.ibm.com/w3publisher/ibm_runtime_technologies/support/reporting-java-issues

The following link explains how to collect MustGather data:

https://www.ibm.com/support/pages/mustgather-read-first-runtimes-java-technology

@anuragagarwal561994 Anurag Agarwal

pshipton commented 2 months ago

I don't see anything obviously wrong. Have you tried a profiler to see what it's doing?

anuragagarwal561994 commented 2 months ago

That's the thing: the JVM becomes so slow that nothing really works, not even jstack or jmap. It is strange, hard to observe, and happens on random machines.

pshipton commented 2 months ago

Are you entitled to any IBM support?

anuragagarwal561994 commented 2 months ago

No, we don't have that; we are just using it as an open source project.

pshipton commented 2 months ago

See the IBM information about the diagnostic data to collect; perhaps something in the list will provide clues about the problem. If we can't figure it out and you are able to provide everything in the list, I can try to get help from IBM support on your behalf, on the theory that there is a problem with OpenJ9 that we want to find and fix.

https://www.ibm.com/support/pages/mustgather-setup-hang-or-performance-problems-linux
https://www.ibm.com/support/pages/mustgather-gathering-data-hang-or-performance-problems-linux