anuragagarwal561994 opened 1 year ago
Pls collect verbose GC logs and provide them, https://eclipse.dev/openj9/docs/vgclog/. Also a couple of javacores 1 minute apart.
So what is the best way to share the core files privately, since this is an enterprise application?
Use any file-sharing service you trust and password-protect the files, then send me the details via email.
Hey, sorry I was not able to share the files since this was a production application, but almost a year later we have observed something.
For a quick resolution we shifted to HotSpot and saw no issues, which made us believe that it might not be an application bug but a JVM bug.
We have been scaling gRPC at our organisation, and at that time we were gradually migrating different services from HTTP to gRPC when things slowly started getting worse day by day.
But we were not able to identify this earlier.
Now that we are moving one of our heavy services to gRPC, and it is using OpenJ9, we have started seeing the same issue again in that service. Note that we have a feature flag to control whether a request is sent to this service over HTTP or gRPC.
When gRPC is not enabled, no hang-ups are seen, but when scaling the feature to, say, 50-50, we start seeing a few pods become unresponsive; scaling it back to 0 eradicates the issue. Do you know by chance what might be happening, something in the JIT maybe?
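(Aside, for context: a percentage-based rollout flag like the one described might look roughly like the sketch below. Every name in it, including TransportRouter, grpcRolloutPercent, callViaGrpc and callViaHttp, is hypothetical and only stands in for the reporter's real feature-flag and client code.)

import java.util.concurrent.ThreadLocalRandom;

public class TransportRouter {
    // 0 = all HTTP, 50 = roughly 50-50, 100 = all gRPC
    private volatile int grpcRolloutPercent;

    public TransportRouter(int grpcRolloutPercent) {
        this.grpcRolloutPercent = grpcRolloutPercent;
    }

    public String call(String request) {
        // Per-request decision: route the configured percentage of traffic to gRPC.
        if (ThreadLocalRandom.current().nextInt(100) < grpcRolloutPercent) {
            return callViaGrpc(request);
        }
        return callViaHttp(request);
    }

    // Stand-ins for the real gRPC stub call and HTTP client call.
    private String callViaGrpc(String request) { return "grpc:" + request; }
    private String callViaHttp(String request) { return "http:" + request; }

    // Scaling the flag back to 0 corresponds to sending everything over HTTP again.
    public void setGrpcRolloutPercent(int percent) { this.grpcRolloutPercent = percent; }

    public static void main(String[] args) {
        TransportRouter router = new TransportRouter(50);   // 50-50 rollout
        System.out.println(router.call("ping"));
    }
}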
Not offhand, we'd need to collect some diagnostic files and take a look.
Let me try collecting the data again. Since it has been some time since I last used OpenJ9, can you guide me on what needs to be collected for your debugging purposes, and which commands I should use to get the details?
You can turn on verbose gc logs: -verbose:gc -Xverbosegclog:gclog,5,1000.
https://eclipse.dev/openj9/docs/vgclog/
https://eclipse.dev/openj9/docs/xverbosegclog/
Also, when a hang up is occurring, send kill -QUIT <pid> to the process to generate a javacore*.txt file. Wait about a minute and do it again, wait another minute and do it again. Collect the 3 javacore files, and the gclog.001 through 005 files. Zip them up and either attach them to this public issue, or send them to me privately.
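(Aside: if getting a shell into the affected pod is awkward, OpenJ9 also exposes a programmatic dump API that produces the same javacore files as kill -QUIT. Below is a minimal sketch, assuming the com.ibm.jvm.Dump class is on the runtime; it is OpenJ9/IBM-JDK specific and not available on HotSpot, and the standalone-class framing is purely illustrative rather than something suggested in this thread.)

import com.ibm.jvm.Dump;   // OpenJ9-only API; not present on HotSpot

public class JavacoreCollector {
    public static void main(String[] args) throws InterruptedException {
        // Mirror the kill -QUIT steps above: three javacores, about one minute apart.
        for (int i = 0; i < 3; i++) {
            Dump.JavaDump();       // writes a javacore*.txt file; the location is governed by the JVM's dump settings
            Thread.sleep(60_000);  // wait roughly a minute between dumps
        }
    }
}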
@pshipton I have sent you the mail with the attached dump files.
@pshipton did you happen to find anything on this?
The javacores show
1LKDEADLOCK Deadlock detected !!!
NULL ---------------------
NULL
2LKDEADLOCKTHR Thread "pool-20-thread-121" (0x0000000003EB5200)
3LKDEADLOCKWTR is waiting for:
4LKDEADLOCKOBJ java/util/concurrent/locks/ReentrantLock$NonfairSync@0x00000004A24171F8
3LKDEADLOCKOWN which is owned by:
2LKDEADLOCKTHR Thread "pool-3-thread-1" (0x0000000001320A00)
3LKDEADLOCKWTR which is waiting for:
4LKDEADLOCKMON sys_mon_t:0x000078DC6870B7C8 infl_mon_t: 0x000078DC6870B848:
4LKDEADLOCKOBJ java/util/concurrent/ConcurrentHashMap$Node@0x0000000526F17390
3LKDEADLOCKOWN which is owned by:
2LKDEADLOCKTHR Thread "pool-20-thread-121" (0x0000000003EB5200)
@pshipton that was in the first set of dump files I sent; those were incorrect, as there was a potential deadlock in the code. In the next mail I sent a drive link where the above deadlock is removed and the actual issue is highlighted instead.
I suggest opening a Salesforce case here: https://w3.ibm.com/w3publisher/ibm_runtime_technologies/support/reporting-java-issues
The following link explains how to collect MustGather data:
https://www.ibm.com/support/pages/mustgather-read-first-runtimes-java-technology
@anuragagarwal561994 I don't see anything obviously wrong. Have you tried a profiler to see what it's doing?
That's the thing: the JVM becomes so slow that nothing really works, not even jstack or jmap, so it is strange, hard to observe, and happens on random machines.
Are you entitled to any IBM support?
No, we don't have that; we are just using it as an open source project.
See the IBM information about the diagnostic data to collect. Perhaps something in the list will provide some clues about the problem. If we can't figure it out and you are able to provide to me everything in the list, I can try to get help from IBM support on your behalf, based on the theory there is a problem with OpenJ9 that we want to find and fix.
https://www.ibm.com/support/pages/mustgather-setup-hang-or-performance-problems-linux https://www.ibm.com/support/pages/mustgather-gathering-data-hang-or-performance-problems-linux
We have a Tomcat server running inside OpenJ9, and at pod startup a lot of requests start coming to it.
Since no cache or connections exist initially, the latency is higher and there is some contention at the start, and the connection queue and accept queue start filling up.
Sometimes everything stabilises after a few seconds; otherwise the JVM hangs. When I say the JVM hangs, I mean that even when the system should have stabilised, the JVM has become so slow that any new request I make to this application takes more than 3-5 minutes to complete.
We initially suspected an issue with the thread pools or Tomcat, so we started using another port and then tried a different server running in parallel to Tomcat in the same JVM process, but everything is stuck.
So we tried taking a thread dump using jstack, and it is stuck as well for close to 5-10 minutes. Other commands like jmap are also not working and take a long time.
Then we tried using kill -3 to take a javacore dump; it works instantly and yields a thread dump and other useful information, but we investigated it and didn't see anything wrong with the application itself.
We tried taking multiple dumps a minute apart and also tried keeping the pod like this for more than 2-3 hours, but it never recovers.
We suspected either the OS or the JVM to be the issue here, so we tried the HotSpot JVM and the issue never happens, even after trying multiple times. Even the initial warmup is much faster, so we suspect some issue with the OpenJ9 JVM.
We are using the latest Java, 17.0.8, with OpenJ9 version 0.40.0.
We don't know which version actually introduced these issues, though, since we regularly update to the latest version and the issue has been there for a long time; we only started investigating it now.
The oldest version we tried is 17.0.4; I have to check again which OpenJ9 version it contains.