eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.28k stars 721 forks source link

Snap dumps cause empty thread and GC history sections in javacores #12709

Open paulcheeseman opened 3 years ago

paulcheeseman commented 3 years ago

When a snap dump is generated, the trace buffers are flushed. This appears to be intentional, but there is an undesirable side effect that occurs under certain cirumstances.

When a multithreaded process runs out of Java heap space, several threads can throw OutOfMemoryErrors at almost exactly the same time, and dumps are simultaneously triggered for all these threads. The default Xdump agents for OOMs are configured with priority values that should lead to dumps being triggered in this order:

system dump -> heapdump -> javacore -> snap dump

However, this ordering isn't always respected, presumably because the default agents are not all specified with the serial option. For example, here's some stderr output from a real multithreaded app hitting an OOM on the IBM JRE:

JVMDUMP032I JVM requested System dump using '<redacted>\core.20210514.080812.3712.0001.dmp' in response to an event
JVMDUMP010I System dump written to <redacted>\core.20210514.080812.3712.0001.dmp
JVMDUMP032I JVM requested Heap dump using '<redacted>\heapdump.20210514.080812.3712.0002.phd' in response to an event
JVMDUMP010I Heap dump written to <redacted>\heapdump.20210514.080812.3712.0002.phd
JVMDUMP032I JVM requested Heap dump using '<redacted>\heapdump.20210514.080812.3712.0003.phd' in response to an event
JVMDUMP010I Heap dump written to <redacted>\heapdump.20210514.080812.3712.0003.phd
JVMDUMP032I JVM requested Heap dump using '<redacted>\heapdump.20210514.080812.3712.0004.phd' in response to an event
JVMDUMP010I Heap dump written to <redacted>\heapdump.20210514.080812.3712.0004.phd
JVMDUMP032I JVM requested Heap dump using '<redacted>\heapdump.20210514.080812.3712.0005.phd' in response to an event
JVMDUMP010I Heap dump written to <redacted>\heapdump.20210514.080812.3712.0005.phd
JVMDUMP032I JVM requested Snap dump using '<redacted>\Snap.20210514.080812.3712.0010.trc' in response to an event
JVMDUMP010I Snap dump written to <redacted>\Snap.20210514.080812.3712.0010.trc
JVMDUMP013I Processed dump event "systhrow", detail "java/lang/OutOfMemoryError".
JVMDUMP032I JVM requested Java dump using '<redacted>\javacore.20210514.080812.3712.0006.txt' in response to an event
JVMDUMP010I Java dump written to <redacted>\javacore.20210514.080812.3712.0006.txt
JVMDUMP032I JVM requested Snap dump using '<redacted>\Snap.20210514.080812.3712.0011.trc' in response to an event
JVMDUMP010I Snap dump written to {nothing to snap}
JVMDUMP013I Processed dump event "systhrow", detail "java/lang/OutOfMemoryError".
JVMDUMP032I JVM requested Java dump using '<redacted>\javacore.20210514.080812.3712.0007.txt' in response to an event
JVMDUMP010I Java dump written to <redacted>\javacore.20210514.080812.3712.0007.txt
JVMDUMP032I JVM requested Snap dump using '<redacted>\Snap.20210514.080812.3712.0012.trc' in response to an event
JVMDUMP010I Snap dump written to {nothing to snap}
JVMDUMP013I Processed dump event "systhrow", detail "java/lang/OutOfMemoryError".
JVMDUMP032I JVM requested Java dump using '<redacted>\javacore.20210514.080812.3712.0008.txt' in response to an event
JVMDUMP010I Java dump written to <redacted>\javacore.20210514.080812.3712.0008.txt
JVMDUMP032I JVM requested Snap dump using '<redacted>\Snap.20210514.080812.3712.0013.trc' in response to an event
JVMDUMP010I Snap dump written to {nothing to snap}
JVMDUMP013I Processed dump event "systhrow", detail "java/lang/OutOfMemoryError".
JVMDUMP032I JVM requested Java dump using '<redacted>\javacore.20210514.080812.3712.0009.txt' in response to an event
JVMDUMP010I Java dump written to <redacted>\javacore.20210514.080812.3712.0009.txt
JVMDUMP013I Processed dump event "systhrow", detail "java/lang/OutOfMemoryError".

As you can see, in this case the dumps were produced in this order:

system dump -> heapdump -> heapdump -> heapdump -> heapdump -> snap dump -> javacore -> snap dump -> javacore -> snap dump -> javacore -> snap dump -> javacore

The problem here is that the first snap dump happens before the first javacore. This means that the "GC History" and "Current thread history" sections in the javacore are empty, because the trace buffers have been flushed before the javacore is generated:

1STGCHTYPE     GC History  
NULL   
1XECTHTYPE     Current thread history (J9VMThread:0x0000000003C28400)
NULL

This is not a desirable result. Yes, the snap dump should contain the missing trace data, but I suspect that most users and support teams are much more accustomed to viewing javacore files than processing snap dumps.

I'm not sure why the trace buffers are flushed when snap dumps are generated (or what downsides are associated with not flushing them), but perhaps the current behaviour needs to be reconsidered?

gacholio commented 3 years ago

@keithc-ca

keithc-ca commented 3 years ago

I see five instances of

JVMDUMP013I Processed dump event "systhrow", detail "java/lang/OutOfMemoryError".

in the console output in the description which suggests there are 5 (or more) threads involved. The results are consistent with each thread independently processing dumps in (dump) priority order:

Dumps can be configured for a larger number of java and snap dumps, say range=1..10, which would allow for up to ten threads encountering OutOfMemoryError (nearly) simultaneously without loss of trace buffers.

One might argue that dump priorities should be global instead of per thread, however, that will require some thought about whether we can avoid deadlock in all cases.

paulcheeseman commented 3 years ago

@keithc-ca

Do you have any thoughts about the suggestion of not flushing the trace buffers when taking a snap dump?

keithc-ca commented 3 years ago

From my (perhaps limited) understanding of how tracing is managed, the per-thread trace buffers need to be merged before the information can be written to a file. That 'merging' is like 'flushing'. It's not clear there's an alternative that wouldn't have significant performance implications.