eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.27k stars 721 forks source link

VM build failure - VMState: 0x00050080 or 0x000501ff #9416

Closed M-Davies closed 3 years ago

M-Davies commented 4 years ago

Got this on https://ci.adoptopenjdk.net/view/Failing%20Builds/job/build-scripts/job/jobs/job/jdk11u/job/jdk11u-mac-x64-openj9/573 (build-macstadium-macos1014-1)

* For target support_jmods_java.compiler.jmod:
Assertion failed at /Users/jenkins/workspace/build-scripts/jobs/jdk11u/jdk11u-mac-x64-openj9/workspace/build/src/build/macosx-x86_64-normal-server-release/vm/compiler/../compiler/compile/J9SymbolReferenceTable.cpp:811: containingClass != NULL
VMState: 0x00050080
    failed to get defining class of field ref cpIndex=35 in owning method J9Method=0xc5e3660
compiling jdk/internal/org/objectweb/asm/ClassWriter.<init>(I)V at level: warm

JVMDUMP039I Processing dump event "abort", detail "" at 2020/04/30 00:59:05 - please wait.
JVMDUMP032I JVM requested System dump using '/Users/jenkins/workspace/build-scripts/jobs/jdk11u/jdk11u-mac-x64-openj9/workspace/build/src/make/core.20200430.005905.52000.0001.dmp' in response to an event
JVMDUMP012E Error in System dump: The core file created by child process with pid = 52079 was not found. Expected to find core file with name "/cores/core.52079"
JVMDUMP032I JVM requested Java dump using '/Users/jenkins/workspace/build-scripts/jobs/jdk11u/jdk11u-mac-x64-openj9/workspace/build/src/make/javacore.20200430.005905.52000.0002.txt' in response to an event
JVMDUMP010I Java dump written to /Users/jenkins/workspace/build-scripts/jobs/jdk11u/jdk11u-mac-x64-openj9/workspace/build/src/make/javacore.20200430.005905.52000.0002.txt
JVMDUMP032I JVM requested Snap dump using '/Users/jenkins/workspace/build-scripts/jobs/jdk11u/jdk11u-mac-x64-openj9/workspace/build/src/make/Snap.20200430.005905.52000.0003.trc' in response to an event
JVMDUMP010I Snap dump written to /Users/jenkins/workspace/build-scripts/jobs/jdk11u/jdk11u-mac-x64-openj9/workspace/build/src/make/Snap.20200430.005905.52000.0003.trc
JVMDUMP007I JVM Requesting JIT dump using '/Users/jenkins/workspace/build-scripts/jobs/jdk11u/jdk11u-mac-x64-openj9/workspace/build/src/make/jitdump.20200430.005905.52000.0004.dmp'
JVMDUMP010I JIT dump written to /Users/jenkins/workspace/build-scripts/jobs/jdk11u/jdk11u-mac-x64-openj9/workspace/build/src/make/jitdump.20200430.005905.52000.0004.dmp
JVMDUMP013I Processed dump event "abort", detail "".

* All command lines available in /Users/jenkins/workspace/build-scripts/jobs/jdk11u/jdk11u-mac-x64-openj9/workspace/build/src/build/macosx-x86_64-normal-server-release/make-support/failure-logs.

Access to the machine can be granted by opening an issue at https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues

pshipton commented 3 years ago

https://ci.eclipse.org/openj9/job/Build_JDK11_x86-64_mac_OMR/822

[2021-02-02T21:25:51.888Z] Assertion failed at /Users/jenkins/workspace/Build_JDK11_x86-64_mac_OMR/openj9/runtime/compiler/compile/J9SymbolReferenceTable.cpp:882: containingClass != NULL
[2021-02-02T21:25:51.888Z] VMState: 0x00050080
[2021-02-02T21:25:51.888Z]  failed to get defining class of field ref cpIndex=13 in owning method J9Method=0xaa23900
[2021-02-02T21:25:51.888Z]  TR_ResolvedJ9Method::definingClassFromCPFieldRef=0xaa23e00, TR_ResolvedRelocatableJ9Method::definingClassFromCPFieldRef=0xaa23e00
[2021-02-02T21:25:51.888Z]  fromResolvedJ9Method=0xaa23e00
[2021-02-02T21:25:51.888Z]  TR_UseSymbolValidationManager=0
[2021-02-02T21:25:51.888Z] compiling sun/nio/fs/UnixChannelFactory.newFileChannel(ILsun/nio/fs/UnixPath;Ljava/lang/String;Ljava/util/Set;I)Ljava/nio/channels/FileChannel; at level: warm
Leonardo2718 commented 3 years ago

I've opened #11871 to add some tracing that should help further diagnose this issue.

0xdaryl commented 3 years ago

Moving this out to 0.26. A failure has not been seen since adding the new instrumentation to help us narrow this down, and the workaround has been committed to the 0.25 branch already.

dsouzai commented 3 years ago

Saw this on a PR build:

[2021-02-25T22:39:22.527Z] 0x1dcdaa00: relo::fieldAttrib, 10, 0x1ddcc510, 0x1ddcc6f0
[2021-02-25T22:39:22.527Z] 0x1dcdaa00: relo::storeValRecordIfNec, Found in local list, nothing to do (0x1ddcc510, 10, 0x1ddcc6f0)
[2021-02-25T22:39:22.527Z] 0x1dcdaa00: relo::storeValRecordIfNec, classChain NULL, 0x1ddcc300, 0x1ddcc510, 10, 0x1ddcc6f0
[2021-02-25T22:39:22.527Z] 0x1dcdaa00: relo::definingClassFromCPFieldRef !valid, 0x1ddcc510, 10, 0x1ddcc6f0
[2021-02-25T22:39:22.527Z] End (9728)
[2021-02-25T22:39:22.527Z] Assertion failed at /Users/jenkins/workspace/Build_JDK11_x86-64_mac_Personal/openj9/runtime/compiler/compile/J9SymbolReferenceTable.cpp:945: containingClass != NULL
[2021-02-25T22:39:22.527Z] VMState: 0x000501ff
[2021-02-25T22:39:22.527Z]  failed to get defining class of field ref cpIndex=10 in owning method J9Method=0x1ddcc6f0
[2021-02-25T22:39:22.527Z]  TR_ResolvedJ9Method::definingClassFromCPFieldRef=0x1ddcc300, TR_ResolvedRelocatableJ9Method::definingClassFromCPFieldRef=0x1ddcc300
[2021-02-25T22:39:22.527Z]  fromResolvedJ9Method=0x1ddcc300
[2021-02-25T22:39:22.527Z]  TR_UseSymbolValidationManager=0
[2021-02-25T22:39:22.527Z] compiling sun/nio/ch/FileChannelImpl.<init>(Ljava/io/FileDescriptor;Ljava/lang/String;ZZZLjava/lang/Object;)V at level: warm
[2021-02-25T22:39:22.527Z] 

Unfortunately, the PR that added this tracing missed a change that meant we didn't trace anything in rememberClass.... sigh.

The tracing output that did show tells us is that we got the class chain the first time, and failed to get the class chain the second time. Therefore, either the ROMClass stopped being in the the SCC or findChainForClass returned NULL. The ROMClass not being in the SCC would be kind of crazy since this is is happening during the build and not in an HCR test, though this is a legitimate reason for why we could get NULL in one query and non-NULL in another. So it seems like

   sharedCacheConfig()->findSharedData(vmThread,
                                       key,
                                       keyLength,
                                       J9SHR_DATA_TYPE_AOTCLASSCHAIN,
                                       FALSE,
                                       &dataDescriptor,
                                       NULL);

is returning NULL in one instance and non-NULL in another. Will have to enable the tracing to know for sure. ~I'll open a PR to ensure tracing happens in rememberClass~ Opened https://github.com/eclipse/openj9/pull/12058.

That said... as I mentioned above, with HCR it is actually possible for us to get non-NULL the first time and NULL the second time. It's possible for the class of the method we're compiling to get redefined and the compilation wouldn't be interrupted till we explicitly checked, which would not happen between the two queries. So, we're legitimately going to need to handle the possibility a null class chain.

pshipton commented 3 years ago

I found a failure in an internal build. No diagnostic files though. Created https://github.com/AdoptOpenJDK/openjdk-build/issues/2528 job/build-scripts/job/jobs/job/jdk11u/job/jdk11u-mac-x64-openj9-IBM/7

22:10:39  0x148f3a00: relo::fieldAttrib, 11, 0x149ccf70, 0x149cdcf8
22:10:39  0x148f3a00: findChainForClass 0x148f3a00, 0x149ce200, 0x3125b3a0, 0x424c9288
22:10:39  0x148f3a00: rememberClass: chain exists (0x424c9288) so nothing to store, 0x149ce200, 0x3125b3a0
22:10:39  0x148f3a00: relo::storeValRecordIfNec, Created new AOT class info 0xd3b4f80
22:10:39  0x148f3a00: findChainForClass 0x148f3a00, 0x149ce200, 0x3125b3a0, 0x0
22:10:39  0x148f3a00: rememberClass: unable to store chain 0x148f3a00, 0x3125b3a0
22:10:39  0x148f3a00: relo::storeValRecordIfNec, classChain NULL, 0x149ce200, 0x149ccf70, 11, 0x149cdcf8
22:10:39  0x148f3a00: relo::definingClassFromCPFieldRef !valid, 0x149ccf70, 11, 0x149cdcf8
22:10:39  End (9216)
22:10:39  Assertion failed at /Users/jenkins/workspace/build-scripts/jobs/jdk11u/jdk11u-mac-x64-openj9-IBM/workspace/build/src/openj9/runtime/compiler/compile/J9SymbolReferenceTable.cpp:929: containingClass != NULL
22:10:39  VMState: 0x000501ff
22:10:39    failed to get defining class of field ref cpIndex=11 in owning method J9Method=0x149cdcf8
22:10:39    TR_ResolvedJ9Method::definingClassFromCPFieldRef=0x149ce200, TR_ResolvedRelocatableJ9Method::definingClassFromCPFieldRef=0x149ce200
22:10:39    fromResolvedJ9Method=0x149ce200
22:10:39    TR_UseSymbolValidationManager=0
22:10:39  compiling sun/nio/fs/UnixChannelFactory.newFileChannel(ILsun/nio/fs/UnixPath;Ljava/lang/String;Ljava/util/Set;I)Ljava/nio/channels/FileChannel; at level: warm
22:10:39  
22:10:39  Unhandled exception
22:10:39  Type=Unhandled trap vmState=0x000501ff
22:10:39  J9Generic_Signal_Number=00000108 Signal_Number=00000005 Error_Value=00000000 Signal_Code=00000001
22:10:39  Handler1=0000000006436290 Handler2=0000000004B7AAE0
22:10:39  RDI=000000000000A403 RSI=0000000000000005 RAX=0000000000000000 RBX=0000700005D79000
22:10:39  RCX=0000700005D6EED8 RDX=0000000000000000 R8=0000000000000017 R9=0000000000000017
22:10:39  R10=0000000000000000 R11=0000000000000287 R12=000000000000A403 R13=000000000D270980
22:10:39  R14=0000000000000005 R15=000000000000002D
22:10:39  RIP=00007FFF618E02C6 GS=0000 FS=0000 RSP=0000700005D6EED8
22:10:39  RFlags=0000000000000286 CS=0007 RBP=0000700005D6EF10 ERR=06EA481002000148
22:10:39  TRAPNO=0200014800000085 CPU=4810020001480000 FAULTVADDR=0000000006EA4810
22:10:39  XMM0 ffff000000000000 (f: 0.000000, d: nan)
22:10:39  XMM1 0000000000000000 (f: 0.000000, d: 0.000000e+00)
22:10:39  XMM2 0000000000000000 (f: 0.000000, d: 0.000000e+00)
22:10:39  XMM3 0000000000000000 (f: 0.000000, d: 0.000000e+00)
22:10:39  XMM4 0000000000000000 (f: 0.000000, d: 0.000000e+00)
22:10:39  XMM5 0000000000000000 (f: 0.000000, d: 0.000000e+00)
22:10:39  XMM6 0000000000000000 (f: 0.000000, d: 0.000000e+00)
22:10:39  XMM7 0000000000000000 (f: 0.000000, d: 0.000000e+00)
22:10:39  XMM8 0000000000000000 (f: 0.000000, d: 0.000000e+00)
22:10:39  XMM9 0000000000000000 (f: 0.000000, d: 0.000000e+00)
22:10:39  XMM10 0000000000000000 (f: 0.000000, d: 0.000000e+00)
22:10:39  XMM11 0000000000000000 (f: 0.000000, d: 0.000000e+00)
22:10:39  XMM12 0000000000000000 (f: 0.000000, d: 0.000000e+00)
22:10:39  XMM13 0000000000000000 (f: 0.000000, d: 0.000000e+00)
22:10:39  XMM14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
22:10:39  XMM15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
22:10:39  Module=/usr/lib/system/libsystem_kernel.dylib
22:10:39  Module_base_address=00007FFF618D9000 Symbol=__pthread_kill
22:10:39  Symbol_address=00007FFF618E02BC
22:10:39  
22:10:39  Method_being_compiled=sun/nio/fs/UnixChannelFactory.newFileChannel(ILsun/nio/fs/UnixPath;Ljava/lang/String;Ljava/util/Set;I)Ljava/nio/channels/FileChannel;
22:10:39  Target=2_90_20210311_7 (Mac OS X 10.14.6)
22:10:39  CPU=amd64 (12 logical CPUs) (0x400000000 RAM)
22:10:39  ----------- Stack Backtrace -----------
22:10:39  ---------------------------------------
dsouzai commented 3 years ago

The place where findChainForClass is called is:

https://github.com/eclipse/openj9/blob/269003cc717d8fb7191426c71bcbafa86b296c15/runtime/compiler/env/J9SharedCache.cpp#L884-L901

The key that's passed into findChainForClass comes from romClass (which is also passed in as a param). The first time it's called:

22:10:39  0x148f3a00: findChainForClass 0x148f3a00, 0x149ce200, 0x3125b3a0, 0x424c9288

where the last value 0x424c9288 is the class chain returned by the findSharedData API, and second-last value 0x3125b3a0 is the romClass. This message gets printed out after the call to findSharedData: https://github.com/eclipse/openj9/blob/269003cc717d8fb7191426c71bcbafa86b296c15/runtime/compiler/env/J9SharedCache.cpp#L1135-L1150

The second time findChainForClass is called:

22:10:39  0x148f3a00: findChainForClass 0x148f3a00, 0x149ce200, 0x3125b3a0, 0x0

notice that the last value is 0x0 even though we're passing in the same romClass 0x3125b3a0. Because we're passing in the same romClass, the key that gets generated will be the same. Therefore, findSharedData is returning different values.

As such, I don't think this is a JIT issue.

fyi @hangshao0

hangshao0 commented 3 years ago

The only possible cases that findSharedData() could return 0 for classChian are:

  1. Failed to startup the ByteDataManager. https://github.com/eclipse/openj9/blob/269003cc717d8fb7191426c71bcbafa86b296c15/runtime/shared_common/CacheMap.cpp#L4050-L4052

    As the first time classChain is found, it is already started up. Won't be caused by this.

  2. Lockhash Table failed: https://github.com/eclipse/openj9/blob/269003cc717d8fb7191426c71bcbafa86b296c15/runtime/shared_common/Manager.cpp#L414-L418

    It returns false after failed ON MONITOR_ENTER_RETRY_TIMES (10) attempts. https://github.com/eclipse/openj9/blob/269003cc717d8fb7191426c71bcbafa86b296c15/runtime/shared_common/Manager.cpp#L472-L484

    I'm not sure if there is anything we can do in this case.

hangshao0 commented 3 years ago

The documentation of j9shr_findSharedData() is: https://github.com/eclipse/openj9/blob/269003cc717d8fb7191426c71bcbafa86b296c15/runtime/shared_common/shrinit.cpp#L1792-L1795

It won't be able to return the record in an error case.

dsouzai commented 3 years ago

Thanks @hangshao0

@0xdaryl @Leonardo2718 I think the assert should just be removed:

  1. In AOT it's possible to get different values based on the semantics of the findSharedData API.
  2. HCR means that even without AOT, it's possible to get different values (specifically non-NULL one time and NULL every time after) if classes get redefined; because J9Classes are redefined in place, we can end up with a different J9ROMClass that's not necessarily going to be in the SCC.
0xdaryl commented 3 years ago

If the assert is removed (implying it is invalid), will the subsequent code work as expected assuming that containingClass is NULL?

https://github.com/eclipse/openj9/blob/a9838e74bba0f5935103bedcda6fc84c881804ca/runtime/compiler/compile/J9SymbolReferenceTable.cpp#L940

Leonardo2718 commented 3 years ago

@0xdaryl

If the assert is removed (implying it is invalid), will the subsequent code work as expected assuming that containingClass is NULL?

I don't think simply removing the assert is valid. If containingClass can be null, it can cause the JIT to think two fields from different classes are actually the same field. At the very least, I think we should abort compilation as we've been doing in the release branches.

However, I believe there's a deeper issue as well.

  1. In AOT it's possible to get different values based on the semantics of the findSharedData API.

So, to me, it seems pretty reasonable to assume that once something is resolved it stays resolved (ignoring the HCR issue for a minute). If findSharedData can cause us to violate that assumption (or make it "look like" we did), then similar issues as this could impact other parts of the JIT that make a similar assumption, perhaps in even more subtle ways if there are no asserts.

  1. HCR means that even without AOT, it's possible to get different values (specifically non-NULL one time and NULL every time after) if classes get redefined; because J9Classes are redefined in place, we can end up with a different J9ROMClass that's not necessarily going to be in the SCC.

Redefining VM structures like this from under the JIT's nose seems rather dangerous. I would think that we would probably want to abort all current compilations if a class gets redefined (at least all the compilations that might try to look at the class that got redefined). Otherwise, for similar reasons as above, assumptions made in the JIT about the stability of the data structures could be violated and lead to a crash (even if the compiled body is thrown away in the end, the JIT could crash before compilation completes).

dsouzai commented 3 years ago

A way to deal with the HCR problem, in this specific case anyway, is to check whether the comp should be interrupted before the assert - If it does then we abort the compilation; if it doesn't then the assert will only fire when we hit an inconsistent state.

One way for us to deal with the fact that the SCC API could return NULL even if it returned non-NULL previously is to cache the class chain for each J9Class. There exists some (currently inactive) infrastructure to cache the validation result of a class chain for each J9Class - this is only used in the AOT load run. I believe we can piggy-back off this infra to also just cache the class chain per J9Class.

That said, the infra uses the CH Table; this is because we already have a whole bunch of code that ensures that the CH Table maintains coherence through all of the runtime events (such as class loading/unloading, redefinition, etc.). Therefore, if we were to piggy back off the aforementioned infra, we would depend on the existence of the CHTable. Currently we disable the maintenance of the table if we disable CHOpts for whatever reason. However, there's no reason why we can't maintain the table even if we're not going to make use of it.

Therefore, if we can ensure that the CHTable will always be coherent, then we can cache the classchain and not have to worry about whether or not we're going to incorrectly fail an AOT validation we passed earlier. However, I don't have the time to do this right now; I can open an issue to describe exactly what has to be done, but it's very unlikely I'll get to it in the next month or so.

Leonardo2718 commented 3 years ago

So, given there doesn't appear to be an easy way to fix the underlying issue, I've opened #12259 to change the assert into a failCompilation() call and remove the diagnostic instrumentation (this is the same fix we've been putting in release branches) and #12264 and #12265 to track the work of resolving the deeper problem.