ScottG489 / conjob

Simple web interface to run containers as jobs or serverless functions
MIT License

Fix unit test JRE crash #38

Open ScottG489 opened 3 years ago

ScottG489 commented 3 years ago

The build seems to consistently fail at the same point.

TempSecretsFileUtilTest > createSecretsFile STANDARD_OUT
    timestamp = 2021-05-25T05:27:25.951897, TempSecretsFileUtilTest:createSecretsFile = 
                                  |-------------------jqwik-------------------
    tries = 1000                  | # of calls to property
    checks = 1000                 | # of not rejected calls
    generation = RANDOMIZED       | parameters are randomly generated
    after-failure = PREVIOUS_SEED | use the previous seed
    when-fixed-seed = ALLOW       | fixing the random seed is allowed
    edge-cases#mode = MIXIN       | edge cases are mixed in
    edge-cases#total = 4          | # of all combined edge cases
    edge-cases#tried = 4          | # of edge cases tried in current run
    seed = 8450192231027384562    | random seed to reproduce generated values

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (sharedRuntime.cpp:1261), pid=474, tid=494
#  guarantee((retry_count++ < 100)) failed: Could not resolve to latest version of redefined method
#
# JRE version: OpenJDK Runtime Environment (11.0.11+9) (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
# Java VM: OpenJDK 64-Bit Server VM (11.0.11+9-Ubuntu-0ubuntu2.20.04, mixed mode, sharing, tiered, compressed oops, serial gc, linux-amd64)
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /opt/build/conjob/core.474)
#
# An error report file with more information is saved as:
# /opt/build/conjob/hs_err_pid474.log
#
# If you would like to submit a bug report, please visit:
#   https://bugs.launchpad.net/ubuntu/+source/openjdk-lts
#

ConfigTaskTest > Given a conjob configuration, and new config values, when updating the config with new values, then fields in the config should be updated with new values, and fields not updated should be the same as the originals. SKIPPED

> Task :unitTest FAILED
:unitTest (Thread[Daemon worker,5,main]) completed. Took 1 mins 43.959 secs.
Closing Git repo: /opt/build/conjob/.git

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':unitTest'.
> Process 'Gradle Test Executor 1' finished with non-zero exit value 134
  This problem might be caused by incorrect test process configuration.
  Please refer to the test execution section in the User Manual at https://docs.gradle.org/6.7/userguide/java_testing.html#sec:test_execution

* Try:
Run with --stacktrace option to get the stack trace. Run with --debug option to get more log output. Run with --scan to get full insights.

* Get more help at https://help.gradle.org

BUILD FAILED in 3m 39s
10 actionable tasks: 10 executed

Unfortunately, it's only reproducible on the build server, which is bad because the whole point is to have parity. In any case, a few ideas to fix this:

- Upgrade the JDK
- Report the issue to jqwik
- Modify the code to debug the issue

ScottG489 commented 3 years ago

Here are my thoughts on each of the bulleted ideas above.

Upgrading JDK

Upgrading to JDK 13 seemed to fix the issue. I'd rather not upgrade the JDK if I can avoid it, but I don't think I have any good reason not to. It's probably a good idea to keep up with the latest version, just like any other dependency.

Reporting issue to jqwik

It doesn't seem to be an issue with jqwik, because if I let the failing tests run with their bodies commented out, they pass. To me that means jqwik is doing everything the same and only my test code isn't running, so jqwik on its own isn't the cause. However, there seems to be some interplay with the rest of the test suite, because if only the failing tests are run (--tests) they pass. So jqwik isn't completely ruled out. I'll have to experiment with the rest of the test suite to find out what causes the offending tests to fail only when run with the whole suite.

Modify code to debug issue

Removing TempSecretsFileUtilTest didn't seem to have an effect, but removing ConfigTaskTest fixed the issue. If either of its tests is enabled, the failure is reproduced. The failure itself seems to occur in ConfigTask.execute(). Specifically, the offending code is somewhere here: String originalConfig = configFieldMethods.entrySet().stream().map(configEntry ->. Line 53 (in the lambda) is never reached.
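For anyone trying to reproduce this outside the project, here is a minimal sketch of the general shape of that statement: a map of field names to getters, streamed and invoked inside a lambda. Everything in it is an assumption for illustration. SampleConfig, ConfigSnapshotSketch, and the idea that configFieldMethods maps names to java.lang.reflect.Method objects are hypothetical and not taken from ConfigTask.

```java
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical stand-in for the real config class; not taken from conjob.
final class SampleConfig {
    private final String username = "user";
    private final long maxTimeoutSeconds = 30;

    public String getUsername() { return username; }
    public long getMaxTimeoutSeconds() { return maxTimeoutSeconds; }
}

final class ConfigSnapshotSketch {
    // Maps a field name to its getter (assumed shape of configFieldMethods).
    static Map<String, Method> fieldMethods() {
        try {
            Map<String, Method> methods = new LinkedHashMap<>();
            methods.put("username", SampleConfig.class.getMethod("getUsername"));
            methods.put("maxTimeoutSeconds",
                    SampleConfig.class.getMethod("getMaxTimeoutSeconds"));
            return methods;
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }

    // Streams the entries and invokes each getter reflectively inside a lambda.
    static String snapshot(SampleConfig config, Map<String, Method> methods) {
        return methods.entrySet().stream()
                .map(entry -> {
                    try {
                        return entry.getKey() + "=" + entry.getValue().invoke(config);
                    } catch (ReflectiveOperationException e) {
                        throw new IllegalStateException(e);
                    }
                })
                .collect(Collectors.joining(", "));
    }
}
```

Calling ConfigSnapshotSketch.snapshot(new SampleConfig(), ConfigSnapshotSketch.fieldMethods()) exercises the same stream-plus-lambda pattern in isolation, which may help tell whether the pattern itself or the surrounding test suite triggers the crash.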

We define anonymous functions in this class, and the error seems to have something to do with redefining methods. The JDK code that seems to blow up is in JDK 11's src/hotspot/share/runtime/sharedRuntime.cpp. Here is the method with the relevant comment:

// Resolves a call.
methodHandle SharedRuntime::resolve_helper(JavaThread *thread,
                                           bool is_virtual,
                                           bool is_optimized, TRAPS) {
  methodHandle callee_method;
  callee_method = resolve_sub_helper(thread, is_virtual, is_optimized, THREAD);
  if (JvmtiExport::can_hotswap_or_post_breakpoint()) {
    int retry_count = 0;
    while (!HAS_PENDING_EXCEPTION && callee_method->is_old() &&
           callee_method->method_holder() != SystemDictionary::Object_klass()) {
      // If has a pending exception then there is no need to re-try to
      // resolve this method.
      // If the method has been redefined, we need to try again.
      // Hack: we have no way to update the vtables of arrays, so don't
      // require that java.lang.Object has been updated.

      // It is very unlikely that method is redefined more than 100 times
      // in the middle of resolve. If it is looping here more than 100 times
      // means then there could be a bug here.
      guarantee((retry_count++ < 100),
                "Could not resolve to latest version of redefined method");
      // method is redefined in the middle of resolve so re-try.
      callee_method = resolve_sub_helper(thread, is_virtual, is_optimized, THREAD);
    }
  }
  return callee_method;
}
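
Note that the retry loop sits inside the JvmtiExport::can_hotswap_or_post_breakpoint() check, so as far as I can tell it only runs when the JVM has class redefinition or breakpoint support enabled, which usually means some agent is attached. A minimal sketch for listing agents passed on the command line of the test JVM follows. AgentArgsSketch is a hypothetical helper, and this is an assumption about where to look: agents attached dynamically at runtime would not appear in these arguments.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.stream.Collectors;

final class AgentArgsSketch {
    // Keeps only the JVM arguments that attach an agent at startup.
    static List<String> agentArgs(List<String> jvmArgs) {
        return jvmArgs.stream()
                .filter(arg -> arg.startsWith("-javaagent:")
                        || arg.startsWith("-agentlib:")
                        || arg.startsWith("-agentpath:"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Input arguments of the JVM this code is running in.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("Agent arguments: " + agentArgs(jvmArgs));
    }
}
```

Running this from inside a test on both the build server and a local machine would show whether the two environments start the test JVM with different agents.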

Next steps

I think we should refactor ConfigTask.java and break it up into smaller classes and test those individually. Hopefully splitting up those tests will ease whatever issue is happening here.

If that doesn't work we'll have to circle back on this. After that, we might want to experiment with running a subset of tests to see what causes the JRE crash only when the entire test suite is run. One idea is to see how many tests we can run before the crash: I think tests always run in the same order, so we could disable the tests after the crash point and then binary search for the number of tests needed to cause the crash. Another is to vary the number of tries in the tests, such as reducing them all to tries = 1. If the number of tries seems to affect it, then it might be a good idea to open an issue with jqwik. Not that I think there's a bug in their lib, but to get some feedback on whether they think using jqwik might be exacerbating things to cause this issue.
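
The binary search idea can be sketched as below. This is a minimal sketch under two assumptions: the test order is stable, and if the first n tests crash the JVM then any longer prefix crashes too. CrashBisectSketch and the crashesWithFirst predicate are hypothetical names. In practice the predicate would run the build with only the first n tests enabled and report whether the JVM crashed.

```java
import java.util.function.IntPredicate;

final class CrashBisectSketch {
    // Returns the smallest n in [1, total] for which running the first n tests
    // crashes, or -1 if even the full suite does not crash.
    static int smallestCrashingPrefix(int total, IntPredicate crashesWithFirst) {
        if (total < 1 || !crashesWithFirst.test(total)) {
            return -1;
        }
        int lo = 1;
        int hi = total;
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            if (crashesWithFirst.test(mid)) {
                hi = mid;      // crash reproduced, so the answer is mid or earlier
            } else {
                lo = mid + 1;  // no crash, so more tests are needed
            }
        }
        return lo;
    }
}
```

For example, smallestCrashingPrefix(40, n -> n >= 17) returns 17 after a handful of runs instead of up to 40. If the crash turns out to be flaky, each probe would need to be repeated before trusting the result.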

ScottG489 commented 3 years ago

Code in ConfigTask was extracted into dependency classes and tests were refactored and added accordingly. ConfigStore now has the bit of code which seems to be causing this problem (from previous comment):

Specifically, the offending code happens somewhere here: String originalConfig = configFieldMethods.entrySet().stream().map(configEntry ->. Line 53 (in the lambda) is never reached.

The test for the particular bit of code causing the problem is still reproducing it. For now I am going to just comment out this code. We'll lack unit test coverage there, but we should be able to get it covered with integration tests, and it's already covered with acceptance tests.

As discussed in the previous comment, the solution here may come when we upgrade Java versions.

Next steps