eclipse-vertx / vert.x

Vert.x is a tool-kit for building reactive applications on the JVM
http://vertx.io
Other
14.32k stars 2.08k forks source link

Correlate block thread checker with GC pauses #2449

Open vietj opened 6 years ago

vietj commented 6 years ago

The block thread checker sometimes reports GC pauses as misbehaving blocking application.

http://www.fasterj.com/articles/gcnotifs.shtml

The GC mbean emits notification about GC pauses (start time / end time) that could be used to mitigate this and avoid false positive.

vietj commented 6 years ago

another idea would be to have a dedicated thread that does a periodic volatile write of a timestamp, similar to what jhiccup does https://github.com/giltene/jHiccup/blob/master/src/main/java/org/jhiccup/HiccupMeter.java#L456

vietj commented 6 years ago

actually the BlockedThreadChecker uses a java.util.Timer , so instead it could use its own thread and determine when it really measure something useful, i.e if it has not been able to update a timestamp recently (i.e the last timestamp was old, there is a gap) then it means that it would report a false negative

KowalczykBartek commented 6 years ago

But, do you want to omit such message for thread blocked by GC pause or just hint user that this pause was (probably) caused by GC cycle ? I would (as Vert.x user) expect second option, but maybe others have different point of view ?

vietj commented 6 years ago

@KowalczykBartek

KowalczykBartek commented 6 years ago

ok, you are perfectly right @vietj about that vertx should not care if my GC cannot keep up :)

but, long pause don't have to be caused by full GC :)

P.S I have only one note to first solution (mbean based) - this start and stop time includes only GC, it do not include, how long it took to reach safepoint, so, this should be somehow checked also (long TTS is of course rare, but still possible :) ) - second approach mitigates this (I think so :D )

vietj commented 6 years ago

do you mean @KowalczykBartek systemic pauses ?

in all cases, what I mean is that if there is a pause that suspends all JVM threads, then the application should not be pointed out at the faulty one.

I think the solution based on jhicup way is better because it can be more accurate.

KowalczykBartek commented 6 years ago

@vietj , I mean : http://blog.ragozin.info/2012/10/safepoints-in-hotspot-jvm.html http://psy-lob-saw.blogspot.com/2014/03/where-is-my-safepoint.html

KowalczykBartek commented 6 years ago

Hi @vietj :)

do you see this second option (wit separated thread, similar to what jHiccup does), in (more or less) following way ?

while (true) {

  TimeUnit.NANOSECONDS.sleep(interval);

  final long timeAfterMeasurement = System.nanoTime();
  final long deltaTimeNsec = timeAfterMeasurement - lastTimeBeforeMeasurement;
  lastTimeBeforeMeasurement = timeAfterMeasurement;

  //by first run, deltaTimeNsec can be < 0 

  long hiccup = deltaTimeNsec - interval;

  if(hiccup > SOME_VALUE)
  {
    // SHOULD WE SKIP THIS CHECK ? 
  }
 (1)
  long now = System.nanoTime();
  for (VertxThread thread : threads.keySet()) {
    //OLD STUFF RESPONSIBLE FOR BLOCKER THREAD REPORTING
   (2)
  }
}

By this, I mean, do you want to just skip check because detected hiccup or make this measurement more detailed ?

P.S of course, this can also log false positive (so, pause generated by JVM/OS, not application business logic itself), because, for example, GC can take place between (1) and (2), because there is a lot of safepoints, that could potentially allow JVM to perform GC. P.S.2 because point above, this timeAfterMeasurement should be calculdated after for loop, because GC there can happen and next delay can be incorrect.

vietj commented 6 years ago

I think we need to have tests for it, but basically the idea is that

  1. the thread periodically (1 millisecond) samples itself
  2. every N samples (block period check / 1 millis) check the vertx threads

uses the information samples in 1. to determine or not if it's a false positive or not.

basically if during the past 10 seconds, we have not been able to samples (10 s / 1 ms) time it means there is an issue.

KowalczykBartek commented 6 years ago

I am still looking for inspiration :D maybe you can help me shape it somehow :)

first, I started from block-period-check, based on (block period check / 1 millis) iterations,

final long resolutionNsec = (long) (1.0 * 1000L * 1000L);

final long blockCheckPeriod = resolutionNsec * 1000;

try {
    long countedWakeup = 0;
    long expectedWakeup = 1000;

    long start = System.nanoTime();
    long stop;

    while (true) {

        //make a sleep
        TimeUnit.NANOSECONDS.sleep(resolutionNsec);

        ++countedWakeup;

        if (countedWakeup == expectedWakeup) {

            stop = System.nanoTime();

            long finalBlockCheckPeriod = stop - start;

            long jitter = (finalBlockCheckPeriod - blockCheckPeriod);

            System.err.println("check period : " + finalBlockCheckPeriod);
            System.err.println("jitter : " + jitter);

            countedWakeup = 0;
            start = System.nanoTime();
        }
    }
} catch (InterruptedException e) {
    //...
}

after each such iteration, I can calculate Jitter - this jitter should have some average value calculated after each iteration, but of course from time to time - when something will block for longer period (full GC ? TTS hiccup or OS problem?) thread-block-check should be skipped, but for this I guess some histogram should be delivered as well (or maybe some configurable value for this jitter max value should be used? **), what do you think ?

** - this histogram or configurable value, would be used to determine if check should be skipped or not.

vietj commented 6 years ago

@KowalczykBartek I will have a look when possible because this requires some brainstorming.

in the mean time perhaps you can try to write a test that shows false positive with the actual checker.

somebody reported a false positive recently on the mailing list Thread is blocked on unknown issue I believe that this is a false positive, it has a 100ms max blocking value but the checker seems to with the default behavior (which is every 1000ms), the idea would be to write a problem that blocks for 100+ms in a vertx task and see if the reported check is not the actual trace but a different trace like in the group thread.