census-instrumentation / opencensus-java

A stats collection and distributed tracing framework
https://opencensus.io
Apache License 2.0
672 stars, 201 forks

Clock Skew(?) can cause the disruptor thread to crash #2068

Closed steveniemitz closed 3 years ago

steveniemitz commented 3 years ago

Please answer these questions before submitting a bug report.

What version of OpenCensus are you using?

0.24.0, but looks to be present in all versions

What JVM are you using (java -version)?

1.8.0

Occasionally, at process start, we'll see this error in our logs:

Exception in thread "OpenCensus.Disruptor-0" java.lang.RuntimeException: java.lang.IllegalArgumentException: Current time must be within or after the last bucket.
    at com.lmax.disruptor.FatalExceptionHandler.handleEventException(FatalExceptionHandler.java:45)
    at com.lmax.disruptor.dsl.ExceptionHandlerWrapper.handleEventException(ExceptionHandlerWrapper.java:18)
    at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:187)
    at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Current time must be within or after the last bucket.
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:141)
    at io.opencensus.implcore.stats.MutableViewData$IntervalMutableViewData.refreshBucketList(MutableViewData.java:308)
    at io.opencensus.implcore.stats.MutableViewData$IntervalMutableViewData.record(MutableViewData.java:262)
    at io.opencensus.implcore.stats.MeasureToViewMap.record(MeasureToViewMap.java:160)
    at io.opencensus.implcore.stats.StatsManager$StatsEvent.process(StatsManager.java:101)
    at io.opencensus.impl.internal.DisruptorEventQueue$DisruptorEventHandler.onEvent(DisruptorEventQueue.java:229)
    at io.opencensus.impl.internal.DisruptorEventQueue$DisruptorEventHandler.onEvent(DisruptorEventQueue.java:222)
    at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:168)
    ... 2 more

If this happens, all publisher threads will eventually hang forever, because nothing is left to consume the LMAX Disruptor queue. The cause appears to be clock skew that makes a measurement's timestamp fall before the start of the last bucket, and there is even a TODO in the code to handle this: https://github.com/census-instrumentation/opencensus-java/blob/master/impl_core/src/main/java/io/opencensus/implcore/stats/MutableViewData.java#L307
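To make the hang concrete, here is a minimal sketch of a single consumer that survives a bad event. It uses only the standard library: a bounded `ArrayBlockingQueue` stands in for the Disruptor ring buffer, and `ResilientConsumer` and all of its methods are hypothetical names for illustration, not OpenCensus or Disruptor APIs. The per-event try/catch is the assumption being illustrated.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical bounded queue with one consumer thread; a stand-in, not the Disruptor. */
final class ResilientConsumer {
  private final BlockingQueue<Runnable> queue;
  private final AtomicInteger processed = new AtomicInteger();
  private final AtomicInteger failures = new AtomicInteger();

  ResilientConsumer(int capacity) {
    this.queue = new ArrayBlockingQueue<>(capacity);
  }

  /** Starts the single daemon consumer thread. */
  void start() {
    Thread worker = new Thread(this::drain, "consumer-sketch");
    worker.setDaemon(true);
    worker.start();
  }

  private void drain() {
    try {
      while (true) {
        Runnable event = queue.take();
        try {
          event.run();
          processed.incrementAndGet();
        } catch (RuntimeException e) {
          // Count the failure and keep consuming, so that one bad event
          // does not end the only thread that empties the queue.
          failures.incrementAndGet();
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  /** Offers an event, waiting up to timeoutMillis for space; false means timed out or interrupted. */
  boolean publish(Runnable event, long timeoutMillis) {
    try {
      return queue.offer(event, timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  /** Waits until at least count events were handled (successfully or not) or the timeout elapses. */
  boolean awaitHandled(int count, long timeoutMillis) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    while (processed.get() + failures.get() < count) {
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      try {
        Thread.sleep(5);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return true;
  }

  int processedCount() {
    return processed.get();
  }

  int failureCount() {
    return failures.get();
  }
}
```

Without the inner try/catch, the first `RuntimeException` would end `drain()`, the queue would fill up, and every later `publish` would time out (or block forever if it used `put`), which is roughly the behavior we see from the publisher threads.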

I'm not sure what the correct behavior should be when the measurement timestamp is before the bucket start, though. Possibly drop the event?
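As a rough sketch of the "drop the event" option, assuming fixed-width buckets: the class below is a simplified, hypothetical stand-in (`SkewTolerantBuckets` and its methods are not the `MutableViewData` code). It counts and drops a measurement whose timestamp precedes the newest bucket instead of throwing.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for an interval view's bucket list; not the OpenCensus implementation. */
final class SkewTolerantBuckets {
  private final long bucketDurationMillis;
  private final int maxBuckets;
  // Each entry is {bucketStartMillis, recordedCount}; the newest bucket is last.
  private final Deque<long[]> buckets = new ArrayDeque<>();
  private long dropped;

  SkewTolerantBuckets(long startMillis, long bucketDurationMillis, int maxBuckets) {
    if (bucketDurationMillis <= 0 || maxBuckets <= 0) {
      throw new IllegalArgumentException("duration and maxBuckets must be positive");
    }
    this.bucketDurationMillis = bucketDurationMillis;
    this.maxBuckets = maxBuckets;
    buckets.addLast(new long[] {startMillis, 0});
  }

  /** Returns true if recorded, false if dropped because the timestamp precedes the newest bucket. */
  boolean record(long nowMillis) {
    long[] newest = buckets.peekLast();
    if (nowMillis < newest[0]) {
      // Clock went backwards relative to the newest bucket: drop instead of throwing.
      dropped++;
      return false;
    }
    // Roll forward one bucket at a time until nowMillis falls inside the newest bucket.
    // A real implementation would likely skip ahead directly after a long gap.
    while (nowMillis - newest[0] >= bucketDurationMillis) {
      newest = new long[] {newest[0] + bucketDurationMillis, 0};
      buckets.addLast(newest);
      if (buckets.size() > maxBuckets) {
        buckets.removeFirst();
      }
    }
    newest[1]++;
    return true;
  }

  long droppedCount() {
    return dropped;
  }

  long newestBucketStart() {
    return buckets.peekLast()[0];
  }

  long newestBucketCount() {
    return buckets.peekLast()[1];
  }
}
```

Dropping loses a data point but keeps the consumer alive. Another option would be to clamp the timestamp to the newest bucket's start, so the measurement is still counted. I don't know which one fits the intended semantics of interval views.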