apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0
5.48k stars 1.28k forks source link

Tracing should not rely on the system clock #7414

Open richardstartin opened 3 years ago

richardstartin commented 3 years ago

There's code throughout the codebase which performs tracing of sensitive operations, such as in BaseOperator:

  @Override
  public final T nextBlock() {
    if (Thread.interrupted()) {
      throw new EarlyTerminationException();
    }
    if (TraceContext.traceEnabled()) {
      long start = System.currentTimeMillis();
      T nextBlock = getNextBlock();
      long end = System.currentTimeMillis();
      String operatorName = getOperatorName();
      LOGGER.trace("Time spent in {}: {}", operatorName, (end - start));
      TraceContext.logTime(operatorName, (end - start));
      return nextBlock;
    } else {
      return getNextBlock();
    }
  }

Calls to System.currentTimeMillis() call in to the system clock, so are subject to NTP/PTP adjustments, meaning that (end - start) can be negative or misleadingly long. System.nanoTime() should be used instead, and converted to milliseconds if so desired.

kishoreg commented 3 years ago

Good point. We have this in many places.

mcvsubbu commented 3 years ago

I had the same thought as well, way early on in Pinot development. However, when Adwait (a former pinot team member) and I researched a bit, we came across some articles that effectively said System.currentTimeMillis() was more efficient than nanos. I remember I was surprised because I thought getting nanos would be far cheaper, usually a relative clock that does not reach across cpus. Perhaps the performance penalty is platform dependent, I am not sure.

richardstartin commented 3 years ago

System.currentTimeMillis() was more efficient than nanos

This is the first time I have heard that and I must admit I am very skeptical about those articles' claims. On linux, System.currentTimeMillis delegates to CLOCK_REALTIME (see here) whereas System.nanoTime() delegates to CLOCK_MONOTONIC (here). There are faster clocks than CLOCK_MONOTONIC but Java users don't have access to them. Performance wise, the dominating factor in time measurement tends to be whether the clock_gettime syscall is mapped into userspace by vDSO or not, and e.g. Xen doesn't do that, which can confound measurement.

richardstartin commented 3 years ago

I just recalled that JFR uses RDTSC so can measure time more efficiently, so replacing manual tracing with JFR events might be a good idea if performance is a concern.