ethercrow / opentelemetry-haskell

The OpenTelemetry Haskell Client https://opentelemetry.io

Writing binary strings into eventlog #19

Closed yaitskov closed 4 years ago

maoe commented 4 years ago

Author of ghc-trace-events here. I'm not sure how those ot2 events are used, but if they're meant to be processed by tools other than ThreadScope or ghc-events show, I'd use traceBinaryEventIO with proper data types. It'd be much nicer to read/write and much more compact in output size (which in turn means faster to write) because it doesn't have to be a null-terminated UTF-8 string.

The only caveat of using this API is that it's only available for GHC >= 8.8.
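To illustrate the size argument, here is a hedged sketch (the HttpEvent type and its fields are made up for this comment, not part of opentelemetry-haskell) comparing a textual rendering plus trailing NUL against a generically derived binary encoding from the binary package:

```haskell
{-# LANGUAGE DeriveGeneric #-}

import Data.Binary (Binary, encode)
import qualified Data.ByteString.Lazy.Char8 as LBS8
import GHC.Generics (Generic)

-- Hypothetical event payload; illustrative only.
data HttpEvent = HttpEvent
  { method     :: String
  , statusCode :: Int
  } deriving (Generic, Show)

-- Generic deriving provided by the binary package.
instance Binary HttpEvent

-- Size of a textual rendering, plus the trailing NUL byte that a
-- null-terminated string format requires.
textSize :: HttpEvent -> Int
textSize e = fromIntegral (LBS8.length (LBS8.pack (show e))) + 1

-- Size of the generically derived binary encoding.
binarySize :: HttpEvent -> Int
binarySize = fromIntegral . LBS8.length . encode
```

For a record like HttpEvent "GET" 200, the binary encoding is roughly half the size of the show-based rendering, and neither side has to escape or re-parse text.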

yaitskov commented 4 years ago

@maoe, thanks very much. Actually I was looking for such a method, and was thinking about Bit Twiddling Hacks like the one below.

Btw, is there any plan for exposing some kind of batching API?

I was doing benchmarks and noticed that performance deteriorates when messages are small (less than 1kb) for the same total amount of data (1mb). I guess most messages would be much smaller than 1kb, so we could benefit from batching. Managing batching outside of the RTS would be hard; some sort of monad wrapping is required. For example, wai emits lots of small messages.

    setTag sp "span.kind" "server"
    setTag sp "component" "http"
    setTag sp "http.method" $ requestMethod req
    setTag sp "http.target" $ rawPathInfo req
    setTag sp "http.flavor" $ BS8.pack $ show (httpVersion req)
    app req $ \resp -> do
      setTag sp "http.status_code" (BS8.pack $ show $ statusCode $ responseStatus resp)

joinBsLong :: BS.ByteString -> Word64 -> IO BS.ByteString
joinBsLong bs1 u64 = do
  let !delims = 1 + 1
  let !totalLen = delims + (BS.length bs1) + (2 + 16)
  -- to pin or not to pin? if pinned, how do we pin bs1 and bs2?
  outBufMba <- newByteArray totalLen
  let !outBufBs = mbaToBs outBufMba
  _ <- [C.block|
    int {
      char * p = $bs-ptr:outBufBs;
      // first arg
      memcpy(p, $bs-ptr:bs1, $bs-len:bs1);
      p += $bs-len:bs1;
      // second arg
      memcpy(p, " 0x", 3);
      p += 3;
      *(uint64_t*)p = ($(uint64_t u64)  & 0x0F0F0F0F0f0f0f0fuL)       + 0x3030303030303030uL;
      p += 8;
      *(uint64_t*)p = (($(uint64_t u64) & 0xF0F0F0F0f0f0f0f0uL) >> 4) + 0x3030303030303030uL;
      return 0;
    }
  |]
  return outBufBs
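For comparison, a similar "ByteString plus hex word" join can be written without inline-c: word64HexFixed from Data.ByteString.Builder renders all 16 nibbles (including a-f, which the +0x30 trick above only handles for digits 0-9). This is only a sketch of the idea, not code from the PR, and it returns a pure ByteString rather than IO:

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BS8
import qualified Data.ByteString.Lazy as LBS
import Data.ByteString.Builder (byteString, string8, toLazyByteString, word64HexFixed)
import Data.Word (Word64)

-- Append " 0x<16 lowercase hex digits>" to bs1,
-- e.g. to tag a message with a span or trace id.
joinBsLong :: BS.ByteString -> Word64 -> BS.ByteString
joinBsLong bs1 u64 =
  LBS.toStrict . toLazyByteString $
    byteString bs1 <> string8 " 0x" <> word64HexFixed u64
```

For example, joinBsLong "abc" 255 yields "abc 0x00000000000000ff"; the Builder takes care of sizing the output buffer, so there is no manual length arithmetic to get wrong.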
yaitskov commented 4 years ago

@ethercrow, I decoupled the eventlog interpreter from the ASCII parser so it can be reused in binary mode, plus added a true binary log writer instead of ASCII via ByteString. This is an intermediate state: good for review while the PR is still small.

The last step is to add a binary parser and test the write/parse round trip.

maoe commented 4 years ago

Sorry for the delayed response. I'll take a look at the benchmark later, but the GHC RTS does have buffers to batch messages. I don't remember the buffer size or whether it's configurable, though.

yaitskov commented 4 years ago

@ethercrow ,

I completed the tests and wrote a simple write benchmark to measure the improvement.

With the event log turned off, binary mode is 100 times faster. What if we introduced a dummy module with the same API? Then it would be easy to disable tracing on a per-module basis.

ot-write-benchmark
benchmarking in text   mode
time                 12.23 μs   (12.17 μs .. 12.30 μs)
mean                 12.33 μs   (12.26 μs .. 12.44 μs)
std dev              276.4 ns   (185.0 ns .. 459.8 ns)
variance introduced by outliers: 23% (moderately inflated)

benchmarking in binary mode
time                 123.9 ns   (123.4 ns .. 124.6 ns)
mean                 124.0 ns   (123.6 ns .. 125.9 ns)
std dev              2.383 ns   (1.002 ns .. 4.917 ns)
variance introduced by outliers: 26% (moderately inflated)

With the event log enabled, it's 8 times faster, so the actual slowdown is in the eventlog itself.

ot-write-benchmark +RTS -l -oltrace.eventlog
benchmarking in text   mode
time                 13.09 μs   (13.03 μs .. 13.15 μs)
mean                 13.07 μs   (13.03 μs .. 13.11 μs)
std dev              130.8 ns   (108.6 ns .. 161.8 ns)

benchmarking in binary mode
time                 1.667 μs   (1.664 μs .. 1.669 μs)
mean                 1.669 μs   (1.665 μs .. 1.672 μs)
std dev              10.82 ns   (8.821 ns .. 14.00 ns)
maoe commented 4 years ago

Regarding the buffering in the eventlog writer, GHC RTS allocates # of capabilities + 1 buffers and each buffer is 2MB. See https://gitlab.haskell.org/ghc/ghc/-/blob/568d7279a80cf945271f0659f11a94eea3f1433d/rts/eventlog/EventLog.c#L33.

Each event has some overhead: the header (2 bytes for the event type and 4 bytes for the timestamp) plus 2 bytes for the payload size, which sums to 8 bytes in total. So extremely short messages tend to have non-negligible overhead for throughput.
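As a back-of-the-envelope illustration of that framing cost, using the 8-byte figure above: a 2-byte payload is 80% overhead on the wire, while a ~1kb payload is under 1%.

```haskell
-- Fraction of each logged event taken up by framing overhead,
-- assuming the 8 bytes of per-event framing described above.
overheadRatio :: Int -> Double
overheadRatio payloadBytes = 8 / fromIntegral (8 + payloadBytes)
```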

As for the WAI example, I was expecting something like the below to make better use of the binary event logging:

data WaiEvent
  = WaiRequestEvent { method, target :: !ByteString, flavor :: !HttpVersion }
  | WaiResponseEvent { statusCode :: !Int }
instance Binary WaiEvent where ...

setTag :: (Binary a, MonadIO m) => SpanInFlight -> a -> m ()

then in the middleware

middleware app = ...
  withSpan "WAI handler" $ \sp -> do
    setTag sp WaiRequestEvent { .. }
    app req $ \resp -> do
      setTag sp WaiResponseEvent { statusCode = statusCode $ responseStatus resp }
      sendResp resp
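A fleshed-out version of that sketch might derive the Binary instance generically. This is only an illustration: the field layout is made up, and HttpVersion is flattened to major/minor Ints here because it has no Binary instance out of the box.

```haskell
{-# LANGUAGE DeriveGeneric #-}

import Data.Binary (Binary, decode, encode)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BS8
import GHC.Generics (Generic)

-- Illustrative event type for the WAI middleware.
data WaiEvent
  = WaiRequestEvent { method, target :: BS.ByteString
                    , flavorMajor, flavorMinor :: Int }
  | WaiResponseEvent { statusCode :: Int }
  deriving (Eq, Show, Generic)

-- The generically derived instance produces a compact tagged encoding.
instance Binary WaiEvent

-- What a binary eventlog consumer would effectively do on its side.
roundTrip :: WaiEvent -> WaiEvent
roundTrip = decode . encode
```

The encode side would feed traceBinaryEventIO, and the decoder on the consumer side recovers the structured event without any text parsing.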

I'm not sure what the differences between begin/endSpan, setTag, and addEvents are but if those should also be encoded in the eventlog, one may define a data type for those as well.

The point is that with binary logging, events don't have to be represented in a textual format; a more efficient binary format can be used.

yaitskov commented 4 years ago

@maoe, I don't understand why ghc-trace-events requires a strict ByteString. ByteString's Builder supports only lazy ByteString as output. ghc-trace-events requires a strict one, I guess, because there is a mutable C buffer that a strict ByteString can be copied into efficiently. But take into account that an application developer gets a lazy ByteString from the builder and has to make it strict, i.e. copy the chunks of the lazy ByteString into a flat region, which is then copied again by traceBinaryEventIO! Double work.

Plus the eventlog feature can be disabled, but the strict ByteString API requires copying the bytes nonetheless, or thinking about RTS internals to check whether logging is enabled.

Second, I continued researching how to buffer data to minimize calls to traceBinaryEventIO, and today I noticed yet another 6x speed improvement with the following hack.

import Control.Concurrent (ThreadId, myThreadId)
import Control.Concurrent.MVar (MVar, modifyMVar, newMVar, readMVar)
import Control.Monad.IO.Class (MonadIO, liftIO)
import Data.ByteString.Builder (Builder, toLazyByteString)
import qualified Data.ByteString.Lazy as LBS
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import qualified Data.Map.Strict as M
import Debug.Trace.Binary (traceBinaryEventIO)  -- from ghc-trace-events
import System.IO.Unsafe (unsafePerformIO)

{-# NOINLINE mm #-}
mm :: MVar (M.Map ThreadId (IORef (Int, Builder)))
mm = unsafePerformIO (newMVar M.empty)

traceBuilder :: MonadIO m => (Int, Builder) -> m ()
traceBuilder sizedBuilder@(size, builder) = do
  liftIO $ do
    mtid <- myThreadId
    m <- readMVar mm
    case m M.!? mtid of
      Nothing -> do
             newRef <- newIORef sizedBuilder
             modifyMVar mm (\m' -> return (M.insert mtid newRef m', ()))
      Just bsRef -> do
             (n, bs) <- readIORef bsRef
             if n + size > 3000
             then do
               traceBinaryEventIO . LBS.toStrict $ toLazyByteString bs
               writeIORef bsRef sizedBuilder
             else do
               -- append on the right to keep events in arrival order
               writeIORef bsRef (n + size, bs <> builder)

With the above buffering:

time                 248.5 ns   (246.4 ns .. 251.3 ns)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 248.6 ns   (247.1 ns .. 252.1 ns)
std dev              7.531 ns   (3.089 ns .. 13.66 ns)

Writing every time:

time                 1.680 μs   (1.677 μs .. 1.683 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.679 μs   (1.675 μs .. 1.683 μs)
std dev              12.13 ns   (10.24 ns .. 14.38 ns)

It looks super ugly, but it works. Such a hack would look better at a lower level. Thread-local storage could probably do better, but I wasn't able to find such a feature in GHC. I think that if traceBinaryEventIO preferred a Builder over a strict ByteString, it would be even faster.

yaitskov commented 4 years ago

Regarding this PR: it is complete, and the binary format is used for messages. There is a lot of room for write-speed improvement, but that is out of scope for this ticket. Plus the PR is already pretty big.

maoe commented 4 years ago

@yaitskov

I'm not sure if I should continue the discussion here because it's out of scope, but I'll reply here:

I don't understand why ghc-trace-events requires a strict ByteString. ... ghc-trace-events requires a strict one, I guess, because there is a mutable C buffer that a strict ByteString can be copied into efficiently

Yes, that's almost right. ghc-trace-events requires a strict ByteString as input because the underlying primop (traceBinaryEvent#) takes a contiguous memory block as input, and as far as I know the underlying buffers (note there are multiple buffers in the RTS; a buffer per capability + a buffer which is not tied to any capabilities) aren't exposed in the RTS API. So there's no way to write events directly into any of the buffers. The primop always copies a byte chunk to one of the buffers.

But take into account that an application developer gets a lazy ByteString from the builder and has to make it strict, i.e. copy the chunks of the lazy ByteString into a flat region, which is then copied again by traceBinaryEventIO! Double work.

Given the limitations of the GHC RTS API, the conversion from the builder into a contiguous memory block needs to be done at some point before passing the chunk to the primop. I think it's best to do it in the application rather than in ghc-trace-events, because the extra copy can be avoided if you carefully choose an AllocationStrategy such that the Builder is turned into a single chunk of lazy ByteString. In that case, toStrict doesn't copy the chunk and just returns it. ghc-trace-events doesn't know what the best chunk sizes for the application are.
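That AllocationStrategy point can be sketched as follows: render the builder with a first buffer large enough for the whole payload, so the lazy result is a single chunk and toStrict has nothing to re-copy. The function names here are illustrative wrappers, not ghc-trace-events API; the strategy combinators are from Data.ByteString.Builder.Extra.

```haskell
import qualified Data.ByteString.Char8 as BS8
import qualified Data.ByteString.Lazy as LBS
import Data.ByteString.Builder (Builder, byteString)
import Data.ByteString.Builder.Extra (toLazyByteStringWith, untrimmedStrategy)

-- Render with a first chunk big enough for the whole output, so the
-- lazy ByteString is one chunk and LBS.toStrict is (nearly) free.
renderSingleChunk :: Int -> Builder -> LBS.ByteString
renderSingleChunk firstChunkSize =
  toLazyByteStringWith (untrimmedStrategy firstChunkSize firstChunkSize) LBS.empty

-- How many chunks the rendered lazy ByteString actually has.
chunkCount :: LBS.ByteString -> Int
chunkCount = length . LBS.toChunks
```

With a 4096-byte first chunk, a small builder renders to exactly one chunk, which is the case where toStrict simply returns the chunk instead of copying.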

Plus the eventlog feature can be disabled, but the strict ByteString API requires copying the bytes nonetheless, or thinking about RTS internals to check whether logging is enabled.

If user events are disabled in the RTS, ghc-trace-events' tracing functions are no-ops. They are lazy in the ByteString argument, so no strict ByteString is constructed unless you force it.

Second, I continued researching how to buffer data to minimize calls to traceBinaryEventIO, and today I noticed yet another 6x speed improvement with the following hack.

This hack loses the timestamp of each event, so it's a trade-off between correctness and throughput. Is that fine for the telemetry use case?