Closed: yaitskov closed this issue 4 years ago.
@maoe, thanks very much. I was actually looking for such a method and had been thinking about Bit Twiddling Hacks like the one below.
Btw, is there any plan for exposing some kind of batching API?
I was running benchmarks and noticed that performance deteriorates when messages are small (less than 1 KB) while the total volume stays the same (1 MB). I guess most messages would be much smaller than 1 KB, so we could benefit from batching. Managing batching outside the RTS would be hard; some sort of monad wrapping would be required. For example, wai produces lots of small messages.
setTag sp "span.kind" "server"
setTag sp "component" "http"
setTag sp "http.method" $ requestMethod req
setTag sp "http.target" $ rawPathInfo req
setTag sp "http.flavor" $ BS8.pack $ show (httpVersion req)
app req $ \resp -> do
  setTag sp "http.status_code" (BS8.pack $ show $ statusCode $ responseStatus resp)
joinBsLong :: BS.ByteString -> Word64 -> IO BS.ByteString
joinBsLong bs1 u64 = do
  let !delims = 1 + 1
  let !totalLen = delims + BS.length bs1 + (2 + 16)
  -- to pin or not to pin? and if we pin, how do we pin bs1 and bs2?
  outBufMba <- newByteArray totalLen
  let !outBufBs = mbaToBs outBufMba
  _ <- [C.block|
    int {
      char * p = $bs-ptr:outBufBs;
      // first arg
      memcpy(p, $bs-ptr:bs1, $bs-len:bs1);
      p += $bs-len:bs1;
      // second arg: " 0x" separator, then 16 nibble-encoded bytes
      memcpy(p, " 0x", 3);
      p += 3;
      *(uint64_t*)p = ($(uint64_t u64) & 0x0f0f0f0f0f0f0f0fuL) + 0x3030303030303030uL;
      p += 8;
      *(uint64_t*)p = (($(uint64_t u64) & 0xf0f0f0f0f0f0f0f0uL) >> 4) + 0x3030303030303030uL;
      return 0;
    }
  |]
  return outBufBs
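For reference, the nibble trick in the C block can be reproduced in pure Haskell. This is a hypothetical helper, not part of the PR; note that it assumes the little-endian byte layout the C stores rely on, and that adding 0x30 maps nibbles 0-9 to '0'-'9' but 10-15 to ':'-'?', so the encoding is reversible rather than standard hex:

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Word (Word64, Word8)
import qualified Data.ByteString as BS

-- Hypothetical pure equivalent of the two uint64_t stores above:
-- first the low nibble of each byte, then the high nibble, each
-- offset by 0x30 ('0'). Assumes little-endian byte order, as the
-- C code does.
encodeWord64Nibbles :: Word64 -> BS.ByteString
encodeWord64Nibbles u64 = BS.pack (lowNibbles ++ highNibbles)
  where
    bytes :: [Word8]
    bytes = [fromIntegral (u64 `shiftR` (8 * i)) | i <- [0 .. 7]]
    lowNibbles  = [0x30 + (b .&. 0x0f)   | b <- bytes]
    highNibbles = [0x30 + (b `shiftR` 4) | b <- bytes]
```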
@ethercrow, I decoupled the eventlog interpreter from the ASCII parser so it can be reused in binary mode, and added a true binary log writer instead of ASCII via ByteString. This is an intermediate state, good for review while the PR is still small.
The last step is to add a binary parser and test the write/parse round-trip.
Sorry for the delayed response. I'll take a look at the benchmark later, but the GHC RTS does have buffers to batch messages. I don't remember the buffer size or whether it's configurable, though.
@ethercrow ,
I completed the tests and wrote a simple write benchmark to measure the improvement.
With the event log turned off, binary mode is 100 times faster. What if we introduced a dummy module with the same API? Then it would be easier to disable tracing on a per-module basis.
ot-write-benchmark
benchmarking in text mode
time 12.23 μs (12.17 μs .. 12.30 μs)
mean 12.33 μs (12.26 μs .. 12.44 μs)
std dev 276.4 ns (185.0 ns .. 459.8 ns)
variance introduced by outliers: 23% (moderately inflated)
benchmarking in binary mode
time 123.9 ns (123.4 ns .. 124.6 ns)
mean 124.0 ns (123.6 ns .. 125.9 ns)
std dev 2.383 ns (1.002 ns .. 4.917 ns)
variance introduced by outliers: 26% (moderately inflated)
With the event log enabled, binary mode is 8 times faster, so the actual slowdown is in the eventlog itself.
ot-write-benchmark +RTS -l -oltrace.eventlog
benchmarking in text mode
time 13.09 μs (13.03 μs .. 13.15 μs)
mean 13.07 μs (13.03 μs .. 13.11 μs)
std dev 130.8 ns (108.6 ns .. 161.8 ns)
benchmarking in binary mode
time 1.667 μs (1.664 μs .. 1.669 μs)
mean 1.669 μs (1.665 μs .. 1.672 μs)
std dev 10.82 ns (8.821 ns .. 14.00 ns)
Regarding the buffering in the eventlog writer, GHC RTS allocates # of capabilities + 1 buffers and each buffer is 2MB. See https://gitlab.haskell.org/ghc/ghc/-/blob/568d7279a80cf945271f0659f11a94eea3f1433d/rts/eventlog/EventLog.c#L33.
Each event has some overhead: the header (2 bytes for the event type and 4 bytes for the timestamp) plus 2 bytes for the payload size, which sum to 8 bytes in total. So extremely short messages tend to have non-negligible overhead in terms of throughput; an 8-byte payload, for example, doubles in size on the wire.
As for the WAI example, I was expecting something like the following to make better use of binary event logging:
data WaiEvent
  = WaiRequestEvent { method, target :: !ByteString, flavor :: !HttpVersion }
  | WaiResponseEvent { statusCode :: !Int }

instance Binary WaiEvent where ...

setTag :: (Binary a, MonadIO m) => SpanInFlight -> a -> m ()

then in the middleware

middleware app = ...
  withSpan "WAI handler" $ \sp -> do
    setTag sp WaiRequestEvent { .. }
    app req $ \resp -> do
      setTag sp WaiResponseEvent { statusCode = statusCode $ responseStatus resp }
      sendResp resp
I'm not sure what the differences between begin/endSpan, setTag, and addEvents are, but if those should also be encoded in the eventlog, one may define data types for them as well.
The point is that with binary logging, events don't have to be represented in a textual format; a more efficient binary format can be used.
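maoe's sketch leaves the Binary instance elided. One possible encoding, a minimal sketch using a tag byte per constructor, could look like the following (the HttpVersion type here is a local stand-in for the one from http-types, which has no Binary instance, so the whole snippet is self-contained rather than the actual middleware code):

```haskell
import Data.Binary (Binary (..), Get, decode, encode)
import Data.Word (Word8)
import qualified Data.ByteString as BS

-- Local stand-in for http-types' HttpVersion.
data HttpVersion = HttpVersion { major :: !Int, minor :: !Int }
  deriving (Eq, Show)

data WaiEvent
  = WaiRequestEvent { method, target :: !BS.ByteString, flavor :: !HttpVersion }
  | WaiResponseEvent { statusCode :: !Int }
  deriving (Eq, Show)

instance Binary HttpVersion where
  put (HttpVersion ma mi) = put ma >> put mi
  get = HttpVersion <$> get <*> get

-- One tag byte per constructor, then the fields in order.
instance Binary WaiEvent where
  put (WaiRequestEvent m t f) = put (0 :: Word8) >> put m >> put t >> put f
  put (WaiResponseEvent s)    = put (1 :: Word8) >> put s
  get = do
    tag <- get :: Get Word8
    case tag of
      0 -> WaiRequestEvent <$> get <*> get <*> get
      1 -> WaiResponseEvent <$> get
      _ -> fail "unknown WaiEvent tag"
```

The encoded payload would then be passed to traceBinaryEventIO instead of a rendered string.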
@maoe, I don't understand why ghc-trace-events requires a strict ByteString. ByteString's Builder supports only lazy ByteString as output. I guess ghc-trace-events requires a strict one because there is a mutable C buffer that a strict ByteString can be copied into efficiently. But take into account that an application developer gets a lazy ByteString from the Builder and has to make it strict, i.e. copy the chunks of the lazy ByteString into a flat region, which is then copied again by traceBinaryEventIO. Double work.
Plus, the eventlog feature can be disabled, but the strict ByteString API requires copying the bytes nonetheless, unless the RTS internals are consulted to check whether logging is enabled.
Second: I continued researching how to buffer data to minimize calls to traceBinaryEventIO, and today I noticed yet another 6x speed improvement with the following hack.
{-# NOINLINE mm #-}
mm :: MVar (M.Map ThreadId (IORef (Int, Builder)))
mm = unsafePerformIO (newMVar M.empty)

traceBuilder :: MonadIO m => (Int, Builder) -> m ()
traceBuilder sizedBuilder@(size, builder) =
  liftIO $ do
    mtid <- myThreadId
    m <- readMVar mm
    case m M.!? mtid of
      Nothing -> do
        newRef <- newIORef sizedBuilder
        modifyMVar mm (\m' -> return (M.insert mtid newRef m', ()))
      Just bsRef -> do
        (n, bs) <- readIORef bsRef
        if n + size > 3000
          then do
            -- flush the accumulated builder as a single binary event
            traceBinaryEventIO . LBS.toStrict $ toLazyByteString bs
            writeIORef bsRef sizedBuilder
          else
            -- append the new builder after the accumulated one
            -- (keeping events in arrival order)
            writeIORef bsRef (n + size, bs <> builder)
With the above buffering:
time 248.5 ns (246.4 ns .. 251.3 ns)
0.999 R² (0.997 R² .. 1.000 R²)
mean 248.6 ns (247.1 ns .. 252.1 ns)
std dev 7.531 ns (3.089 ns .. 13.66 ns)
Writing every time:
time 1.680 μs (1.677 μs .. 1.683 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 1.679 μs (1.675 μs .. 1.683 μs)
std dev 12.13 ns (10.24 ns .. 14.38 ns)
It looks super ugly, but it works. Such a hack would look better at a lower level; thread locals could probably do this more cleanly, but I wasn't able to find such a feature in GHC. I also think that if traceBinaryEventIO accepted a Builder instead of a strict ByteString, it would be even faster.
Regarding this PR: it is complete, and the binary format is used for messages. There is still a lot of room for write-speed improvement, but that is out of scope for this ticket, and the PR is already pretty big.
@yaitskov
I'm not sure if I should continue the discussion here because it's out of scope, but I'll reply here:
I don't understand why ghc-trace-events requires strict ByteString? ... ghc-trace-events requires strict one, I guess, because there is a mutable C buffer where you can copy efficiently a strict ByteString to
Yes, that's almost right. ghc-trace-events requires a strict ByteString as input because the underlying primop (traceBinaryEvent#) takes a contiguous memory block as input, and as far as I know the underlying buffers (note there are multiple buffers in the RTS: one per capability, plus one that is not tied to any capability) aren't exposed in the RTS API. So there's no way to write events directly into any of the buffers; the primop always copies a byte chunk into one of them.
but take into account that an application developer gets lazy ByteString from the builder and he has to make it strict i.e. to copy chunks of lazy byte string into flat region, which would be copied again by traceBinaryEventIO! Double work.
Given the limitations of the GHC RTS API, the conversion from the Builder into a contiguous memory block needs to happen at some point before passing the chunk to the primop. I think it's best to do it in the application rather than in ghc-trace-events, because the extra copy can be avoided if you carefully choose an AllocationStrategy such that the Builder can be turned into a single-chunk lazy ByteString. In that case toStrict doesn't copy the chunk and just returns it. ghc-trace-events doesn't know what the best chunk sizes for the application are.
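The AllocationStrategy approach above can be sketched as follows. The 4096-byte hint and the helper names are assumptions for illustration; an application would tune the hint to its typical event size (and a Builder fed a large ByteString via byteString may still insert it as a separate chunk):

```haskell
import Data.ByteString.Builder (Builder, byteString)
import Data.ByteString.Builder.Extra (toLazyByteStringWith, untrimmedStrategy)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as LBS

-- With a first-chunk size hint at least as large as the event,
-- the Builder produces a single chunk.
singleChunk :: Builder -> LBS.ByteString
singleChunk = toLazyByteStringWith (untrimmedStrategy 4096 4096) LBS.empty

-- toStrict on a single-chunk lazy ByteString returns the chunk
-- without copying it again, so this avoids the double copy.
builderToStrict :: Builder -> BS.ByteString
builderToStrict = LBS.toStrict . singleChunk
```

The resulting strict ByteString can then be handed straight to traceBinaryEventIO.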
Plus eventlog feature can be disabled, but strict ByteString API requires to copy bytes nonetheless or think about RTS internals checking whether logging enabled.
If user events are disabled in the RTS, ghc-trace-events' tracing functions are no-ops. They are lazy in the ByteString argument, so no strict ByteStrings are constructed unless you force them.
Second. I continued research for buffering data to minimize access to traceBinaryEventIO and today I noticed yet another 6x speed improvement with following hack.
This hack loses the timestamp of each event, so it's a trade-off between correctness and throughput. Is that acceptable for the telemetry use case?
Author of ghc-trace-events here. I'm not sure how those ot2 events are used, but if they're meant to be processed by tools other than ThreadScope or ghc-events show, I'd use traceBinaryEventIO with proper data types. It'd be much nicer to read/write and much more compact in output size (which in turn means faster to write) because the payload doesn't have to be a null-terminated UTF-8 string. The only caveat of using this API is that it's only available for GHC >= 8.8.