alphaHeavy / protobuf

An implementation of Google's Protocol Buffers in Haskell.
http://hackage.haskell.org/package/protobuf
BSD 3-Clause "New" or "Revised" License

Improve performance by using binary #19

Closed bohde closed 9 years ago

bohde commented 9 years ago

This pull request is backwards incompatible, but by using binary's Get and a custom Builder, we can get 3-5x better performance on average, with some cases exceeding 10x. I don't know if this is a path you'd consider, but I wanted to let you know about it.
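
For the curious, here is a hedged sketch (illustrative names only, not code from this PR) of the style of primitive that binary's Get monad enables: a base-128 varint decoder of the kind protobuf parsing is built on.

```haskell
-- Sketch only: a protobuf-style base-128 varint decoder written
-- against binary's Data.Binary.Get. getVarint is an illustrative
-- name, not an identifier from this PR.
import Data.Binary.Get (Get, getWord8, runGet)
import Data.Bits ((.&.), (.|.), shiftL, testBit)
import Data.Word (Word64)
import qualified Data.ByteString.Lazy as BL

getVarint :: Get Word64
getVarint = go 0 0
  where
    go :: Int -> Word64 -> Get Word64
    go shift acc = do
      b <- getWord8
      let acc' = acc .|. (fromIntegral (b .&. 0x7f) `shiftL` shift)
      if testBit b 7              -- high bit set: more bytes follow
        then go (shift + 7) acc'
        else return acc'

-- e.g. runGet getVarint (BL.pack [0xac, 0x02]) == 300
```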

We were experimenting with using protobuf instead of aeson for one of our data types, and found that protobuf was slower for both encoding and decoding. A simplified version of this data type is in bench/Nested.hs.

A quick change to use binary instead of cereal improved decode speed, but on the encoding side neither Data.Binary.Put nor Data.Binary.Builder was much faster.

Profiling revealed that encoding Messages was expensive, because we'd need to fully encode each message, calculate its length, and then write both. If we use the Builder interface, with its Monoid instance, to build results, then we can memoize the length of the resulting ByteString during construction by using a (Sum, Builder) pair. Because every type serialized in a protobuf either has a statically known length or is length-prefixed, this adds a minor amount of bookkeeping in the worst case and a nice improvement in the best case. A version of this builder is included under src/Data/Binary/Builder/Sized.hs.
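
A minimal sketch of that idea, with illustrative names (the real implementation lives in src/Data/Binary/Builder/Sized.hs on this branch):

```haskell
-- Sketch: a builder that carries the length of its eventual output,
-- so length-delimited messages are encoded exactly once.
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Monoid (Sum(..))
import Data.Word (Word8, Word64)
import qualified Data.Binary.Builder as B
import qualified Data.ByteString as BS

newtype Sized = Sized (Sum Int, B.Builder)

instance Semigroup Sized where
  Sized (m, a) <> Sized (n, b) = Sized (m <> n, a `B.append` b)

instance Monoid Sized where
  mempty = Sized (mempty, B.empty)

size :: Sized -> Int
size (Sized (n, _)) = getSum n

-- Primitives record their statically known sizes as they are built.
word8 :: Word8 -> Sized
word8 w = Sized (Sum 1, B.singleton w)

bytes :: BS.ByteString -> Sized
bytes b = Sized (Sum (BS.length b), B.fromByteString b)

varint :: Word64 -> Sized
varint n
  | n < 0x80  = word8 (fromIntegral n)
  | otherwise = word8 (fromIntegral (n .&. 0x7f) .|. 0x80)
             <> varint (n `shiftR` 7)

-- An embedded message is its length as a varint followed by its bytes;
-- the length is already known, so no second encoding pass is needed.
embedded :: Sized -> Sized
embedded m = varint (fromIntegral (size m)) <> m
```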

I've added benchmarks for this branch, with corresponding benchmarks for master in https://github.com/joshbohde/protobuf/tree/benchmark. They also include the Burst data type from #4. Results from a run on my laptop are at https://gist.github.com/joshbohde/2b7b763805ddb2c6bfd1, and summarized in this table:

| benchmark             | master   | binary   | change (%) | throughput increase (×) |
|-----------------------|----------|----------|------------|-------------------------|
| burst/encoding/1      | 2.648 μs | 520.7 ns | -80.34     | 4.09                    |
| burst/encoding/10     | 17.06 μs | 2.995 μs | -82.44     | 4.7                     |
| burst/encoding/100    | 155.3 μs | 26.99 μs | -82.62     | 4.75                    |
| burst/encoding/1000   | 1.702 ms | 377.4 μs | -77.83     | 3.51                    |
| burst/decoding/1      | 3.501 μs | 759.1 ns | -78.32     | 3.61                    |
| burst/decoding/10     | 22.46 μs | 3.374 μs | -84.98     | 5.66                    |
| burst/decoding/100    | 191.3 μs | 29.81 μs | -84.42     | 5.42                    |
| burst/decoding/1000   | 2.028 ms | 414.0 μs | -79.59     | 3.9                     |
| nested/encoding/a1    | 26.53 μs | 2.414 μs | -90.9      | 9.99                    |
| nested/encoding/a10   | 189.3 μs | 14.50 μs | -92.34     | 12.06                   |
| nested/encoding/a100  | 1.660 ms | 155.7 μs | -90.62     | 9.66                    |
| nested/encoding/a1000 | 15.82 ms | 3.731 ms | -76.42     | 3.24                    |
| nested/encoding/b10   | 271.1 μs | 20.49 μs | -92.44     | 12.23                   |
| nested/encoding/b100  | 2.524 ms | 264.2 μs | -89.53     | 8.55                    |
| nested/encoding/b1000 | 25.74 ms | 5.751 ms | -77.66     | 3.48                    |
| nested/decoding/a1    | 35.99 μs | 1.011 μs | -97.19     | 34.6                    |
| nested/decoding/a10   | 244.2 μs | 1.071 μs | -99.56     | 227.0                   |
| nested/decoding/a100  | 2.438 ms | 1.079 μs | -99.96     | 2258.5                  |
| nested/decoding/a1000 | 24.89 ms | 3.765 μs | -99.98     | 6609.89                 |
| nested/decoding/b10   | 353.8 μs | 3.705 μs | -98.95     | 94.49                   |
| nested/decoding/b100  | 3.666 ms | 29.94 μs | -99.18     | 121.4                   |
| nested/decoding/b1000 | 37.96 ms | 446.8 μs | -98.82     | 83.96                   |

I'm suspicious of some of these numbers (the nested/decoding/a timings barely grow as the input grows by three orders of magnitude), but I can't find anything obviously wrong with the benchmarks.

Internally, we're seeing about a 100x speedup in decoding, and a 10x speedup in encoding of real data using this branch.

NathanHowell commented 9 years ago

I'm just about to head out to BayHac; I'll review the patch sometime today. I'd also like to see some heap profiles.

Back when we first started work on protobuf, binary was missing functionality that cereal had; I think it was lacking an incremental parser. The last attempt at switching (a year or two ago) didn't show much of a performance difference, but there has been a lot more work on binary since then, so I'm not surprised the decodes are magically faster.
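
(For context on "incremental parser": binary has since gained one. A hedged sketch of driving binary's Data.Binary.Get.runGetIncremental over a list of chunks, with illustrative names:)

```haskell
-- Sketch: feeding a Get decoder input chunk by chunk through binary's
-- incremental Decoder interface. decodeChunks is an illustrative name.
import Data.Binary.Get (Get, Decoder(..), runGetIncremental)
import qualified Data.ByteString as BS

decodeChunks :: Get a -> [BS.ByteString] -> Either String a
decodeChunks g = go (runGetIncremental g)
  where
    go (Done _ _ x)   _      = Right x
    go (Fail _ _ err) _      = Left err
    go (Partial k)    (c:cs) = go (k (Just c)) cs  -- feed the next chunk
    go (Partial k)    []     = go (k Nothing) []   -- signal end of input
```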

bohde commented 9 years ago

Using the following code, modified for the respective branch:

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data
import Nested
import qualified Data.ByteString as BS

-- Encode a large Burst message and write it to disk.
writeBurst :: IO ()
writeBurst = BS.writeFile "burst.pb" $ enc $ burst 1000000

-- Read it back and force the decode by printing the result.
readBurst :: IO ()
readBurst = do
  a <- BS.readFile "burst.pb"
  print $ decBurst a

-- Same pattern for the nested data type from bench/Nested.hs.
writeNested :: IO ()
writeNested = BS.writeFile "nested.pb" $ enc $ nested 100 1000

readNested :: IO ()
readNested = do
  a <- BS.readFile "nested.pb"
  print $ decNested a

main :: IO ()
main = readNested
```
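
(Presumably this driver was compiled with profiling enabled, e.g. ghc -O2 -prof -fprof-auto Main.hs, and run with a heap-profiling RTS flag such as +RTS -hy -RTS, to produce the heap profiles requested above; the exact invocation isn't shown in the thread.)
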
NathanHowell commented 9 years ago

Looks like an easy decision then eh? :+1: This is awesome.