alphaHeavy / protobuf

An implementation of Google's Protocol Buffers in Haskell.
http://hackage.haskell.org/package/protobuf
BSD 3-Clause "New" or "Revised" License

Suspicious amount of time and allocation on decode path #4

Closed: istathar closed this issue 9 years ago

istathar commented 10 years ago

I'm not sure if this rings any bells, but we've finally reached the point of doing some profiling on our application and there's one standout: getVarintPrefixedBS is doing the bulk of the allocation (fine) but taking an awfully long time to do it.

                                                                                             individual     inherited
COST CENTRE                 MODULE                                 no.     entries  %time %alloc   %time %alloc  ticks     bytes 
getVarintPrefixedBS         Data.ProtocolBuffers.Wire              264           0   77.7   57.8    77.7   57.8   1974 2900040432

I thought I'd ask in case there's anything obvious there that we should have a look at.

AfC

NathanHowell commented 10 years ago

I've seen this a bit in the past... my recollection is that it's from code inlined from cereal, and getVarintPrefixedBS doesn't do much else. I think the alternatives would be to fix cereal, switch to binary, or use kmett's package that abstracts over the two.
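For context, the function in question reads a varint length prefix and then that many bytes. Here's a minimal sketch of that shape on top of cereal; the names mirror the library's, but this is an illustration, not the actual implementation:

```haskell
{-# LANGUAGE BangPatterns #-}

import Data.Bits (shiftL, testBit, (.&.), (.|.))
import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import Data.Serialize.Get (Get, getBytes, getWord8, runGet)
import Data.Word (Word64)

-- Base-128 varint: 7 payload bits per byte, least significant group
-- first; the high bit of each byte signals continuation.
getVarInt :: Get Word64
getVarInt = go 0 0
  where
    go !shift !acc = do
      b <- getWord8
      let acc' = acc .|. (fromIntegral (b .&. 0x7f) `shiftL` shift)
      if testBit b 7
        then go (shift + 7) acc'
        else return acc'

-- A length-delimited field: varint length prefix, then that many bytes.
getVarintPrefixedBS :: Get ByteString
getVarintPrefixedBS = getVarInt >>= getBytes . fromIntegral
```

Every `getWord8` step goes through cereal's bind, which is where the continuation allocation shows up in profiles like the one above.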

NathanHowell commented 10 years ago

The suspect function is here if you want to check it out, I'm also open to other suggestions. Are you decoding a lot of strings?

istathar commented 10 years ago

Yeah I am. The protobufs are (pretty . show . contrived) something like

datacenter:lhr1, hostname:secure.example.org, metric:eth0-tx-bytes 1390810279171907000 15301

The numbers are just a few bytes, varint- and/or fixed64-encoded, but the key-value pairs are about a hundred bytes and, yeah, string heavy.

AfC

NathanHowell commented 10 years ago

How much of an impact is this having on your app? I'd like to get this sorted out but I'm short of time at the moment.

istathar commented 10 years ago

  1,000 data points 0.23 seconds; 10,000 data points 9.6 seconds.

Ouch. Profiling shows it to be about 75% of time and 92% of the allocation burden.

      fmap                    Data.Serialize.Get                     301       10000    0.0    0.0    75.4   92.5
       fmap.\                 Data.Serialize.Get                     302       40002    0.0    0.0    75.4   92.5
        getWord8              Data.Serialize.Get                     312           0    0.0    0.0    75.4   92.4
         >>=                  Data.Serialize.Get                     313       50001    0.0    0.0    75.4   92.4
          >>=.\               Data.Serialize.Get                     314       70002    0.1    0.0    75.4   92.4
           demandInput.\      Data.Serialize.Get                     322           1    0.0    0.0     0.0    0.0
            mplus             Data.Serialize.Get                     323           0    0.0    0.0     0.0    0.0
             mplus.\          Data.Serialize.Get                     324           0    0.0    0.0     0.0    0.0
              mplus.\.kf'     Data.Serialize.Get                     325           1    0.0    0.0     0.0    0.0
               unGet          Data.Serialize.Get                     326           1    0.0    0.0     0.0    0.0
           >>=.\.ks'          Data.Serialize.Get                     316      130000   75.2   92.3    75.3   92.4
            unGet             Data.Serialize.Get                     318      130000    0.0    0.0     0.0    0.0
            return            Data.Serialize.Get                     317       70001    0.0    0.0     0.1    0.1
             return.\         Data.Serialize.Get                     319       70000    0.0    0.0     0.1    0.1
              fmap.\.ks'      Data.Serialize.Get                     320       40000    0.1    0.1     0.1    0.1
           unGet              Data.Serialize.Get                     315       70003    0.0    0.0     0.0    0.0

So it's on the Get path. cereal has a Builder that it employs for building ByteStrings on the encode path, but it seems to be doing something horrible on the decode path. We're looking at it further now. Offhand I can't see anything stupid that protobuf is doing, but overall the heap profile shows we're getting a steady leak, especially in PINNED. That's a bit of a surprise.

AfC

christian-marie commented 10 years ago

I've taken a look at this; it appears that cereal is simply slow due to its continuation-passing nature and the fact that each bind creates a new continuation.

This problem is easy to replicate with any repeated field.

Here is an example that shows the issue:

{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Data.ProtocolBuffers
import GHC.Generics(Generic)
import Data.ByteString(ByteString)
import qualified Data.ByteString as B
import Data.TypeLevel(D1)
import Data.Serialize

data Burst = Burst {
    frames :: Repeated D1 (Value ByteString)
} deriving (Generic, Eq, Show)

instance Encode Burst
instance Decode Burst

encodeBurst :: [ByteString] -> ByteString
encodeBurst = runPut . encodeMessage . Burst . putField

decodeBurst :: ByteString -> Either String Burst
decodeBurst = runGet decodeMessage

main :: IO ()
main = do
    let encoded = encodeBurst $ replicate 30000 "hai"
    print $ B.length encoded

    let decoded = decodeBurst encoded
    either error (const $ return ()) decoded

christian-marie commented 10 years ago

Disregard that hand-wavy statement. I've found the real issue. I will provide free performance in a pull request shortly. Feel free to actually implement the fix however you like. The issue is with list traversal on insertion into the hashmap.
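To illustrate the class of problem (hypothetical helper names here; the actual fix is in the pull request): Data.HashMap.Strict's `insertWith f k v` calls its combining function as `f new old`, so accumulating a repeated field by appending walks the whole existing list on every occurrence, while consing is constant-time:

```haskell
import Data.Hashable (Hashable)
import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HashMap
import Data.List (foldl')

-- O(length old) per insert: n occurrences cost O(n^2) in total.
slowInsert :: (Eq k, Hashable k) => k -> v -> HashMap k [v] -> HashMap k [v]
slowInsert k v = HashMap.insertWith (\new old -> old ++ new) k [v]

-- O(1) per insert: cons each occurrence, reverse once at the end
-- if the original wire order is needed.
fastInsert :: (Eq k, Hashable k) => k -> v -> HashMap k [v] -> HashMap k [v]
fastInsert k v = HashMap.insertWith (\new old -> new ++ old) k [v]

-- Collect occurrences, restoring wire order with a single reverse.
collect :: (Eq k, Hashable k) => [(k, v)] -> HashMap k [v]
collect = fmap reverse . foldl' (\m (k, v) -> fastInsert k v m) HashMap.empty
```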

NathanHowell commented 10 years ago

Let me know if this fixes your perf issue. I'll cut a new release if it does.

istathar commented 10 years ago

@NathanHowell definite improvement. We're down to 1.6 seconds from 9.6! @christian-marie is definitely on the right track. That same part of cereal is still costing about 24% of time, though. Makes me wonder if there's another thunk leak happening somewhere.

NathanHowell commented 10 years ago

Definitely possible, the issue I've seen was from Value Text fields iirc, not Repeated x. If it's not from cereal it's probably excess laziness. I'm going to leave this issue open for now, but I did cut a new release (v0.1.3 8b2333f35cf18e814fc1be06c3dc087af64f2dfb) with @christian-marie's patches.

christian-marie commented 10 years ago

I'm going to give hs-packer a go as soon as we can get the missing features into it:

https://github.com/vincenthz/hs-packer/pull/2 https://github.com/vincenthz/hs-packer/pull/3

NathanHowell commented 10 years ago

I've heard there is a big rewrite coming for binary... and it also has enough features (as of 0.6.x iirc) to work for protobuf too. I think it is usually faster than cereal. If hs-packer works and is fast, I'm certainly open to taking it on.

christian-marie commented 10 years ago

I'm putting hs-packer on the back-burner until someone can work this out:

https://github.com/vincenthz/hs-packer/issues/5

I did implement the decode path, though that wasn't an order of magnitude faster. You can see that here:

be0a7a25d361e80b833385482b1f177292a2c65c

I believe that the allocation of thousands of Haskell objects for each element of a repeated field might be the bottleneck here, not the serialization library.

NathanHowell commented 10 years ago

Because of HashMap Tag [WireField] in the intermediate decoding step?

christian-marie commented 10 years ago

I'm not entirely sure. I suspect so though.

Running a large decode with +RTS -hy shows a lot of WireField, HashMap and ByteString allocation.

My bet is that if you can think of a faster way to do that, you'll probably nail the performance issues.

NathanHowell commented 10 years ago

The other obvious way is to use a static (per-type) HashMap of setters instead of a HashMap of WireFields. I've thought about doing this before; it may still be necessary to batch repeated field updates to minimize record copies (and GC time). I was originally thinking it would be faster to use a HashMap because of this, but I haven't tried it. I can hack up an implementation later in the week unless you want to take a stab at it.

christian-marie commented 10 years ago

A simple test, decoding 2.9MB of repeated four byte strings: http://ponies.io/raw/protobuf-hy.pdf

{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Data.ProtocolBuffers
import GHC.Generics(Generic)
import Data.ByteString(ByteString)
import qualified Data.ByteString as B
import Data.TypeLevel(D1)
import Data.Serialize

data Burst = Burst {
    frames :: Repeated D1 (Value ByteString)
} deriving (Generic, Eq, Show)

instance Encode Burst
instance Decode Burst

encodeBurst :: [ByteString] -> ByteString
encodeBurst = runPut . encodeMessage . Burst . putField

decodeBurst :: ByteString -> Either String Burst
decodeBurst = runGet decodeMessage

main :: IO ()
main = do
    Right m <- decodeBurst `fmap` B.readFile "encoded"
    print m

christian-marie commented 10 years ago

I'll note that the HashMap allocations are missing in that graph, probably because the only thing growing in this test case is one key (a list).

christian-marie commented 10 years ago

Also interestingly: ARR_WORDS, which I'm pretty sure is the underlying ByteString data buffer (2.9MB worth), stays pretty constant. So that's good. I don't think there's much we can do about the ByteString container allocations unless we're doing more than we need to.

Which only leaves WireField and [] containers to be optimized.

As for taking a stab at this alternative implementation you mention: I'd be happy to, but I'm not going to be able to prioritize it until next week either. Maybe we can poke each other again then. Otherwise, if you can describe the implementation in a little more detail, I'll probably get around to trying it in my own time out of curiosity.

christian-marie commented 10 years ago

Hi Nathan, poke.

This is bugging us again, we'd like it to be faster. Any suggestions on where our efforts should be directed to cut down on these allocations next?

NathanHowell commented 10 years ago

Is the concern allocation or runtime? Reading 500k ByteStrings into a list has a similar heap profile, with the exception of WireField.DelimitedField.

quickDecode :: Get [ByteString]
quickDecode = go [] (512*1024) where
  go xs 0 = return (reverse xs)
  go xs i = do
    x <- getBytes 4
    go (x:xs) (i-1)

[heap profile image]

alternativeDecode :: Get [ByteString]
alternativeDecode = reverse <$> go [] where
  go xs = do
    mx <- Just <$> getBytes 4 <|> pure Nothing
    case mx of
      Just x  -> go (x:xs)
      Nothing -> pure xs

[heap profile image]

decodeBurst :: ByteString -> Either String Burst
decodeBurst = runGet decodeMessage

[heap profile image]

christian-marie commented 10 years ago

Our concern is specifically the speed of decoding many delimited fields, and decoding many small messages in series.

We are seeing this kind of thing currently:

COST CENTRE MODULE             %time %alloc

>>=.\.ks'   Data.Serialize.Get  50.8   58.3
>>=.\       Data.Serialize.Get  15.8   18.4
fmap.\.ks'  Data.Serialize.Get   7.7   15.1

Where ks' is, I think, the success continuation of bind? I.e. mfield <- getWireField
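For reference, cereal's Get is continuation-passing, with a failure continuation (kf) and a success continuation (ks), and every bind wraps ks in a fresh closure. A stripped-down sketch of that shape; the real type in Data.Serialize.Get also threads buffer state and an input-completeness flag:

```haskell
{-# LANGUAGE RankNTypes #-}

import Data.ByteString (ByteString)
import qualified Data.ByteString as B

data Result r = Fail String | Done r deriving (Eq, Show)

type Failure   r = ByteString -> String -> Result r  -- kf
type Success a r = ByteString -> a      -> Result r  -- ks

newtype Get a = Get
  { unGet :: forall r. ByteString -> Failure r -> Success a r -> Result r }

returnGet :: a -> Get a
returnGet a = Get $ \s _ ks -> ks s a

-- Each bind allocates a new success continuation: the >>=.\.ks' cost centre.
bindGet :: Get a -> (a -> Get b) -> Get b
bindGet m f = Get $ \s kf ks ->
  unGet m s kf (\s' a -> unGet (f a) s' kf ks)

runSketch :: Get a -> ByteString -> Result a
runSketch g s = unGet g s (\_ err -> Fail err) (\_ a -> Done a)
```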

I'm thinking that reducing the allocations by trying to box things as little as possible will decrease the runtime. Our bottom line is reducing CPU usage and runtime, memory is cheap as far as we're concerned. But allocations are expensive.

The motivation behind this is that we are currently bottlenecked almost completely on decoding protobufs.

NathanHowell commented 10 years ago

Yeah, the expensive >>=.\.ks' is this one.

The implementation in protobuf decodes a message into a HashMap Tag [WireField], and then runs a second step to finish parsing the WireField values and set them in the message record. This is faster for messages with a relatively large number of fields, because HashMap has better insert performance than copying a record. But it's also going to keep the WireField values around for a while, so they're likely to get aged out of gen0 and be more expensive to collect.

An alternative implementation with much better cache locality and improved GC friendliness (most/all intermediate parses should be collected early in gen0) is to do the field conversion and setting during decode... encoding is done in a similar fashion already. The decoding implementation would start with something like this:

messageFields
  :: (Generic msg, GDecode (Rep msg))
  => proxy msg
  -> HashMap Tag (Setter' msg WireField)

Then just fold a default message over a stream of (Tag, WireField) and let the setters fill in the record. There would need to be some additional work for required fields, but otherwise it should be pretty straightforward.
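A toy version of that fold, self-contained so it runs standalone: Tag, WireField, and the setter table below are simplified stand-ins for the library's types, and the lens-style Setter' is replaced by a plain update function:

```haskell
import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HashMap
import Data.List (foldl')

type Tag = Int                      -- stand-in for the library's Tag
data WireField = VarintField Int    -- stand-in with a single constructor

data Msg = Msg { count :: Int, name :: String } deriving (Eq, Show)

-- The per-type table that messageFields would produce, written by hand:
setters :: HashMap Tag (WireField -> Msg -> Msg)
setters = HashMap.fromList
  [ (1, \(VarintField n) m -> m { count = n }) ]

-- Fold a default message over the decoded (Tag, WireField) stream,
-- skipping unknown tags; no intermediate HashMap Tag [WireField] is kept.
decodeWith :: Msg -> [(Tag, WireField)] -> Msg
decodeWith = foldl' step
  where
    step m (t, f) = maybe m (\set -> set f m) (HashMap.lookup t setters)
```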

NathanHowell commented 10 years ago

And what I alluded to a while back... to avoid allocating a WireField during parsing its constructors need to be lifted up a level:

data Decoders msg = Decoders
  { varintFieldDecoder    :: HashMap Tag (Setter' msg Word64)
  , fixed64FieldDecoder   :: HashMap Tag (Setter' msg Word64)
  , delimitedFieldDecoder :: HashMap Tag (Setter' msg ByteString)
    -- startFieldDecoder
    -- endFieldDecoder
  , fixed32FieldDecoder   :: HashMap Tag (Setter' msg Word32)
  }

NathanHowell commented 9 years ago

This should be fixed by #19; many thanks to @joshbohde.

christian-marie commented 9 years ago

Nice, thanks @joshbohde. I might try to re-run my test here and see how much it's improved: https://github.com/alphaHeavy/protobuf/issues/4#issuecomment-35944192