Closed istathar closed 9 years ago
I've seen this a bit in the past... my recollection is that it's from code inlined from `cereal`, and `getVarintPrefixedBS` doesn't do much else. I think the alternatives would be to fix `cereal`, switch to `binary`, or use kmett's package that abstracts the two.
The suspect function is here if you want to check it out, I'm also open to other suggestions. Are you decoding a lot of strings?
Yeah I am. The protobufs are (`pretty . show . contrived`) something like:

```
datacenter:lhr1, hostname:secure.example.org, metric:eth0-tx-bytes 1390810279171907000 15301
```
The numbers are just a few bytes, varint or fixed64 encoded, but the key-value pairs are about a hundred bytes and, yeah, string heavy.
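For reference, the base-128 varint encoding mentioned here can be sketched as follows (illustrative only, not the encoder that cereal or protobuf actually uses):

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Word (Word8, Word64)

-- Protobuf varints: 7 payload bits per byte, least significant group first;
-- the high bit of each byte is set when more bytes follow.
encodeVarint :: Word64 -> [Word8]
encodeVarint n
  | n < 0x80  = [fromIntegral n]  -- high bit clear: final byte
  | otherwise = (fromIntegral (n .&. 0x7f) .|. 0x80)  -- high bit set: more to come
              : encodeVarint (n `shiftR` 7)

main :: IO ()
main = print (encodeVarint 300)  -- 300 encodes as [0xAC, 0x02], i.e. [172,2]
```

Small numbers stay small (one byte below 128), which is why the numeric parts of these messages are only a few bytes on the wire.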
AfC
How much of an impact is this having on your app? I'd like to get this sorted out but I'm short of time at the moment.
1,000 data points 0.23 seconds; 10,000 data points 9.6 seconds.
Ouch. Profiling shows it to be about 75% of time and 92% of the allocation burden.
```
                                                       individual    inherited
COST CENTRE     MODULE              no.    entries   %time %alloc  %time %alloc
fmap            Data.Serialize.Get  301      10000     0.0    0.0   75.4   92.5
fmap.\          Data.Serialize.Get  302      40002     0.0    0.0   75.4   92.5
getWord8        Data.Serialize.Get  312          0     0.0    0.0   75.4   92.4
>>=             Data.Serialize.Get  313      50001     0.0    0.0   75.4   92.4
>>=.\           Data.Serialize.Get  314      70002     0.1    0.0   75.4   92.4
demandInput.\   Data.Serialize.Get  322          1     0.0    0.0    0.0    0.0
mplus           Data.Serialize.Get  323          0     0.0    0.0    0.0    0.0
mplus.\         Data.Serialize.Get  324          0     0.0    0.0    0.0    0.0
mplus.\.kf'     Data.Serialize.Get  325          1     0.0    0.0    0.0    0.0
unGet           Data.Serialize.Get  326          1     0.0    0.0    0.0    0.0
>>=.\.ks'       Data.Serialize.Get  316     130000    75.2   92.3   75.3   92.4
unGet           Data.Serialize.Get  318     130000     0.0    0.0    0.0    0.0
return          Data.Serialize.Get  317      70001     0.0    0.0    0.1    0.1
return.\        Data.Serialize.Get  319      70000     0.0    0.0    0.1    0.1
fmap.\.ks'      Data.Serialize.Get  320      40000     0.1    0.1    0.1    0.1
unGet           Data.Serialize.Get  315      70003     0.0    0.0    0.0    0.0
```
So it's on the `Get` path. cereal has a Builder that it employs for building ByteStrings on the encode path, but it seems to be doing something horrible on the decode path. We're looking into it further now. Offhand I can't see anything stupid that protobuf is doing, but the heap profile shows a steady leak overall, especially in PINNED. That's a bit of a surprise.
AfC
I've taken a look at this; it appears that cereal is simply slow due to its continuation-passing nature and the fact that each bind allocates a new continuation.
This problem is easy to replicate with any repeated field.
Here is an example that shows the issue:
```haskell
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.ProtocolBuffers
import GHC.Generics (Generic)
import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import Data.TypeLevel (D1)
import Data.Serialize

data Burst = Burst
  { frames :: Repeated D1 (Value ByteString)
  } deriving (Generic, Eq, Show)

instance Encode Burst
instance Decode Burst

encodeBurst :: [ByteString] -> ByteString
encodeBurst = runPut . encodeMessage . Burst . putField

decodeBurst :: ByteString -> Either String Burst
decodeBurst = runGet decodeMessage

main :: IO ()
main = do
  let encoded = encodeBurst $ replicate 30000 "hai"
  print $ B.length encoded
  let decoded = decodeBurst encoded
  either error (const $ return ()) decoded
```
Disregard that hand-wavy statement. I've found the real issue and will provide free performance in a pull request shortly; feel free to implement the fix however you like. The issue is with list traversal on insertion into the hashmap.
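For background, here is a contrived illustration of the general failure mode being described (my sketch, not the actual patch; `Data.Map` stands in for the `HashMap` the library uses, so the snippet is self-contained):

```haskell
import qualified Data.Map.Strict as Map
import Data.Map.Strict (Map)
import Data.List (foldl')

-- Appending traverses the entire accumulated list on every insert,
-- so n occurrences of one repeated field cost O(n^2) overall.
insertAppend :: Ord k => k -> v -> Map k [v] -> Map k [v]
insertAppend k v = Map.insertWith (\new old -> old ++ new) k [v]

-- Prepending is O(1) per insert; reverse once at the end if order matters.
insertPrepend :: Ord k => k -> v -> Map k [v] -> Map k [v]
insertPrepend k v = Map.insertWith (\new old -> new ++ old) k [v]

main :: IO ()
main = do
  let m = foldl' (\acc v -> insertPrepend (1 :: Int) v acc) Map.empty [1 .. 5 :: Int]
  print (fmap reverse (Map.lookup 1 m))  -- Just [1,2,3,4,5]
```

With 30,000 occurrences of a single repeated field, the difference between the two strategies is dramatic.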
Let me know if this fixes your perf issue. I'll cut a new release if it does.
@NathanHowell definite improvement. We're down to 1.6 seconds from 9.6! @christian-marie is definitely on the right track. That same part of cereal is still costing about 24% of time, though. Makes me wonder if there's another thunk leak happening somewhere.
Definitely possible; the issue I've seen was from `Value Text` fields IIRC, not `Repeated x`. If it's not from `cereal` it's probably excess laziness. I'm going to leave this issue open for now, but I did cut a new release (v0.1.3, 8b2333f35cf18e814fc1be06c3dc087af64f2dfb) with @christian-marie's patches.
I'm going to give hs-packer a go as soon as we can get the missing features into it:
https://github.com/vincenthz/hs-packer/pull/2 https://github.com/vincenthz/hs-packer/pull/3
I've heard there's a big rewrite coming for binary... and it also has enough features (as of 0.6.x, IIRC) to work for protobuf. I think it's usually faster than cereal. If hs-packer works out and is fast, I'm certainly open to taking it on.
I'm putting hs-packer on the back-burner until someone can work this out:
https://github.com/vincenthz/hs-packer/issues/5
I did implement the decode path, though that wasn't an order of magnitude faster. You can see that here:
be0a7a25d361e80b833385482b1f177292a2c65c
I believe that allocating thousands of Haskell objects for the elements of a repeated field might be the bottleneck here, not the serialization library.
Because of the `HashMap Tag [WireField]` in the intermediate decoding step?
I'm not entirely sure. I suspect so though.
Running a large decode with +RTS -hy shows a lot of WireField, HashMap and ByteString allocation.
My bet is that if you can think of a faster way to do that, you'll probably nail the performance issues.
The other obvious way is to use a static (per-type) HashMap of setters instead of a HashMap of WireFields. I've thought about doing this before; it may still be necessary to batch repeated-field updates to minimize record copies (and GC time). I was originally thinking it would be faster to use a HashMap because of this, but I haven't tried it. I can hack up an implementation later in the week unless you want to take a stab at it.
A simple test, decoding 2.9MB of repeated four-byte strings: http://ponies.io/raw/protobuf-hy.pdf
```haskell
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.ProtocolBuffers
import GHC.Generics (Generic)
import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import Data.TypeLevel (D1)
import Data.Serialize

data Burst = Burst
  { frames :: Repeated D1 (Value ByteString)
  } deriving (Generic, Eq, Show)

instance Encode Burst
instance Decode Burst

encodeBurst :: [ByteString] -> ByteString
encodeBurst = runPut . encodeMessage . Burst . putField

decodeBurst :: ByteString -> Either String Burst
decodeBurst = runGet decodeMessage

main :: IO ()
main = do
  Right m <- decodeBurst `fmap` B.readFile "encoded"
  print m
```
I'll note that the HashMap allocations are missing in that graph, probably because the only thing growing in this test case is one key (a list).
Also interestingly: ARR_WORDS, which I am pretty sure is the underlying bytestring data buffer (2.9MB worth) stays pretty constant. So, that's good. I don't think there's much we can do about the ByteString container allocations unless we're doing more than we need to.
Which only leaves WireField and [] containers to be optimized.
As for taking a stab at the alternative implementation you mention: I'd be happy to, but I won't be able to prioritize this until next week either. Maybe we can poke each other again then. Otherwise, if you can describe the implementation in a little more detail, I'll probably get around to trying it in my own time out of curiosity.
Hi Nathan, poke.
This is bugging us again, we'd like it to be faster. Any suggestions on where our efforts should be directed to cut down on these allocations next?
Is the concern allocation or runtime? Reading 500k `ByteString`s into a list has a similar heap profile, with the exception of `WireField.DelimitedField`.
```haskell
quickDecode :: Get [ByteString]
quickDecode = go [] (512 * 1024) where
  go xs 0 = return (reverse xs)
  go xs i = do
    x <- getBytes 4
    go (x:xs) (i - 1)

alternativeDecode :: Get [ByteString]
alternativeDecode = reverse <$> go [] where
  go xs = do
    mx <- Just <$> getBytes 4 <|> pure Nothing
    case mx of
      Just x  -> go (x:xs)
      Nothing -> pure xs
```
Our concern is specifically the speed of decoding many delimited fields, and decoding many small messages in series.
We are seeing this kind of thing currently:
```
COST CENTRE   MODULE              %time %alloc
>>=.\.ks'     Data.Serialize.Get   50.8   58.3
>>=.\         Data.Serialize.Get   15.8   18.4
fmap.\.ks'    Data.Serialize.Get    7.7   15.1
```
where, I think, `ks'` is the success continuation of bind, i.e. `mfield <- getWireField`.
I'm thinking that reducing allocations by boxing things as little as possible will decrease the runtime. Our bottom line is reducing CPU usage and runtime; memory is cheap as far as we're concerned, but allocations are expensive.
The motivation behind this is that we are currently bottlenecked almost completely on decoding protobufs.
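On the "box as little as possible" point, the usual GHC levers are strict `{-# UNPACK #-}`ed fields and strict accumulators. A generic illustration (not code from protobuf or cereal):

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.Word (Word64)

-- A lazy field holds a pointer to a separately allocated boxed Word64.
data Boxed = Boxed Word64

-- A strict, unpacked field stores the Word64 payload inline in the
-- constructor: one heap object instead of two, and better cache locality.
data Unboxed = Unboxed {-# UNPACK #-} !Word64

-- A strict left fold keeps the accumulator evaluated instead of building
-- a chain of thunks, which is where a lot of hidden allocation comes from.
sumStrict :: [Word64] -> Word64
sumStrict = go 0 where
  go !acc []     = acc
  go !acc (x:xs) = go (acc + x) xs

main :: IO ()
main = print (sumStrict [1 .. 100])  -- 5050
```

With `-O`, GHC's strictness analysis often achieves the same effect automatically, but annotating hot data types makes it reliable.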
Yeah, the expensive `>>=.\.ks'` is this one.
The implementation in protobuf decodes a message into a `HashMap Tag [WireField]`, and then runs a second step to finish parsing the `WireField` values and set them in the message record. This is faster for messages with a relatively large number of fields, because `HashMap` has better insert performance than copying a record. But it also keeps the `WireField` values around for a while, so they're likely to get aged out of gen0 and be more expensive to collect.
An alternative implementation with much better cache locality and improved GC friendliness (most/all intermediate parses should be collected early in gen0) is to do the field conversion and setting during decode... encoding is done in a similar fashion already. The decoding implementation would start with something like this:
```haskell
messageFields
  :: (Generic msg, GDecode (Rep msg))
  => proxy msg
  -> HashMap Tag (Setter' msg WireField)
```
Then just fold a default message over a stream of `(Tag, WireField)` and let the setters fill in the record. There would need to be some additional work for required fields, but otherwise it should be pretty straightforward.
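A hypothetical sketch of that fold, with simplified stand-ins for `Tag`, `WireField` and the setter type (plain update functions and `Data.Map` instead of lens setters and `HashMap`, so the snippet is self-contained); this is my guess at the shape, not the library's actual API:

```haskell
import qualified Data.Map.Strict as Map
import Data.Map.Strict (Map)
import Data.List (foldl')

type Tag = Int

-- Simplified stand-in for the library's WireField type.
data WireField = VarintField Integer | DelimitedField String

-- A setter is just an update function here, standing in for lens's Setter'.
type Setter msg = WireField -> msg -> msg

-- Fold a stream of decoded fields into a default message, letting the
-- per-tag setters fill in the record; unknown tags are skipped.
applyFields :: Map Tag (Setter msg) -> msg -> [(Tag, WireField)] -> msg
applyFields setters = foldl' step where
  step msg (tag, wf) =
    case Map.lookup tag setters of
      Just set -> set wf msg  -- known field: update the record
      Nothing  -> msg         -- unknown field: ignore it

-- Toy message with two varint fields, tags 1 and 2.
data Point = Point { px :: Integer, py :: Integer } deriving (Eq, Show)

pointSetters :: Map Tag (Setter Point)
pointSetters = Map.fromList
  [ (1, \wf p -> case wf of VarintField n -> p { px = n }; _ -> p)
  , (2, \wf p -> case wf of VarintField n -> p { py = n }; _ -> p) ]

main :: IO ()
main = print (applyFields pointSetters (Point 0 0)
                [(1, VarintField 3), (2, VarintField 4)])
```

The intermediate `WireField` values here die young, which is the GC-friendliness argument made above.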
And what I alluded to a while back: to avoid allocating a `WireField` during parsing, its constructors need to be lifted up a level:

```haskell
data Decoders msg = Decoders
  { varintFieldDecoder    :: HashMap Tag (Setter' msg Word64)
  , fixed64FieldDecoder   :: HashMap Tag (Setter' msg Word64)
  , delimitedFieldDecoder :: HashMap Tag (Setter' msg ByteString)
  -- startFieldDecoder
  -- endFieldDecoder
  , fixed32FieldDecoder   :: HashMap Tag (Setter' msg Word32)
  }
```
This should be fixed by #19; many thanks to @joshbohde.
Nice, thanks @joshbohde. I might try to re-run my test here and see how much it's improved: https://github.com/alphaHeavy/protobuf/issues/4#issuecomment-35944192
I'm not sure if this rings any bells, but we've finally reached the point of doing some profiling on our application and there's one standout:
`getVarintPrefixedBS` is doing the bulk of the allocation (fine), but it's taking an awfully long time to do it. I thought I'd ask and see if there's anything obvious you know about there that we should have a look at.
AfC