apple / swift-protobuf

Plugin and runtime library for using protobuf with Swift
Apache License 2.0

Current use of serializedDataSize() in serializedData() causes additional runtime linear in the depth of the proto hierarchy #713

Open MrMage opened 6 years ago

MrMage commented 6 years ago

SwiftProtobuf 1.0.2, all platforms

(Disclaimer: I know that Protobuf is not meant for large data structures, but I figured it wouldn't hurt to ask about this anyway.)

I have something similar to the following protobuf hierarchy:

message Wrapper {
    Wrapped wrapped = 1;
}

message Wrapped {
    repeated string entry = 1; // the actual structure is a bit more complex, but that shouldn't really make a difference
}

In my case, calling serializedData() on a message of type Wrapped takes about 0.75 seconds (it contains a lot of data), of which about 0.25 seconds go towards computing the value of wrapped.serializedDataSize().

Calling serializedData() on a message of type Wrapper wrapping the same Wrapped instance takes about 1 second, of which about 0.25 seconds go towards computing wrapper.serializedDataSize() at the highest (Wrapper) level and another 0.25 seconds towards wrapped.serializedDataSize() (the same as above). The serialized size of wrapper, on the other hand, is just 5 bytes more than that of wrapped. Each additional level would introduce another ~0.25 seconds of serializedDataSize() work, essentially computing the same size over and over again.
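To illustrate with a toy model (this is not SwiftProtobuf's actual code, just the shape of the problem):

// Toy model of length-delimited encoding. A submessage's length prefix
// requires its full size, so size(of:) recurses all the way to the leaves,
// and every enclosing wrapper level triggers that same recursion again.
indirect enum Node {
    case leaf(payloadBytes: Int)  // stands in for scalar/string fields
    case wrapper(Node)            // stands in for a single submessage field
}

func varintSize(_ value: UInt64) -> Int {
    return value == 0 ? 1 : (64 - value.leadingZeroBitCount + 6) / 7
}

func size(of node: Node) -> Int {
    switch node {
    case .leaf(let payloadBytes):
        return payloadBytes
    case .wrapper(let child):
        let childSize = size(of: child)  // full walk of the subtree
        return 1 /* tag */ + varintSize(UInt64(childSize)) + childSize
    }
}

Serializing a hierarchy of depth N recomputes the size of the innermost message N times, which matches the ~0.25 seconds of extra time per wrapper level.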

If I change Wrapper to

message Wrapper {
    bytes wrapped = 1;
}

and manually store a serialized representation of wrapped in there, this takes about 0.75 seconds to encode Wrapped (as before), plus a negligible amount of overhead for copying that serialized representation around (much less than ~0.25 seconds).
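In code, the workaround looks roughly like this (a sketch assuming an entries: [String] value; Wrapper.wrapped is now a bytes field, i.e. Data):

// Serialize the inner message exactly once, then store the bytes verbatim;
// the outer serializedData() just copies them instead of re-walking Wrapped.
let wrapped = Wrapped.with { $0.entry = entries }

var wrapper = Wrapper()
wrapper.wrapped = try wrapped.serializedData()
let data = try wrapper.serializedData()

This stays wire-compatible because a message field and a bytes field with the same field number are both length-delimited on the wire.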

This means that for cases in which serializedDataSize() is potentially costly, introducing extra hierarchy levels into my protobuf model can cause significant increases in computation time.

The C++ implementation avoids this problem by caching serialized sizes, as evidenced by the statement "Most of these are just simple wrappers around ByteSize() and SerializeWithCachedSizes()." on https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.message_lite#serialization.

Given that manipulating messages that are in the process of being serialized sounds like a bad idea anyway, I wonder if it would be possible to introduce similar caching into SwiftProtobuf.

This could be done by storing an extra cachedSize field in each message (see https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.message_lite#MessageLite.GetCachedSize.details). That would introduce an extra 8 bytes of overhead per message, though. (Edit: this might be problematic because writing the cache would mutate the struct. It might be workable by applying the nonmutating keyword, though.)
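For illustration, the nonmutating trick could look something like this (a sketch with made-up names; note that struct copies would share the box, so the cache would have to be invalidated whenever a copy is mutated):

// Sketch: keep the cached size in reference storage so the setter can be
// nonmutating even though the message itself is a struct.
final class _CachedSizeBox {
    var size: Int32 = -1  // -1 means "not yet computed"
}

struct ExampleMessage {
    private let _cachedSizeBox = _CachedSizeBox()

    var cachedSerializedSize: Int32 {
        get { return _cachedSizeBox.size }
        nonmutating set { _cachedSizeBox.size = newValue }
    }
}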

Alternatively, it should be possible to create a temporary (dictionary?) cache in serializedData() that is then passed to the recursive calls of serializedData() and serializedDataSize() on the children. As cache keys, it should be possible to use the pointers to the individual Message structs, as their memory locations are not going to change during serialization.
Option 2 for cache keys (possibly even more efficient) would be the pointers to the individual _StorageClass members, as caching the size is only relevant for protos that contain child protos (which seems to be the condition for such storage to exist). (Edit: option 2 might be problematic because useHeapStorage is only true for messages containing single message fields, not repeated ones.) A sketch of option 2 follows below.
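Sketch of option 2, with hypothetical names (this assumes the message is not mutated while it is being serialized, so the storage objects' addresses stay stable):

// Sketch: a per-serialization size cache keyed by the address of a message's
// heap storage. UnsafeRawPointer is Hashable, and hashing a pointer is cheap
// compared to hashing the message's contents.
final class SerializationSizeCache {
    private var sizes: [UnsafeRawPointer: Int] = [:]

    func size(forStorage storage: AnyObject, compute: () -> Int) -> Int {
        let key = UnsafeRawPointer(Unmanaged.passUnretained(storage).toOpaque())
        if let cached = sizes[key] {
            return cached
        }
        let computed = compute()
        sizes[key] = computed
        return computed
    }
}

The cache would be created in the top-level serializedData() call, passed down through the recursive size/serialize calls, and discarded afterwards.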

Either approach (caching the computed size in-line or in a temporary dictionary) would be an implementation detail and should not affect the library's consumers.

I hope this makes sense — would love to hear your thoughts on this.

thomasvl commented 6 years ago

We sorta knew the string ops could be expensive; string costs are also a problem in the parsing/generation of TextFormat and JSON. We've done some work on optimizing it, but with Swift revisiting the String API (and internals), we knew it would likely change, so until that work is done it didn't seem worth focusing on too much. Switching the field to bytes avoids all that string work, so the speedup there makes sense.

I'm pretty familiar with the trick the C++ implementation uses; it's something I've wondered about. It works by using atomic ops to get/set/clear the cached values to deal with threading issues. The cache is at the message level, so in your example writing out a Wrapped would always be expensive, but writing out a Wrapper does get some caching because the wrapped field's size gets cached.

On the Swift side we have some added complexity. Messages are value types; depending on the fields in the message, we sometimes use a class internally and provide our own copy-on-write semantics. When we have that reference type internally, it might be possible to stick a cached size in there, but for the messages that don't have it, we'd have to make the serialization APIs mutating to include the cache internally. It's worth mentioning that for the cases where we do have a reference type internally, we actually did some profiling and work to avoid that always being an allocation, as that showed up as something that slowed things down; always having cache storage could see the same performance impact due to the allocation. Using some side storage for the cache is possible, but the catch is the key: since messages are value types, what do we use? Turning the messages themselves into something that could be a key seems like it could end up being even more expensive.

The core of the problem is writing anything length-delimited in the binary format. Since the length is written in varint form, you need the length before you can actually write the value, so anything where the length has to be computed ends up being more expensive: strings and message fields. There is another way to deal with this problem: serialize in reverse order, so you don't need the length until after you have written out the data itself. But that can lead to lots of reallocs unless you have a buffer class that can use multiple blocks of memory exposed as if they were contiguous. It also has a second issue: if you are writing backwards, you can never put a streaming interface on serialization, i.e. write directly to a file/network/etc. and avoid having to build the full blob in memory. Right now the library doesn't have any streaming interfaces, but that's something we said we'd likely revisit after 1.0 was done, so we've been hesitant to invest in an optimization like this since it wouldn't be usable for streaming.
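To sketch the reverse-order idea (simplified, not library code):

// Sketch of reverse-order encoding: bytes are appended in reverse, and one
// final reversal yields the wire format. A submessage's length is known by
// the time its body is finished, so the length prefix can be "prepended"
// without ever re-walking the body.
struct ReverseWriter {
    private(set) var storage: [UInt8] = []  // wire bytes, in reverse order

    // Logically prepends a varint. Logical byte order is least-significant
    // group first with continuation bits on all but the last group; since
    // storage is reversed, the groups are appended back to front.
    mutating func prependVarint(_ value: UInt64) {
        var groups: [UInt8] = []
        var v = value
        repeat {
            groups.append(UInt8(v & 0x7F))
            v >>= 7
        } while v != 0
        for i in stride(from: groups.count - 1, through: 0, by: -1) {
            storage.append(i == groups.count - 1 ? groups[i] : groups[i] | 0x80)
        }
    }

    // Called once a submessage body of bodySize bytes has been written.
    mutating func closeLengthDelimited(tag: UInt64, bodySize: Int) {
        prependVarint(UInt64(bodySize))
        prependVarint(tag)
    }

    var wireBytes: [UInt8] { return Array(storage.reversed()) }
}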

Anyway, yes, this is something we have put some thought into, but we don't have a solution yet. And as mentioned, unlike the C++ cache that is per message, Swift might benefit from something different because of the costs around computing the actual data to write out for a string.

MrMage commented 6 years ago

Thank you for the quick reply! (Fellow (Ex-)Googler from the Munich office here :-) )

First, let me clarify that this is only about binary serialization; I expect plain-text or JSON serialization to always be inefficient compared to binary.

I'm pretty familiar with the trick the C++ implementation uses; it's something I've wondered about. It works by using atomic ops to get/set/clear the cached values to deal with threading issues. The cache is at the message level, so in your example writing out a Wrapped would always be expensive, but writing out a Wrapper does get some caching because the wrapped field's size gets cached.

As far as I can tell, even making those operations atomic wouldn't be sufficient for thread-safety — modifying a message on one thread while it is being serialized on another probably always leads to problems.

Given that, we can probably assume the message to not be mutated (i.e. "frozen") while we are serializing it. In that case, using the pointers to the individual sub-messages should work — they are globally unique (obviously), and they won't change during serialization (see above). Also, if we are using pointers, hashing performance should not become a problem, given that we don't need to hash the entire message's contents. (In fact, we can use the pointer as both the key and the key's hash.)

So if I'm not missing something here, having a local and temporary (to this particular serialization operation) size cache should work.

Let me know if I'm missing something here and/or whether you would be open to exploring this, in which case I might try my luck with a proof-of-concept (of course, no expectations as to whether that will actually be merged).

allevato commented 6 years ago

In that case, using the pointers to the individual sub-messages should work — they are globally unique (obviously)

Our messages are value types; Swift doesn't let you get long-lived pointers to them. In fact, since value types in Swift by definition don't have an identity, there's no unique "thing" we can get that would act as a reference to any specific message.

MrMage commented 6 years ago

I know, but we do not need long-lived pointers. We just need them to stay valid during serialization. If we assume that the message is not being mutated during serialization (which is already implicitly assumed by the current implementation), those pointers should not change during serialization. Or am I missing something here?

allevato commented 6 years ago

By "long-lived", I mean outside of a capturing closure. So the logic to cache and serialize each submessage would need to take place inside of a block passed to unsafeAddressOf(&msg) { ... }. Since unsafeAddressOf requires an inout argument, the message must therefore also be mutable, which means we'd have to declare a copy of it:

var copy = msg
withUnsafePointer(to: &copy) { ptr in
  // ptr is only valid inside this closure; cache and serialize here
}

So you're basically incurring repeated copies for each nesting level that you have messages, because you have to copy from the top down.

MrMage commented 6 years ago

Ouch, you are right! I thought there was a way to obtain a pointer to any struct that would stay valid at least as long as the struct was not touched (without any guarantees beyond that), similar to C++. Thank you for the elaboration. For now, I will resort to building and serializing a binary-compatible

message Wrapper_Serialization {
    bytes wrapped = 1;
}

message for serializing. (The deserialization case is unaffected anyway.)

tbkka commented 6 years ago

Binary protobuf encoding has an intrinsic O(n^2) factor in the depth of message nesting because of the need to know the size of each message before it's written.

The C++ implementation takes advantage of reference semantics to cache these sizes as they're computed. Our current implementation uses inline structs for certain small messages, which makes this more complicated. We could force the generator to build storage classes everywhere, which would make small messages slower to handle by themselves but would simplify this kind of optimization; or we could possibly add the cached size value to the generated inline structs.

But there might be other approaches we could explore as well:

If binary serialization was done depth-first, we would end up copying messages instead of recomputing sizes. This is essentially the same as @MrMage's workaround. (I haven't thought this through carefully, though. Literal copying could be even worse in some cases; avoiding the copies would require segmented storage.)

One reason we need to know the size up front is that the size is stored as a varint. If we just allocated the maximum length for that varint (I think 5 bytes is the max), then we could recursively encode everything and then go back and fill in the sizes. This would waste a few bytes for each submessage but avoid the need to accurately size them.

Of course, we've talked a lot about streaming encode: the most straightforward way to do that absolutely requires you to know the exact size of each message before you write it out.

thomasvl commented 6 years ago

From what was said about the timings using string vs. bytes, I tend to think overestimating might not really help, as we'd still be doing the string conversions twice, and that sounds like the most expensive part for their messages.

tbkka commented 6 years ago

I think you can bound the UTF-8 size as a multiple of the character count without actually doing any conversion. That would likely suffice for a fast overestimate.
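For example (just a sketch of one such bound):

// Cheap upper bound on the UTF-8 size: BMP characters (1 UTF-16 code unit)
// take at most 3 UTF-8 bytes, and surrogate pairs (2 units) take 4, i.e.
// 2 per unit. Counting code units avoids doing the actual transcoding.
func utf8SizeUpperBound(_ s: String) -> Int {
    return s.utf16.count * 3
}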

thomasvl commented 6 years ago

I think you can bound the UTF-8 size as a multiple of the character count without actually doing any conversion. That would likely suffice for a fast overestimate.

But for a string in a message in a message, you still need the real size of the inner message when you write the bytes. The overestimate approach is only good for oversizing the buffer; when writing the bytes, you need the real sizes, and that's still the recursive double-eval problem. The inner message is written length-delimited, so it needs to know its size, which means knowing the real size of the string. No?

tbkka commented 6 years ago

... when writing the bytes, you need the real sizes, and that problem is still the recursive double eval ...

Yes, you need both the bytes and the real size of those bytes. But you don't need the size before you generate the bytes. You can reserve space for the size (which only requires a bound on the real size), build the bytes, and then back-patch the exact size after you've built the bytes (at which point you do know the exact size).
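Sketched out (not library code; this relies on parsers accepting non-canonical, zero-padded varints, so a fixed 5-byte slot is valid for any length that fits in 32 bits):

var buffer: [UInt8] = []

// Begin a length-delimited field: reserve a fixed 5-byte slot for the
// length varint and remember where it is.
func reserveLengthSlot() -> Int {
    let slot = buffer.count
    buffer.append(contentsOf: repeatElement(0, count: 5))
    return slot
}

// After the body has been appended to the buffer, back-patch the slot with
// a padded varint: continuation bit set on every byte except the last.
func patchLength(slot: Int) {
    var length = UInt32(buffer.count - (slot + 5))
    for i in 0..<5 {
        buffer[slot + i] = UInt8(length & 0x7F) | (i < 4 ? 0x80 : 0)
        length >>= 7
    }
}

// Usage:
//   let slot = reserveLengthSlot()
//   ... append the submessage body to buffer ...
//   patchLength(slot: slot)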

MrMage commented 6 years ago

We could force the generator to build storage classes everywhere which would make small messages slower to handle by themselves but simplify this kind of optimization

Another option would be to add this optimization for structs with a class-based storage to the runtime, and maybe add a protobuf compiler option to force generating classes for all structs? Then each user could benchmark and decide for themselves whether that would be worth it for them, and they would automatically get the benefits for protos that already use class-based storage in the current implementation. I think this might be something worth looking into.

That being said, in this case Swift's inability to provide stable struct pointers is definitely a downside.

or possibly add the cached size value to the generated inline structs.

That would also be an option. The cached size could be stored as a four-byte integer: larger sizes are not supported by protobuf anyway, and if we did encounter them, we could simply not cache sizes larger than MAX_INT by convention. In fact, a two-byte cached size might be sufficient, given that this problem is mostly relevant for many small individual messages (at least in my case, the outer proto contains ~800k inner ones). Alignment might nullify these space savings, though.
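To illustrate the alignment point (a quick check on a 64-bit platform; the structs are stand-ins, not the generated types):

// A pointer-aligned struct pays 8 bytes of stride for even a 2-byte addition.
struct WithoutCache { var storage: UnsafeRawPointer }
struct WithCache { var storage: UnsafeRawPointer; var cachedSize: Int16 }

print(MemoryLayout<WithoutCache>.stride)  // 8
print(MemoryLayout<WithCache>.stride)     // 16 (10 bytes of content, padded)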

If binary serialization was done depth-first, we would end up copying messages instead of recomputing sizes. This is essentially the same as @MrMage's workaround. (I haven't thought this through carefully, though. Literal copying could be even worse in some cases; avoiding the copies would require segmented storage.)

I can imagine that copying around all these serialized bytes could still be faster than re-computing the size over and over again (at least in my workaround, the copy adds ~20 ms of overhead compared to ~250 ms for computing the size). This could cause allocation contention and heap fragmentation, though, especially when serializing many small messages. Also, for a proto hierarchy of depth N, the leaves would essentially still get copied N times, which would probably be costly again.

One reason we need to know the size up front is that the size is stored as a varint. If we just allocated the maximum length for that varint (I think 5 bytes is the max), then we could recursively encode everything and then go back and fill in the sizes. This would waste a few bytes for each submessage but avoid the need to accurately size them.

But don't we also need the size to allocate a byte buffer of the proper size for the serialized data? (It might be possible to avoid this with an auto-growing buffer, though.)

MrMage commented 6 years ago

Following up on this — is there any interest in at least a partial optimization for structs with class-based storage? If so, I might look into trying to implement that.

tbkka commented 6 years ago

@MrMage: If you have time to spend on it, please give it a shot. I'd be interested to see how it works out.

MrMage commented 6 years ago

@tbkka: I did some experiments over the weekend. You can view my (experimental) modifications and corresponding branch here:

https://github.com/MrMage/swift-protobuf/commit/86eac0027f1ad31142d211cba1a72268484c7cd4 https://github.com/MrMage/swift-protobuf/tree/optimize-binary-serialization

I've also uploaded the sample code I used for benchmarking to https://github.com/MrMage/swift-protobuf-bench.

Unfortunately, the results are disappointing so far:

1) without caching
serialize without wrapper: 0.567915397999968 s
serialize with wrapper: 0.758822700999644 s
serialize with fast wrapper: 0.58575516799965 s
deserialize wrapped: 0.37208873399959 s

2) with caching for messages with storage classes
serialize without wrapper: 1.18878716799918 s
serialize with wrapper: 1.09215202099949 s
serialize with fast wrapper: 1.06510823499957 s
deserialize wrapped: 0.426660219000041 s

3) with caching, storage classes enabled for all messages
serialize without wrapper: 1.00786826999956 s
serialize with wrapper: 0.982758261000527 s
serialize with fast wrapper: 1.01990470100009 s
deserialize wrapped: 0.416009259999555 s

(No idea why "serialize with wrapper" is faster than "serialize without wrapper" in case 2. Also, I think I had actually enabled storage classes for all messages in case 2 already, but the results are bad in either case.)

As you can see, when storage classes are enabled for all messages, the caching does save some serialization time for the wrapped proto, but unfortunately the size-cache lookups appear to be slower than just computing the sizes on the fly, at least for small messages. Any ideas on how this could be optimized further would be much appreciated. Otherwise, the only viable alternative I see would be to cache sizes inline in the message, possibly enabled by a Swift-specific protoc compiler option.

By the way, the tests with storage classes enabled for all messages were done by replacing

let useHeapStorage = isAnyMessage || descriptor.fields.count > 16 || hasSingleMessageField(descriptor: descriptor)

with

let useHeapStorage = true || isAnyMessage || descriptor.fields.count > 16 || hasSingleMessageField(descriptor: descriptor)

in the MessageGenerator class.

thomasvl commented 6 years ago

Try putting some timing within the library. I still think the slow part is likely to be the operations on the string itself, as that's the major change between the bytes and string messages. If the majority of the time is spent within String, then caches are only going to do so much for us.
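Something like this dropped into the suspected hot spots would answer that (a debug-only sketch; names are made up):

import Foundation

// Accumulates the time spent in one suspected hot path, e.g. the UTF-8
// conversion of string fields.
var stringEncodingTime: TimeInterval = 0

func timed<T>(into accumulator: inout TimeInterval, _ body: () throws -> T) rethrows -> T {
    let start = Date()
    defer { accumulator += Date().timeIntervalSince(start) }
    return try body()
}

// Inside the library, wrap the suspected code:
//   let utf8Bytes = timed(into: &stringEncodingTime) { Array(value.utf8) }
// and print stringEncodingTime after a benchmark run.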

tbkka commented 6 years ago

@thomasvl has a good point: String fields are going to be the slowest for size measurement, so the cache is likely to have the biggest impact there. Try adding some string fields to your benchmark proto and let us know if that affects things any.

seivan commented 6 years ago

Could improvements be made by using SIMD (or something that also works on Linux) for some of the arithmetic, to gain some optimizations there instead? I'm also having an issue with things being a bit slow.

MrMage commented 6 years ago

That's not the bottleneck; each individual size computation is fast enough. The problem is duplicating the work at each level.

@tbkka FYI, my benchmark proto already consisted of ~50% strings. The best approach might be to add a 4-byte size-cache field similar to the C++ implementation, but that would of course have significant side effects; for example, we'd need to document that only one size calculation should run at a time, and it would also increase the amount of memory consumed by each message.