archagon opened this issue 6 years ago
To clarify, the array of messages will be used as the backing store for a string CRDT, so every byte and every millisecond counts.
There currently isn't a way to avoid the `_StorageClass`.
Even if there was, wouldn't `Array` still be likely to do a heap allocation under the hood? i.e., it can't put every sized array on the stack, and surely not ones with 100,000+ elements.
Internally, SwiftProtobuf can generate structs with or without a backing StorageClass. The decision for when to generate a StorageClass is somewhat arbitrary at the moment; we'd appreciate input into how to make this better, especially if that input is based on reproducible performance measurements.
There are two motivations for the StorageClass design:
Unknown field handling is another complication. At one time, proto3 did not preserve unknown fields (in Google's C++ implementation), but this was changed. Some of the implementations have an option to suppress unknown field handling, and we could consider adding such an option.
> Even if there was, wouldn't `Array` still be likely to do a heap allocation under the hood? i.e., it can't put every sized array on the stack, and surely not ones with 100,000+ elements.
Yes, but only one allocation — not 100,001!
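To illustrate the point (a sketch with made-up types, not SwiftProtobuf's generated code): an `Array` of value-type structs keeps its elements inline in a single heap buffer, while an array of class instances pays for one pointer buffer plus one allocation per element.

```swift
// FlatAtom and BoxedAtom are hypothetical stand-ins for generated types.
struct FlatAtom { var luid: UInt32; var lamport: UInt32 }
final class BoxedAtom { var luid: UInt32 = 0; var lamport: UInt32 = 0 }

// One allocation: a single contiguous buffer of 100_000 * stride bytes.
let flat = [FlatAtom](repeating: FlatAtom(luid: 0, lamport: 0), count: 100_000)
flat.withUnsafeBufferPointer { buf in
  print(buf.count * MemoryLayout<FlatAtom>.stride)  // total bytes in the one buffer
}

// 100_001 allocations: the pointer buffer plus one object per element.
let boxed = (0..<100_000).map { _ in BoxedAtom() }
_ = boxed

print(MemoryLayout<FlatAtom>.stride)  // 8: two UInt32s packed inline
```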
> Internally, SwiftProtobuf can generate structs with or without a backing `StorageClass`. The decision for when to generate a `StorageClass` is somewhat arbitrary at the moment; we'd appreciate input into how to make this better, especially if that input is based on reproducible performance measurements.
I've been doing some (very, very unscientific) benchmarking for my particular use case. My messages are laid out as follows, and I've also hand-written a native struct with the same spec:
```proto
message Atom {
  required ID id = 1;
  required ID cause = 2;
  oneof value {
    uint32 insert = 3;
    bool delete = 4;
  }
}

message ID {
  required uint32 luid = 1;
  required uint32 lamport = 2;
  required uint32 index = 3;
}
```
With a million-element array of each, proto takes up 5.7x more space in memory than native (153 MB vs. 27 MB). Compared to native, it is:

- 12.5x slower doing the equivalent of calloc (275 ms vs. 22 ms)
- 3.4x slower doing a straight copy (67 ms vs. 20 ms)
- 10x slower with random access (3 ms vs. 0.3 ms)
- about the same with random insertions (1 ms vs. 1.2 ms)

On the other hand, encoding and decoding are quite quick and produce data of good size compared to the simple binary coder I'm using. It's certainly usable, but not ideal. That size difference in particular is very troubling.
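For reference, a micro-benchmark of this kind could be gathered with a harness roughly like the following (a hypothetical sketch, not the author's actual test code; `NativeID` stands in for the hand-written struct):

```swift
import Foundation

// Run a closure and report wall-clock milliseconds.
func measureMS(_ label: String, _ body: () -> Void) -> Double {
  let start = Date()
  body()
  let ms = Date().timeIntervalSince(start) * 1000
  print("\(label): \(ms) ms")
  return ms
}

// Stand-in for the hand-written native struct from the spec above.
struct NativeID { var luid: UInt32; var lamport: UInt32; var index: UInt32 }

var atoms: [NativeID] = []
_ = measureMS("alloc 1M") {
  atoms = [NativeID](repeating: NativeID(luid: 0, lamport: 0, index: 0),
                     count: 1_000_000)
}
_ = measureMS("copy 1M") {
  var copy = atoms
  copy[0].luid = 1  // mutate so copy-on-write actually copies the buffer
  _ = copy
}
```

Caveats like cache state between runs (which the thread mentions later) apply to any harness this simple.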
I tried manually editing the generated Swift file to remove `_StorageClass` from the generated structs. The resulting code seems to work fine but, strangely, is slower in almost every respect than the heap-allocated code! I need to dig into this more; it doesn't make much sense to me.
Glad to hear that there's nothing intrinsic about `StorageClass` w.r.t. message coding. This means that, at the very least, a flag could be added to make everything a struct if the user knows what they're doing. Same with unknown field handling, if that's indeed what's causing the size to balloon so much. (Could the `data` in `UnknownStorage` be made a class?)
OK, I've made a test build of protoc-gen-swift with `_StorageClass` disabled, and also turned `UnknownStorage` into a stub class. Now performance mostly correlates with struct size: the new proto `Atom` structs are about 2x the size of my native structs, and performance is around 2x worse than native in every comparison. Re-enabling `_StorageClass` while keeping `UnknownStorage` disabled makes some things a lot slower than native (calloc) and some things surprisingly faster (straight copy, random insertion), though maybe I'm not doing a good enough job of clobbering the cache between tests, or of making sure that all the data is actually copied.

Heap or not, I really think something needs to be done about that `UnknownStorage` memory.
(FWIW, the reason for the remaining 2x size difference is that the proto structs get poorly packed on account of the optionals: three `UInt32`s take up 12 bytes, while three `UInt32?`s take up 21 bytes. I wonder if the `required` field rule could be used to optimize this? Manually removing the optionals in `ID` seems to have no ill effects. EDIT: Figured out that proto3 does not require the optionals. Now everything is quite compact.)
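The packing difference can be checked directly with `MemoryLayout` (struct names here are illustrative): each `Optional<UInt32>` needs an extra tag byte, and alignment padding then spreads the fields apart.

```swift
struct PackedID { var luid: UInt32; var lamport: UInt32; var index: UInt32 }
struct OptionalID { var luid: UInt32?; var lamport: UInt32?; var index: UInt32? }

print(MemoryLayout<UInt32>.size)      // 4
print(MemoryLayout<UInt32?>.size)     // 5: value plus a nil-tag byte
print(MemoryLayout<PackedID>.size)    // 12: three 4-byte fields, no padding
print(MemoryLayout<OptionalID>.size)  // 21: fields land at offsets 0, 8, 16
```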
Oops, yeah, my copy tests weren't actually copying anything on the `_StorageClass` side due to the `_uniqueStorage` trick. With the test corrected and using the current protobuf code, straight copies are 20x (!!!) slower than native (401 ms vs. 22 ms); with `_StorageClass` disabled and `UnknownStorage` nerfed, only 2x slower than native. (And naturally, the heap penalty will compound with additional nested messages.)
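For readers unfamiliar with the `_uniqueStorage` trick: it is plain copy-on-write, so assigning a message only copies a reference, and the real deep copy happens on the first mutation, which is why an assignment-only "copy" test measures almost nothing. A minimal sketch of the pattern (hypothetical types, not SwiftProtobuf's generated code):

```swift
final class Storage {
  var values: [UInt32]
  init(values: [UInt32]) { self.values = values }
}

struct Message {
  private var storage = Storage(values: [])

  init() {}

  // Deep-copy the backing storage only when it is shared.
  private mutating func uniqueStorage() -> Storage {
    if !isKnownUniquelyReferenced(&storage) {
      storage = Storage(values: storage.values)
    }
    return storage
  }

  var values: [UInt32] {
    get { storage.values }
    set { uniqueStorage().values = newValue }
  }
}

var a = Message()
a.values = [1, 2, 3]
let b = a        // cheap: shares storage, nothing is copied yet
var c = b
c.values = [9]   // now the deep copy actually runs; a and b are untouched
print(a.values, c.values)
```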
For dealing with `UnknownStorage`, would something like this work? (It passes all the tests.)
```swift
public struct UnknownStorage: Equatable {
  public static let emptyData = NSMutableData()

  /// The raw protocol buffer binary-encoded bytes that represent the unknown
  /// fields of a decoded message.
  internal var _data: NSMutableData = emptyData

  internal mutating func _uniqueData() -> NSMutableData {
    if !isKnownUniquelyReferenced(&_data) {
      _data = NSMutableData(data: _data as Data)
    }
    return _data
  }

  public var data: Data { return _data as Data }

  public static func ==(lhs: UnknownStorage, rhs: UnknownStorage) -> Bool {
    return lhs.data == rhs.data
  }

  public init() {}

  internal mutating func append(protobufData: Data) {
    _uniqueData().append(protobufData)
  }

  public func traverse<V: Visitor>(visitor: inout V) throws {
    if !data.isEmpty {
      try visitor.visitUnknown(bytes: data)
    }
  }
}
```
Testing with a ~1 million element array of 27-byte structs, I see 1.6x better memory use over the current approach (125,044 KB vs. 78,140 KB). Still a long way to go to near-native sizing, though (27,448 KB). I am able to get there by removing `UnknownStorage` altogether, but that won't be good enough for many use cases.
EDIT: Although, encoding speed now suffers quite a bit because of the bridging. This hack seems to fix it. I presume there's a better way to treat `Data` as a reference type rather than a value type, but I'm not aware of it.
```swift
public struct UnknownStorage: Equatable {
  internal class DataReference {
    var data: Data
    init(data: Data) {
      self.data = data
    }
  }

  internal static let emptyData = DataReference(data: Internal.emptyData)

  /// The raw protocol buffer binary-encoded bytes that represent the unknown
  /// fields of a decoded message.
  internal var _data: DataReference = emptyData

  internal mutating func _uniqueData() -> DataReference {
    if !isKnownUniquelyReferenced(&_data) {
      _data = DataReference(data: _data.data)
    }
    return _data
  }

  public var data: Data { return _data.data }

  public static func ==(lhs: UnknownStorage, rhs: UnknownStorage) -> Bool {
    return lhs.data == rhs.data
  }

  public init() {}

  internal mutating func append(protobufData: Data) {
    _uniqueData().data.append(protobufData)
  }

  public func traverse<V: Visitor>(visitor: inout V) throws {
    if !data.isEmpty {
      try visitor.visitUnknown(bytes: data)
    }
  }
}
```
Sorry, didn't get a chance to really think about this until today.
I was going to say the bridging hit of using Foundation could be non-ideal, and I wasn't sure if it could cause any hiccups for folks on Linux. But with your edit, you're basically doing what we did with the `_StorageClass` support, just for the data itself.
In the cases where the message is already using storage, it might be possible to just remap the `unknownFields` into the storage, so that it never ends up as two heap references for those messages. But that likely means sticking a protocol on the `unknownFields` of the message so it can use something like the above, or something that shims over to an inline representation in the `_StorageClass`.
Would that really make a huge difference, though? Standard `Data` does a heap allocation under the hood anyway, and the `DataReference` approach would only require one tiny additional heap allocation for the "pointer". Even if you store the `Data` in `_StorageClass`, you'll still have the two heap allocations.
I looked at the implementation of `Data` and I think I see the problem. Here's the relevant part:
```swift
public struct Data {
  @_versioned internal var _backing: _DataStorage
  @_versioned internal var _sliceRange: Range<Index>
}
```
In a nutshell, Foundation's `Data` has three values in it: a pointer to the backing storage, and two integers giving the first and last valid byte offsets within that storage. I presume this was done to make it very fast to return a sub-`Data`: you don't need to copy the backing storage, just create a new struct with different offsets.

Your proposed versions have a single pointer and no other data. That removes the two integers, cutting a 24-byte struct down to just 8 bytes. That's significant for small messages such as yours, especially in the common case where there are no unknown fields.
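A minimal sketch of the size arithmetic (using a stand-in backing class, not Foundation's actual `_DataStorage`): one class reference plus a `Range<Int>` is three machine words, while a bare reference is one.

```swift
final class Backing { var bytes: [UInt8] = [] }

// Mirrors the reference-plus-range shape described above.
struct SliceableData {
  var backing: Backing
  var sliceRange: Range<Int>
}

// Mirrors the single-pointer shape of the proposed versions.
struct RefOnly {
  var backing: Backing
}

print(MemoryLayout<SliceableData>.size)  // 24 on 64-bit: 8-byte ref + 16-byte range
print(MemoryLayout<RefOnly>.size)        // 8: just the pointer
```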
In your first version (using `NSMutableData`), you could try making the `_data` field optional. An optional class field uses no more space (in C terms, it simply allows the pointer to be NULL), but checking `data == nil` is faster than checking `data.isEmpty` in the `traverse` method.
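The "no more space" claim is easy to verify: Swift uses the null pointer as the nil representation for an optional class reference, so wrapping a reference in `Optional` adds zero bytes.

```swift
final class Box { var data: [UInt8] = [] }

print(MemoryLayout<Box>.size)   // 8 on 64-bit
print(MemoryLayout<Box?>.size)  // still 8; nil is encoded as the null pointer
```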
> Would that really make a huge difference, though? Standard `Data` does a heap allocation under the hood anyway, and the `DataReference` approach would only require one tiny additional heap allocation for the "pointer". Even if you store the `Data` in `_StorageClass`, you'll still have the two heap allocations.
If you count the allocation within `Data`, then yes, this model would always be three allocs instead of two; I was just trying to focus on what the code was directly doing.
As far as the cost of those little allocations: when we did some of the original benchmarking/profiling, the use of a single static instance for default storage, and of the static empty data and strings, came out of numbers @tbkka ran in benchmarks.
Since your original concern was the heap overhead of having lots of instances, that same logic would seem to apply to the management of unknown fields. All it takes is a case like yours, where the heap is mostly avoided, and then a server gets revised and a new field starts getting sent. All clients that haven't been updated (and those tend to be at the mercy of end users' habits of updating their apps) would suddenly need the extra heap allocations for your thousands of objects, and performance would suffer.
Again, not saying it has to fold into the `_StorageClass`; just pointing out that some of your initial concerns apply just as much when `unknownFields` are needed.
Worth noting: your `Atom` message has < 16 primitive fields including the two nested `ID` messages, so there is an argument for not using a `StorageClass` for `Atom` in this case; the overall structure (including nested messages) satisfies our criteria for a "small" message.

Unfortunately, I don't have time right now to dig into the possible code generator changes myself.
> Worth noting: Your `Atom` message has < 16 primitive fields including the two nested `ID` messages. So there is an argument for not using a `StorageClass` in this case for `Atom`; the overall structure (including nested messages) satisfies our criteria for a "small" message.
The current check also makes any single message/group field cause storage. This could likely be loosened to instead check for transitive recursion through messages, if we wanted.
Spun this out into #735; if we can nail down what we want the new logic to be over there, I should be able to make the changes pretty quickly.
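The loosened check could look roughly like the following sketch (with toy descriptor types, not protoc-gen-swift's real ones): generate a `_StorageClass` only when a message can transitively reach itself through its message fields.

```swift
// Toy stand-in for a message descriptor: just a name and the names of
// the message types of its message/group fields.
struct MessageDescriptor {
  let name: String
  let messageFieldTypes: [String]
}

// True if `name` can transitively contain itself (i.e., sits on a cycle
// reachable through message fields), which is when storage is unavoidable.
func isTransitivelyRecursive(_ name: String,
                             in table: [String: MessageDescriptor],
                             visiting: Set<String> = []) -> Bool {
  if visiting.contains(name) { return true }  // cycled back to an ancestor
  guard let desc = table[name] else { return false }
  let path = visiting.union([name])
  return desc.messageFieldTypes.contains {
    isTransitivelyRecursive($0, in: table, visiting: path)
  }
}

let table = [
  "Atom": MessageDescriptor(name: "Atom", messageFieldTypes: ["ID", "ID"]),
  "ID": MessageDescriptor(name: "ID", messageFieldTypes: []),
  "Node": MessageDescriptor(name: "Node", messageFieldTypes: ["Node"]),
]
print(isTransitivelyRecursive("Atom", in: table))  // false: no cycle, no storage needed
print(isTransitivelyRecursive("Node", in: table))  // true: self-referential, needs storage
```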
@archagon -- Could you try your tests with the changes proposed in #900 ? I'm curious how much of a difference it really makes.
Circling back to this: on `Data` specifically, there's been some pondering about breaking the dependency on Foundation. If bytes fields got modeled as `[UInt8]`, that would also help with the specific problem mentioned here with `Data`.
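Under that assumption, the unknown-field buffer could be a plain copy-on-write Swift array with no Foundation bridging at all. A hedged sketch (`UnknownBytes` is a hypothetical name, not a proposed SwiftProtobuf API):

```swift
struct UnknownBytes: Equatable {
  // Array gives copy-on-write for free, and an empty array
  // references a shared static buffer rather than allocating.
  var bytes: [UInt8] = []

  mutating func append(_ raw: [UInt8]) {
    bytes.append(contentsOf: raw)
  }
}

var u = UnknownBytes()
u.append([0x08, 0x96, 0x01])  // e.g. field 1, varint 150
let copy = u                  // shares the buffer until either side mutates
print(copy.bytes.count)       // 3
```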
I'm trying to use protobufs with fairly large (100,000+ element) arrays of structs. These arrays receive dozens of changes a second and are serialized and deserialized every few seconds, so they must remain efficient and deterministic at all times: contiguous memory, minimal heap use, structs all the way down. Unfortunately (unless I'm missing something), it seems the current implementation of swift-protobuf can't handle this case to any degree of complexity, since any nested message automatically invokes the `_StorageClass` path and causes every array element to allocate on the heap. One could splay the nested messages out onto the root message, but you'd still run into the 16-field limit. Even worse: according to the profiler, the protobuf array uses about 6x the memory of an equivalent native array. (Not sure if this is accurate... maybe something to do with that `Internal.emptyData` in `unknownFields`?) Is there any way around this?