apple / swift-protobuf

Plugin and runtime library for using protobuf with Swift
Apache License 2.0
4.55k stars · 443 forks

_StorageClass makes large arrays of non-trivial messages impractical. #733

Open archagon opened 6 years ago

archagon commented 6 years ago

I'm trying to use protobufs with fairly large (100,000+ element) arrays of structs. These arrays receive dozens of changes a second and are serialized and deserialized every few seconds, so they must remain efficient and deterministic at all times: contiguous memory, minimal heap use, structs all the way down. Unfortunately (unless I'm missing something) it seems that the current implementation of swift-protobuf is incapable of handling this case to any degree of complexity, since any nested messages automatically invoke the _StorageClass path and cause every array element to allocate on the heap. One could splay out the nested messages onto the root message, but you'd still run into the 16 field limit. Even worse: according to the profiler, the protobuf array uses about 6x the memory of an equivalent native array. (Not sure if this is accurate... maybe something to do with that Internal.emptyData in unknownFields?)

Is there any way around this?

archagon commented 6 years ago

To clarify, the array of messages will be used as the backing store for a string CRDT, so every byte and every millisecond counts.

thomasvl commented 6 years ago

There currently isn't a way to avoid the _StorageClass.

Even if there was, wouldn't Array still be likely to do a heap allocation under the hood? i.e. - it can't put every sized array on the stack, and surely not ones with 100,000+ elements.

tbkka commented 6 years ago

Internally, SwiftProtobuf can generate structs with or without a backing StorageClass. The decision for when to generate a StorageClass is somewhat arbitrary at the moment; we'd appreciate input into how to make this better, especially if that input is based on reproducible performance measurements.

There are two motivations for the StorageClass design:

  1. Proto allows circular references and a pure struct-based implementation cannot support that. At one time, I did experiment with code to analyze the full tree of data structures in order to use StorageClass only when there was in fact a circular reference. That code didn't ship simply because I wasn't comfortable that I understood all of the implications well enough to justify the complexity, but the approach could be resurrected.
  2. Structs sometimes need to be copied. For the most part, Swift does a good job of avoiding this, but copying a small wrapper struct (plus associated refcounting) is often cheaper than copying a large struct.
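As a rough illustration of the pattern these two motivations lead to (a simplified sketch with made-up names, not the actual generated code):

```swift
// Simplified sketch of the copy-on-write _StorageClass pattern.
// Field names are illustrative; real generated code differs.
struct ExampleMessage {
  private final class _StorageClass {
    var value: UInt32 = 0
    // A class-typed backing store lets a message contain itself,
    // which a pure struct layout cannot do.
    var child: ExampleMessage? = nil

    init() {}
    init(copying other: _StorageClass) {
      value = other.value
      child = other.child
    }
  }

  private var _storage = _StorageClass()

  // Copy the backing object only when it is shared (copy-on-write).
  private mutating func _uniqueStorage() -> _StorageClass {
    if !isKnownUniquelyReferenced(&_storage) {
      _storage = _StorageClass(copying: _storage)
    }
    return _storage
  }

  var value: UInt32 {
    get { return _storage.value }
    set { _uniqueStorage().value = newValue }
  }
}
```

Copying an `ExampleMessage` copies only the single reference, which is cheap for large messages; the downside is that every instance costs a heap allocation, which is exactly the complaint above for 100,000-element arrays.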

Unknown field handling is another complication. At one time, proto3 did not preserve unknown fields (in Google's C++ implementation), but this was changed. Some of the implementations have an option to suppress unknown field handling, and we could consider adding such an option.

archagon commented 6 years ago

> Even if there was, wouldn't Array still be likely to do a heap allocation under the hood? i.e. - it can't put every sized array on the stack, and surely not ones with 100,000+ elements.

Yes, but only one allocation — not 100,001!
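To spell out the distinction (illustrative types, not from the thread):

```swift
// An array of structs stores its elements inline in a single buffer;
// an array of class instances stores references to separately
// heap-allocated objects.
struct ValueAtom { var x: UInt32 }
final class RefAtom { var x: UInt32 = 0 }

let values = [ValueAtom](repeating: ValueAtom(x: 0), count: 100_000)  // ~1 heap allocation
let refs = (0..<100_000).map { _ in RefAtom() }  // ~100,001 heap allocations
```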

> Internally, SwiftProtobuf can generate structs with or without a backing StorageClass. The decision for when to generate a StorageClass is somewhat arbitrary at the moment; we'd appreciate input into how to make this better, especially if that input is based on reproducible performance measurements.

I've been doing some (very, very unscientific) benchmarking for my particular use case. My messages are laid out as follows, and I've also hand-written a native struct with the same spec:

message Atom {
    required ID id = 1;
    required ID cause = 2;

    oneof value {
        uint32 insert = 3;
        bool delete = 4;
    }
}

message ID {
    required uint32 luid = 1;
    required uint32 lamport = 2;
    required uint32 index = 3;
}

With a million-element array of each, proto takes up 5.7x more space in memory than native (153MB vs. 27MB) and is 12.5x worse when doing the equivalent to calloc (275ms vs. 22ms), 3.4x worse when doing a straight copy (67ms vs. 20ms), 10x worse with random access (3ms vs. 0.3ms), and about the same with random insertions (1ms vs. 1.2ms). On the other hand, encoding and decoding are quite quick and produce data with good size compared to the simple binary coder I'm using. It's certainly usable, but not ideal. That size difference in particular is very troubling.
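The measurements are along these lines (an unscientific sketch; `NativeAtom`/`NativeID` stand in for the hand-written equivalents, and the timing helper is my own, not part of any library):

```swift
import Foundation

// Hand-written native equivalents of the Atom/ID messages above.
struct NativeID { var luid: UInt32; var lamport: UInt32; var index: UInt32 }
struct NativeAtom { var id: NativeID; var cause: NativeID; var insert: UInt32 }

// Crude wall-clock timing helper.
func measure(_ label: String, _ body: () -> Void) -> Double {
  let start = DispatchTime.now().uptimeNanoseconds
  body()
  let ms = Double(DispatchTime.now().uptimeNanoseconds - start) / 1_000_000
  print("\(label): \(ms) ms")
  return ms
}

let count = 1_000_000
var atoms = [NativeAtom]()
_ = measure("fill") {
  let zero = NativeID(luid: 0, lamport: 0, index: 0)
  atoms = [NativeAtom](repeating: NativeAtom(id: zero, cause: zero, insert: 0),
                       count: count)
}
_ = measure("copy") {
  var copy = atoms
  copy[count / 2].insert = 1  // mutate so copy-on-write actually copies the buffer
}
```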

I tried manually fiddling with the generated Swift file to remove StorageClass from the generated structs. The resulting code seems to work fine but, strangely, is slower in almost every respect than the heap-allocated code! I need to dig into this more; it doesn't make much sense to me.

Glad to hear that there's nothing intrinsic about StorageClass w.r.t. message coding. This means that, at the very least, a flag could be set to make everything a struct if the user knows what they're doing. Same with unknown field handling, if it's indeed what's causing the size to balloon so much. (Could data in UnknownStorage be made a class?)

archagon commented 6 years ago

OK, I've made a test build of protoc-gen-swift with _StorageClass disabled, and also turned UnknownStorage into a stub class. Now performance is mostly correlated to struct size. The new proto Atom structs are about 2x the size of my native structs, and performance is around 2x worse than native in every comparison. Re-enabling _StorageClass while keeping UnknownStorage disabled makes some things a lot slower than native (calloc) and some things surprisingly faster (straight copy, random insertion), though maybe I'm not doing a good enough job of clobbering the cache between tests, or making sure that all the data is actually copied.

Heap or not, I really think something needs to be done about that UnknownStorage memory.

(FWIW, the reason for that remaining 2x size difference is that the proto structs get poorly packed on account of the optionals. Three UInt32s take up 12 bytes, while three UInt32?s take up 21 bytes. I wonder if the required field rule could be used to optimize this? Manually removing the optionals in ID seems to have no ill effects. EDIT: Figured out that proto3 does not require the optionals. Now everything is quite compact.)
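The packing difference is easy to see with MemoryLayout (a standalone check; the 12 vs. 21 byte numbers match what's described above):

```swift
// Three non-optional UInt32s pack into 12 bytes; each UInt32? needs an
// extra tag byte plus alignment padding, giving 21 bytes for the same payload.
struct PackedID   { var luid: UInt32;  var lamport: UInt32;  var index: UInt32 }
struct OptionalID { var luid: UInt32?; var lamport: UInt32?; var index: UInt32? }

print(MemoryLayout<PackedID>.size)    // 12
print(MemoryLayout<OptionalID>.size)  // 21
```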

archagon commented 6 years ago

Oops, yeah, my copy tests weren't actually copying anything on the _StorageClass side due to the _uniqueStorage trick. With the test corrected and using the current protobuf code, straight copies are 20x (!!!) slower than native, 401ms vs. 22ms; with _StorageClass disabled and UnknownStorage nerfed, only 2x slower than native. (And naturally, the heap penalty will cascade with additional nested messages.)

archagon commented 6 years ago

For dealing with UnknownStorage, would something like this work? (It passes all the tests.)

public struct UnknownStorage: Equatable {
  public static let emptyData = NSMutableData()

  /// The raw protocol buffer binary-encoded bytes that represent the unknown
  /// fields of a decoded message.
  internal var _data: NSMutableData = emptyData
  internal mutating func _uniqueData() -> NSMutableData {
    if !isKnownUniquelyReferenced(&_data) {
      _data = NSMutableData(data: _data as Data)
    }
    return _data
  }
  public var data: Data { return _data as Data }

  public static func ==(lhs: UnknownStorage, rhs: UnknownStorage) -> Bool {
    return lhs.data == rhs.data
  }

  public init() {}

  internal mutating func append(protobufData: Data) {
    _uniqueData().append(protobufData)
  }

  public func traverse<V: Visitor>(visitor: inout V) throws {
    if !data.isEmpty {
      try visitor.visitUnknown(bytes: data)
    }
  }
}

Testing with a ~1 million element array of 27-byte structs, I see 1.6x better memory use than the current approach (78,140 KB vs. 125,044 KB). Still a long way to go to near-native sizing, though (27,448 KB). I am able to get there by removing UnknownStorage altogether, but that won't be good enough for many use cases.

EDIT: Although, encoding speed now suffers quite a bit because of the bridging. This hack seems to fix it. I presume there exists a better way to treat Data as a reference type rather than a value type, but I'm not aware of it.

public struct UnknownStorage: Equatable {
  internal class DataReference {
    var data: Data
    init(data: Data) {
      self.data = data
    }
  }

  internal static let emptyData = DataReference(data: Internal.emptyData)

  /// The raw protocol buffer binary-encoded bytes that represent the unknown
  /// fields of a decoded message.
  internal var _data: DataReference = emptyData
  internal mutating func _uniqueData() -> DataReference {
    if !isKnownUniquelyReferenced(&_data) {
      _data = DataReference(data: _data.data)
    }
    return _data
  }
  public var data: Data { return _data.data }

  public static func ==(lhs: UnknownStorage, rhs: UnknownStorage) -> Bool {
    return lhs.data == rhs.data
  }

  public init() {}

  internal mutating func append(protobufData: Data) {
    _uniqueData().data.append(protobufData)
  }

  public func traverse<V: Visitor>(visitor: inout V) throws {
    if !data.isEmpty {
      try visitor.visitUnknown(bytes: data)
    }
  }
}

thomasvl commented 6 years ago

Sorry, didn't get a chance to really think about this until today.

I was going to say that the bridging hit of using Foundation could be non-ideal, and I wasn't sure whether that could cause any hiccups for folks on Linux. But with your edit, you are basically doing what we did with the _StorageClass support, just for the data itself.

In the cases where the message is already using storage, it might be possible to just remap the unknownFields into that storage, so that it never ends up as two heap references for those messages. But that likely means putting a protocol on the unknownFields of the message, so it can either use something like the above or use something that shims over to something inline in the _StorageClass.

archagon commented 6 years ago

Would that really make a huge difference, though? Standard Data does a heap allocation under the hood anyway, and the DataReference approach would only require one tiny additional heap allocation for the "pointer". Even if you store the Data in _StorageClass, you'll still have the two heap allocations.

tbkka commented 6 years ago

I looked at the implementation for Data and I think I see the problem. Here's the relevant part:

public struct Data {
    @_versioned internal var _backing : _DataStorage
    @_versioned internal var _sliceRange: Range<Index>
}

In a nutshell, Foundation's Data has three values in it: A pointer to the backing storage, and two integers representing the first and last valid byte offsets in that storage. I presume this was done to make it very fast to return a sub-Data object: you don't need to copy the backing storage, just create a new struct with different first/last offsets.

Your proposed versions all have a single pointer and no other data. That basically removes two integers, cutting a 24-byte struct down to just 8 bytes. That's significant for small messages such as yours, especially in the common case where there are no unknown fields.
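For comparison, the sizes can be checked directly (note that Data's internal layout has changed across Swift/Foundation versions, so the first number may differ from the 24 bytes described above on newer toolchains):

```swift
import Foundation

// Data carries a backing reference plus range information, while a
// bare class reference is a single pointer.
print(MemoryLayout<Data>.size)       // 24 at the time of this thread
print(MemoryLayout<AnyObject>.size)  // 8 on 64-bit platforms
```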

In your first version (using NSMutableData), you could try making the _data field optional. An optional class field uses no more space (in C terms, it simply allows the pointer to be NULL) but it's faster to check for data == nil than to check for data.isEmpty in the traverse method.
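A sketch of that tweak, applied here to the DataReference variant rather than the NSMutableData one (since isKnownUniquelyReferenced is only dependable for native Swift classes); this is hypothetical code, not SwiftProtobuf's actual implementation, and the Visitor-based traverse method is omitted to keep it self-contained:

```swift
import Foundation

// The common "no unknown fields" case is just a nil pointer, so the
// struct costs nothing extra and checks reduce to a nil test.
public struct UnknownStorageOpt: Equatable {
  internal final class DataReference {
    var data: Data
    init(data: Data) { self.data = data }
  }

  internal var _data: DataReference? = nil

  internal mutating func _uniqueData() -> DataReference {
    // The Optional overload of isKnownUniquelyReferenced returns false
    // for nil, so this path also covers first use.
    if isKnownUniquelyReferenced(&_data), let existing = _data {
      return existing
    }
    let unique = DataReference(data: _data?.data ?? Data())
    _data = unique
    return unique
  }

  public var data: Data { return _data?.data ?? Data() }

  public static func ==(lhs: UnknownStorageOpt, rhs: UnknownStorageOpt) -> Bool {
    return lhs.data == rhs.data
  }

  public init() {}

  internal mutating func append(protobufData: Data) {
    _uniqueData().data.append(protobufData)
  }
}
```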

thomasvl commented 6 years ago

> Would that really make a huge difference, though? Standard Data does a heap allocation under the hood anyway, and the DataReference approach would only require one tiny additional heap allocation for the "pointer". Even if you store the Data in _StorageClass, you'll still have the two heap allocations.

If you count the allocation within Data, then yes, this model would always be three allocs instead of two; I was just trying to focus on what the code was directly doing.

As for the cost of those little allocations: the use of a single static instance for default storage, and the static empty data and strings, came out of some of the numbers @tbkka ran during the original benchmarking/profiling.

Since your original concern was the heap overhead of having lots of instances, it would seem that logic could equally apply to the management of unknown fields. All it takes is a case like yours, where the data mostly avoids the heap, until a server gets revised and a new field starts getting sent. All clients that haven't been updated (and those tend to be at the mercy of end users' habits of updating their apps) would then need the extra heap allocations for your thousands of objects, and performance would suddenly suffer.

Again, not saying it has to fold into the _StorageClass, just pointing out some of your initial concerns could just as equally apply to when unknownFields are needed.

tbkka commented 6 years ago

Worth noting: Your Atom message has < 16 primitive fields including the two nested ID messages. So there is an argument for not using a StorageClass in this case for Atom; the overall structure (including nested messages) satisfies our criteria for a "small" message.

Unfortunately, I don't have time right now to dig into the possible code generator changes myself.

thomasvl commented 6 years ago

> Worth noting: Your Atom message has < 16 primitive fields including the two nested ID messages. So there is an argument for not using a StorageClass in this case for Atom; the overall structure (including nested messages) satisfies our criteria for a "small" message.

The current check is also that any single message/group field causes storage. This likely could be loosened to checking instead for transitive recursion through messages if we wanted.

Spun this out into #735; if we can nail down what we want the new logic to be over there, I should be able to make the changes pretty quickly.
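For reference, the "transitive recursion through messages" check could amount to simple cycle detection over the message-field graph (an illustrative sketch with a toy graph representation; protoc-gen-swift's real descriptor API looks nothing like this):

```swift
// Returns true if `message` participates in or reaches a reference
// cycle through message-typed fields. `graph` maps a message name to
// the message types of its fields.
func isTransitivelyRecursive(_ message: String,
                             graph: [String: [String]],
                             path: inout Set<String>) -> Bool {
  if path.contains(message) { return true }
  path.insert(message)
  defer { path.remove(message) }
  return graph[message, default: []].contains {
    isTransitivelyRecursive($0, graph: graph, path: &path)
  }
}

// Atom -> ID has no cycle, so neither message would need _StorageClass
// under this rule; a self-referential Node still would.
let graph = ["Atom": ["ID", "ID"], "ID": [], "Node": ["Node"]]
var path = Set<String>()
print(isTransitivelyRecursive("Atom", graph: graph, path: &path))  // false
print(isTransitivelyRecursive("Node", graph: graph, path: &path))  // true
```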

tbkka commented 5 years ago

@archagon -- Could you try your tests with the changes proposed in #900 ? I'm curious how much of a difference it really makes.

thomasvl commented 4 years ago

Circling back to this -