BalestraPatrick opened this issue 2 years ago
Code size is a common concern with code-generated approaches such as this. Protobuf implementations for some other languages rely heavily on reflection which makes them smaller but significantly slower.
If you're only using the binary encoding, it should be easy to strip out the field names and other content that's only there to support JSON and TextFormat encoding. Right now, I think this would require a small change to the code generator, but I've long been interested in emitting that content as separate .swift sources that contain only those extensions. It would then be easy to delete those files. (Alternately, we could consider splitting the JSON and TextFormat support into a separate generator.) You could also look critically at whether there are other parts of the generated code that you might omit: for example, the generated `==` implementations are somewhat bulky and may not be needed in your application.
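The "separate .swift sources" idea could look something like the sketch below. Everything here is hypothetical for illustration: `ProtoNameProviding` is a local stand-in for the real runtime protocol, and `Greeting` is an invented message.

```swift
// Local stand-in for the runtime's name-providing protocol, so this
// sketch is self-contained; the real one lives in the SwiftProtobuf runtime.
protocol ProtoNameProviding {
    static var protoFieldNames: [Int: String] { get }
}

// Greeting.pb.swift: the core generated message, sufficient for binary coding.
struct Greeting {
    var text: String = ""
}

// Greeting.pb.names.swift: a separate generated file holding only the
// field-name table needed for JSON/TextFormat. If the generator emitted
// this extension into its own file, binary-only users could simply
// delete the file and its strings would never be linked.
extension Greeting: ProtoNameProviding {
    static let protoFieldNames: [Int: String] = [1: "text"]
}

print(Greeting.protoFieldNames[1] ?? "")  // prints "text"
```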
fyi - #18 is open for tracking splitting out the textual support.
Wrote a little wrapper to patch the generated Swift source code to remove conformance to `SwiftProtobuf._ProtoNameProviding`, which seems to have shaved off about 10% of the total binary size of the generated Swift protobuf in our app (according to the linkmap). Would be nice for this to be an option in the generator for sure!
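The wrapper itself isn't shown in the thread, but the core of such a patch could be a simple text rewrite along these lines (a toy sketch; the input is an invented fragment of generator output, and a real wrapper would also need to delete the generated name-map property that the conformance requires):

```swift
import Foundation

// An invented fragment of generator output, for illustration only.
let generated = """
extension Greeting: SwiftProtobuf.Message, SwiftProtobuf._MessageImplementationBase, SwiftProtobuf._ProtoNameProviding {
  static let protoMessageName: String = "Greeting"
}
"""

// Strip the conformance; with it gone, the name-map extension (and the
// field-name strings it carries) can be removed from the file as well.
let patched = generated.replacingOccurrences(
    of: ", SwiftProtobuf._ProtoNameProviding",
    with: ""
)
print(patched.contains("_ProtoNameProviding"))  // false
```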
I briefly looked into removing `==` as well, but `_MessageImplementationBase` is `Hashable`, so it needs an implementation of it or a change to the runtime.
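For context, the generated `==` being discussed is essentially a field-by-field comparison. A hand-written stand-in (hypothetical shape, not the generator's exact output) shows where the bulk comes from: each field adds a comparison, so messages with many fields produce a lot of code.

```swift
struct Point: Hashable {
    var x: Int32 = 0
    var y: Int32 = 0

    // Roughly the shape of the generated operator: compare every field in
    // turn. Because the base protocol refines Hashable, a message can't
    // simply drop this without either an implementation or a runtime
    // change, as noted above.
    static func == (lhs: Point, rhs: Point) -> Bool {
        if lhs.x != rhs.x { return false }
        if lhs.y != rhs.y { return false }
        return true
    }
}

print(Point(x: 1, y: 2) == Point(x: 1, y: 2))  // true
```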
update: Turns out we're using JSON encoding/decoding a little bit in the codebase and can't merge this, sadly
Another improvement I wanted to look at in this area to reduce the amount of code generation was to make serialization and other related functionality (hashing, equatability) table-driven. Unfortunately, the only way to get static arrays of constant data into a data segment is through a SIL transform that only runs on optimized builds, and even when that transform applies is very unpredictable. If it isn't applied, then we'd end up generating code that heap-allocates those arrays and populates them element-by-element, and that code would run the first time a particular message is serialized, parsed, equality-tested, or hashed, which would make client code performance unpredictable in ways that we should probably avoid*.
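A minimal sketch of the table-driven idea (not SwiftProtobuf's actual design): equality driven by a static per-message table of per-field comparators instead of generated per-field code. Whether a `static let` array like this lands in a constant data segment or is heap-allocated and populated on first use is exactly the unpredictability described above.

```swift
struct Pair: Equatable {
    var first: Int32 = 0
    var second: Int32 = 0

    // Hypothetical field table; a real design would likely use key paths
    // or byte offsets rather than closures. In optimized builds a SIL
    // transform may turn this into constant data, but that is not
    // guaranteed; otherwise it is built on first use.
    static let equalityTable: [(Pair, Pair) -> Bool] = [
        { $0.first == $1.first },
        { $0.second == $1.second },
    ]

    static func == (lhs: Pair, rhs: Pair) -> Bool {
        equalityTable.allSatisfy { $0(lhs, rhs) }
    }
}

print(Pair(first: 1, second: 2) == Pair(first: 1, second: 2))  // true
```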
* To be fair, this is already happening with the name tables we generate for text/JSON serialization, but that's restricted to a much smaller set of serialization operations that are expected to be less efficient than binary format.
What if there were an option to opt in to only one serialization mechanism? Say a client only needs binary encoding/decoding? Would that make any difference in the size of the generated code?
The idea of having an opt-in is a good one, and it's something we've discussed on many occasions. It would certainly make some difference, though someone would have to actually try it and measure to figure out how much savings. But the detailed design is tricky:
At this point, I would say that we have lots of good ideas; we really need some folks to actually try implementing some of these ideas and see how well they work out.
Since a Visitor/Decoder pattern is used by the library, there isn't a lot of code specific to the formats. At the moment, the field numbers and binary encoding information are part of the base generated code, as that's a very small amount of data. The textual support then layers on the needed mapping between field numbers and names. Since the JSON names can mostly be derived from the TextFormat names, in most cases we just need one string and a marker saying we can derive the other one. Splitting that into two completely different things could result in even larger code when folks need both, since we'd potentially be more verbose instead of allowing things to be derived.
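The "one string plus a marker" scheme can be sketched with a local stand-in mirroring the shape of the runtime's name-map entries (the real type is `SwiftProtobuf._NameMap`; this invented version only shows the derivation idea):

```swift
// Store the TextFormat (proto) name plus a marker for how to obtain the
// JSON name, instead of always storing both strings.
enum NameEntry {
    case same(proto: String)                  // JSON name equals the proto name
    case standard(proto: String)              // JSON name derivable: snake_case -> lowerCamelCase
    case unique(proto: String, json: String)  // no derivation possible; store both

    var jsonName: String {
        switch self {
        case .same(let p):
            return p
        case .standard(let p):
            // Derive lowerCamelCase from the snake_case proto name.
            let parts = p.split(separator: "_").map(String.init)
            guard let first = parts.first else { return p }
            let rest = parts.dropFirst().map { $0.prefix(1).uppercased() + String($0.dropFirst()) }
            return ([first] + rest).joined()
        case .unique(_, let j):
            return j
        }
    }
}

print(NameEntry.standard(proto: "user_id").jsonName)  // userId
```

Keeping binary, TextFormat, and JSON together lets most entries pay for a single string; a hard split would force the JSON side to spell every name out.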
One thing #1240 doesn't yet take on is splitting up the core runtime library so that if you don't need the textual formats, you don't have to link that backing code. No effort has been made yet to see how much that might save. Using that PR as a starting point would likely make sense to start getting more clarity into what the potential savings would be.
👋 Related to the size of the generated code: the size of the SwiftProtobuf SDK itself is also considerable, at 1.4MB for the latest version, 1.25.2 (this is the size of the binary built statically inside a production app, measured using the linkmap).
Adding this comment here with the size information just for context
Hello!
Many parts of our codebase use SwiftProtobuf. Recently we started tracking app size more accurately, and we noticed a trend that is pretty worrying for us: generated Swift proto code increases our app size a lot. We recently removed a single proto file of about 400 LoC containing about 70 `message` definitions (including the various transitive imports), and the generated code was about 5KLoC. 304KB of our app size was attributed to symbols coming from the generated Swift Protobuf code. We are building with `SWIFT_OPTIMIZATION_LEVEL = -Osize` in release mode, but I wonder if there are other ways to reduce the size of the generated Swift code. I can't exactly share my full proto, but I was wondering if this is a known issue with Swift, or if maybe there are ways to reduce the impact of the generated code. Does anyone have experience with this particular issue?