Arwalk / zig-protobuf

a protobuf 3 implementation for zig.
MIT License
219 stars 25 forks source link

bootstraps the generator plugin and compiles the plugin with its own … #17

Closed menduz closed 1 year ago

menduz commented 1 year ago

This PR chronologically

  1. fine tunes the Go code generator and fixes some issues with the protobuf.zig library
  2. creates a Zig native plugin for protoc
  3. fine tunes the plugin's output until the plugin can compile itself with the output of the previous run
  4. removes the go dependency
  5. adds and fixes tests

Original PR and history at https://github.com/menduz/zig-protobuf/pull/1

Pending work:

Arwalk commented 1 year ago

Starting the review now. Might take some time considering how much work you did :)

menduz commented 1 year ago

Many thanks for your review. I've pushed most of the suggestions and I'd like to discuss about the optional/required-ness of the fields. I also mistakenly pushed a new set of changes to the main branch. Including oneof support and partial support for maps using lists of entries. Also I cleaned up the code and added mixins to reduce the emitted code (pub usingnamespace protobuf.MessageMixins(@This());). I've also added a good amount of tests including unitary tests, and some "blackbox" tests with big encoded files.

Optional/Required fields

In regards to optional fields, as of now, all fields are being emitted as optional unless required is explicitly specified in the .proto file. The wire protocol doesn't break if any of those required are omitted. The for knowing the value are relatively straight forward:

From my experience, the use of some required fields was not a deal breaker, but the protocol enables us to swap a ?u32 by a u32 and I can see the benefit on that. Would you elaborate on your PoV?

Oneof

I revamped your original code for unions and completed the encoding/decoding part and added extensive tests. Please review that new part.

Maps

For simplicity, I've leveraged the default behavior for protoc and enabled maps via list of autogenerated entries. This is something I'd like to work a bit more but to close up some work I'd like to defer it for the next PR.

Open questions about allocations and code speed

Disclamer: you may notice silly things in my zig code, this is the project I'm using to learn most of the language. I do have extensive experience in other languages including C++ that I can certainly leverage right now.

I thought about enabling all strings and submessages as pointers. Because recursive messages are currently not possible.

 const Message = struct {
-  recursive: ?Message
+  recursive: ?*Message
 }

But that adds a considerable overhead for the daily use, I understand that messages are currently stored entirely on the stack and are passed to functions as reference or copied, but that's out of our control. Pointers may fix this to a certain level, also it would enable more precise control of *WHERE* in the memory we store each message.

I was tempted to introduce it like the ArrayList does, something like the following snippet, but dozens of unanswered questions arised. Most of them are unknown unknowns, since my experience with zig is limited. I'd love to have some equivalent concepts to "std::move" to anchor some concepts.

fn Messagee(comptime allocator: Allocator) {
  return struct {
    ...
  };
} 

And taking some learnings from the protobuf C++ library, I think at this point we should have a special (arena) allocator for all messages. Because if we keep the recursive deinit semantics, we may be freeing some parts of the memory that other parts of the application are still considering alive. This exact same point applies to strings. If we swap []const u8 by a pointer, how do we decide over the lifecycle of that string?

This is a new talking point that I think deserves our attention.

menduz commented 1 year ago

After a good night sleep, I wrapped my head around the default behaviors of proto3 and your comments about optional/required. SubMessages are still optional but the rest of the fields are considered not-optional in zig unless explicitly specified in the .proto.

The code looks much cleaner and default values now make sense, here are the changes https://github.com/Arwalk/zig-protobuf/pull/17/commits/1404418ba0002b8455c36a6fdba26e33e8572e28

Arwalk commented 1 year ago

I took a look at what has been done since. Always as happy about your contributions. I'll have some time tomorrow afternoon to take a look in more details and answer our current subjects. It's a pleasure to work with you.

Arwalk commented 1 year ago

Ok, i think i checked everything i wanted to check. Let's review your comments/questions

Optional/Required fields In regards to optional fields, as of now, all fields are being emitted as optional unless required is explicitly specified in the .proto file. The wire protocol doesn't break if any of those required are omitted. The for knowing the value are relatively straight forward:

In initialization of a message (zig), pick the default value of the field type (false, 0, null, "", enum=0). Unless a default value is specified in the .proto, in such case, pick that value. In "reading" a message, only change a result.value if it appears in the encoded message. Otherwise use the default value. Similar to vat t: T = undefined; in zig. From my experience, the use of some required fields was not a deal breaker, but the protocol enables us to swap a ?u32 by a u32 and I can see the benefit on that. Would you elaborate on your PoV?

After a good night sleep, I wrapped my head around the default behaviors of proto3 and your comments about optional/required. SubMessages are still optional but the rest of the fields are considered not-optional in zig unless explicitly specified in the .proto.

The code looks much cleaner and default values now make sense, here are the changes https://github.com/Arwalk/zig-protobuf/commit/1404418ba0002b8455c36a6fdba26e33e8572e28

Ok, I checked current implementations and i see where you're coming from. The truth is that i mostly used protocol buffers 2 and i didn't realize current protobuf implementations consider "required" fields as completely optionals. I checked and generated java and python implementations and i noticed that indeed there are default values for required fields now.

So in the end, you are right with your implementation, your handling of required field is correct. My last "problem" with that would be that i'd love if we could generate on the spot an init method that forces to have the required value as parameter (to avoid setting a 0 by default when creating the structure while allowing this default value when decoding), but that might be a bit too much.

I revamped your original code for unions and completed the encoding/decoding part and added extensive tests. Please review that new part.

The only thing i have to say is that i'm sad that i could not find the solution myself when i tried. Good job.

For simplicity, I've leveraged the default behavior for protoc and enabled maps via list of autogenerated entries. This is something I'd like to work a bit more but to close up some work I'd like to defer it for the next PR.

Awwww i liked my implementation :( Joke aside, it's fine. But we can delete the KeyValueTypeTag enum now, it's not used anymore.

Open questions about allocations and code speed blablabla

This is a complex subject, and i don't have an answer for you. I'm not a huge fan of deciding to use an arena allocator, even if understand completely your point. The point of the allocator pattern is to allow the end user to choose his own allocator, even allowing some constrained environment (such as embedded) to make a "virtual" dynamic memory allocator over a pool of stack memory.

I think we should allow ourselves to have some holes in our implementation. Having a functionning library for the most common cases will allow us to publish it and have some more feedback, and maybe even more help in the future. Let's just put a "HERE BE DRAGONS" sign around using recursive messages with dynamic memory fields inside for now.

In conclusion, PR is fine now, i'll just wait some sign that you read this (comment or emoji) to know if you have something more to say, and i'll merge it happily.

Next step will be cleaning up the README, cleaning up the issues, then make an official release.

menduz commented 1 year ago

Thanks for the detailed answer. I think we are good to merge the PR now, and based on your comment, we should.

However, I think the library leaks memory on strings. For data-section static strings it won't. But if we are using any means to allocate arrays of bytes in runtime I think it will.

Probably this should be handled right after this PR, along with the maps.

Maybe we can provide the allocator as a parameter to the deinit function, enabling us to 1) swap ArrayList for ArrayListUnmanaged 2) clean up memory on submessages 3) clear strings if those are not static. I assume there is some heuristic we can leverage to know if a pointer is static memory or dynamic. To prevent allocator.free(const).

The biggest drawback may be that in order to set string values, we may need to have a ManagedString and ManagedBytes structs. Those would have copy semantics, in effect it would be something like this:

 const Message = struct {
-    m: ?[]const u8
+    m: ?ManagedString
     ...
 }
 ...
 /// allocates memory for a string and copies the provided slice
 fn managedStringFromSlice(str: []const u8, allocator: Allocator) !ManagedString {
     return ManagedString{ .slice = try allocator.dupe(u8, &str) };
 }
 ...
 var m = proto.Message.init(allocator);
-m.str = "my string";
+m.str = managedStringFromSlice("my string", allocator);
Arwalk commented 1 year ago

I agree with everything in your last comment. I'll take it into account when i'll do some cleanup on the issue side. I was coming to the conclusion too that we needed something a bit specific for strings.