apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

[Bug]: ProtoCoder fails on super large protobufs #28868

Open snallapa opened 1 year ago

snallapa commented 1 year ago

What happened?

ProtoCoder can't encode messages larger than Integer.MAX_VALUE bytes (2,147,483,647 bytes, about 2.147 GB). This is because the method at https://github.com/protocolbuffers/protobuf/blob/d41b9e7a4b2834572cc396c41809fd46ea5eb5d4/java/core/src/main/java/com/google/protobuf/MessageLite.java#L55 can return a negative value for messages this large. There seems to be a lot of undefined behavior when protobufs are this large.
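
To illustrate how a size can come back negative, here is a minimal sketch of 32-bit overflow. It assumes the linked method is the `int`-returning `getSerializedSize()`; the class and helper names are hypothetical and this is not the protobuf implementation, only the arithmetic behind the symptom.

```java
public class SizeOverflowDemo {
    // Hypothetical helper: sums field sizes into an int, as an int-typed size would.
    static int totalSizeAsInt(int[] fieldSizes) {
        int total = 0;
        for (int s : fieldSizes) {
            total += s; // silently wraps once the sum passes Integer.MAX_VALUE
        }
        return total;
    }

    // Same sum in a long, which has room for sizes above 2^31 - 1.
    static long totalSizeAsLong(int[] fieldSizes) {
        long total = 0L;
        for (int s : fieldSizes) {
            total += s;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] sizes = {1_500_000_000, 1_500_000_000};
        System.out.println(totalSizeAsInt(sizes));  // prints -1294967296
        System.out.println(totalSizeAsLong(sizes)); // prints 3000000000
    }
}
```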

However, it seems that ProtoCoder could at least check the size before it calls writeDelimitedTo on the output stream, in this method: https://github.com/apache/beam/blob/b87629928c92b3df4ba88fb42184d5ed505dbe2e/sdks/java/extensions/protobuf/src/main/java/org/apache/beam/sdk/extensions/protobuf/ProtoCoder.java#L190
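
One possible reading of that size check, as a rough sketch rather than the actual ProtoCoder code: fail fast with a clear error when the reported size is negative. `SizedMessage` is a hypothetical stand-in for the two `MessageLite` methods involved, and `encodeChecked` is an invented name. Note the check is only partial, since an overflowed size can also wrap back around to a positive number.

```java
import java.io.IOException;
import java.io.OutputStream;

public class SizeCheckedEncode {
    /** Hypothetical stand-in for the MessageLite methods used in this sketch. */
    interface SizedMessage {
        int getSerializedSize();
        void writeDelimitedTo(OutputStream out) throws IOException;
    }

    /** Writes the message only if its reported size has not overflowed to a negative int. */
    static void encodeChecked(SizedMessage value, OutputStream out) throws IOException {
        int size = value.getSerializedSize();
        if (size < 0) {
            // A negative size means the true size exceeded Integer.MAX_VALUE.
            throw new IOException(
                "Message too large to encode: serialized size overflowed to " + size);
        }
        value.writeDelimitedTo(out);
    }
}
```

This turns the failure into an explicit error at encode time; it does not make oversized messages encodable.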

Naturally, no one should have protobufs this large anyway, but here I am!

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

liferoad commented 1 year ago

cc @robertwb @kennknowles

kennknowles commented 1 year ago

@snallapa want to contribute the fix? 😄 😄 😄

snallapa commented 1 year ago

@kennknowles if I find the time in the next week or so, I will try to!