apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

[Bug]: ProtoCoder fails on super large protobufs #28868

Open snallapa opened 12 months ago

snallapa commented 12 months ago

What happened?

ProtoCoder can't encode messages larger than Integer.MAX_VALUE bytes (2,147,483,647 bytes, about 2.15 GB). This is because the size method at https://github.com/protocolbuffers/protobuf/blob/d41b9e7a4b2834572cc396c41809fd46ea5eb5d4/java/core/src/main/java/com/google/protobuf/MessageLite.java#L55 returns an int, so for a message this large it can return a negative value. There seems to be a lot of undefined behavior once a protobuf gets this large.
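To illustrate the overflow, here is a minimal sketch in plain Java. It is not protobuf's actual size computation; `SizeOverflowDemo` and its methods are hypothetical names, and the only point is that a byte count accumulated in an `int` wraps to a negative number once it passes Integer.MAX_VALUE.

```java
public class SizeOverflowDemo {
    // Hypothetical stand-in for a size accumulator that uses int arithmetic.
    static int sumSizesAsInt(int[] fieldSizes) {
        int total = 0;
        for (int size : fieldSizes) {
            total += size; // silently wraps past Integer.MAX_VALUE
        }
        return total;
    }

    // The same sum in long arithmetic, which does not wrap at this scale.
    static long sumSizesAsLong(int[] fieldSizes) {
        long total = 0L;
        for (int size : fieldSizes) {
            total += size;
        }
        return total;
    }

    public static void main(String[] args) {
        // Two 1.5 GB fields: each fits in an int, their sum does not.
        int[] sizes = {1_500_000_000, 1_500_000_000};
        System.out.println(sumSizesAsInt(sizes));  // -1294967296
        System.out.println(sumSizesAsLong(sizes)); // 3000000000
    }
}
```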

However, it seems that ProtoCoder could at least work around this by checking the size and calling writeDelimitedTo on the output stream in this method: https://github.com/apache/beam/blob/b87629928c92b3df4ba88fb42184d5ed505dbe2e/sdks/java/extensions/protobuf/src/main/java/org/apache/beam/sdk/extensions/protobuf/ProtoCoder.java#L190
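One possible reading of that size check, sketched as a fail-fast guard. This is not Beam's ProtoCoder code: `GuardedEncodeSketch`, `SizedMessage`, `encodeGuarded`, and `fakeMessage` are hypothetical names, and the interface only mirrors the shape of a message that reports an `int` size and writes itself to a stream. It also assumes the overflow shows up as a negative size.

```java
import java.io.IOException;
import java.io.OutputStream;

public class GuardedEncodeSketch {
    /** Hypothetical stand-in for the two message methods the guard needs. */
    interface SizedMessage {
        int getSerializedSize();
        void writeTo(OutputStream out) throws IOException;
    }

    /** Refuses to encode when the reported size has overflowed into a negative int. */
    static void encodeGuarded(SizedMessage msg, OutputStream out) throws IOException {
        int size = msg.getSerializedSize();
        if (size < 0) {
            throw new IOException(
                "Serialized size overflowed int (got " + size
                    + "); message is larger than Integer.MAX_VALUE bytes");
        }
        msg.writeTo(out);
    }

    /** Test helper: a fake message that reports a given size and writes a fixed payload. */
    static SizedMessage fakeMessage(int reportedSize, byte[] payload) {
        return new SizedMessage() {
            @Override
            public int getSerializedSize() {
                return reportedSize;
            }

            @Override
            public void writeTo(OutputStream out) throws IOException {
                out.write(payload);
            }
        };
    }
}
```

A guard like this only catches sizes that wrap to a negative value. A message large enough to wrap all the way back to a positive int would still slip through, so it turns a confusing failure into a clear error rather than fully solving the problem.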

Naturally, no one should have protobufs this large anyway, but here I am!

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

liferoad commented 12 months ago

cc @robertwb @kennknowles

kennknowles commented 12 months ago

@snallapa want to contribute the fix? 😄 😄 😄

snallapa commented 11 months ago

@kennknowles if I find the time in the next week or so I will try to!