apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0
7.81k stars 4.23k forks

[Task][Go SDK]: Add String UTF8 check to vet runner for serialization. #24949

Open lostluck opened 1 year ago

lostluck commented 1 year ago

What needs to happen?

A bug was found where a user was converting arbitrary byte sequences to strings to get around being unable to use []byte as a map key. The resulting strings are sometimes not valid UTF-8, which breaks on encoding/decoding.

E.g., byte sequences like [2 208 15] or [2 239 191 189 15], when converted to strings, simply can't be round-tripped correctly as JSON, so the encoded and decoded values do not match.
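To make the failure concrete, here is a minimal sketch using only the standard library's encoding/json, independent of Beam. The helper name roundTrip is made up for illustration. Go's encoding/json documents that invalid bytes in a string are replaced with the Unicode replacement rune (U+FFFD) on encoding, so the decoded string differs from the input without any error being reported.

```go
package main

import (
	"encoding/json"
	"fmt"
	"unicode/utf8"
)

// roundTrip is a hypothetical helper: it encodes s as JSON and decodes it
// back, returning the decoded string.
func roundTrip(s string) (string, error) {
	enc, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	var out string
	if err := json.Unmarshal(enc, &out); err != nil {
		return "", err
	}
	return out, nil
}

func main() {
	// 208 (0xD0) starts a two-byte UTF-8 sequence, but 15 is not a
	// continuation byte, so this string is not valid UTF-8.
	in := string([]byte{2, 208, 15})

	out, err := roundTrip(in)
	if err != nil {
		panic(err)
	}

	fmt.Println("input valid UTF-8:", utf8.ValidString(in))
	fmt.Println("input bytes:  ", []byte(in))
	fmt.Println("decoded bytes:", []byte(out))
	fmt.Println("round trip preserved value:", in == out)
}
```

Neither json.Marshal nor json.Unmarshal returns an error here, which is why this is easy to miss without an explicit check.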

The check would recursively examine every exported field in a structural DoFn for uses of string, and verify that each string value is valid UTF-8. The check could be skipped for subtypes that implement the MarshalJSON and UnmarshalJSON interface methods.

The vet runner, which can optionally be run before any given pipeline with the --beam_strict flag, would be the appropriate place to add this sort of check, since that avoids running more expensive checks 100% of the time.


A complete fix would also add documentation to the website and GoDoc around the JSON encoding of DoFns, in particular calling out this issue (that is, emphasizing that strings must be valid UTF-8).

Issue Priority

Priority: 3 (nice-to-have improvement)

Issue Components

tobehardest commented 1 year ago

Can you assign this task to me? I want to try.

lostluck commented 1 year ago

@tobehardest Done! In the future, you can self assign an issue by commenting .take-issue and a bot will handle it. See the Beam contribution guide for more! https://beam.apache.org/contribute