Open lostluck opened 1 year ago
Can you assign this task to me? I want to try.
@tobehardest Done! In the future, you can self assign an issue by commenting .take-issue
and a bot will handle it. See the Beam contribution guide for more! https://beam.apache.org/contribute
What needs to happen?
A bug was found when a user converts arbitrary byte sequences to strings to get around being unable to use []byte as a key to a map. This leads to these strings sometimes being non-UTF-8 compliant, which will break on encoding/decoding.
E.g., converting byte sequences like [2 208 15] or [2 239 191 189 15] to strings simply can't be round-tripped correctly as JSON, so the encoded and decoded values do not match.
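The mismatch can be reproduced outside of Beam with a small sketch that uses only the standard library. It assumes the DoFn encoding behaves like Go's encoding/json for strings, which coerces invalid UTF-8 bytes to the Unicode replacement rune when encoding; the roundTrip helper is a hypothetical name used for illustration and is not part of the Beam SDK.

```go
package main

import (
	"encoding/json"
	"fmt"
	"unicode/utf8"
)

// roundTrip is a hypothetical helper: it encodes s as JSON with
// encoding/json and decodes the result back into a string.
func roundTrip(s string) (string, error) {
	b, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	var out string
	if err := json.Unmarshal(b, &out); err != nil {
		return "", err
	}
	return out, nil
}

func main() {
	// A byte sequence from the issue, converted to a string so it can
	// be used as a map key. Byte 208 starts a multi-byte sequence that
	// is never completed, so the string is not valid UTF-8.
	key := string([]byte{2, 208, 15})

	got, err := roundTrip(key)
	if err != nil {
		panic(err)
	}

	fmt.Println("valid UTF-8:", utf8.ValidString(key))
	fmt.Println("before:", []byte(key))
	fmt.Println("after: ", []byte(got))
	fmt.Println("equal: ", key == got)
}
```

In this sketch no error is returned at any point: the value is silently changed, which is why a check before pipeline execution would be useful.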
The check would be to recursively examine every exported field in a structural DoFn for use of string, and check whether each value is UTF-8 compliant. The check could be skipped for subtypes that implement the MarshalJSON and UnmarshalJSON interface methods.
The vet runner, which can be electively run before any given pipeline with the --beam_strict flag, would be the appropriate place to add this sort of checking, to avoid running more expensive checks 100% of the time.
A complete fix would also add documentation to the website and GoDoc around the JSON encoding of DoFns, in particular calling out this issue (that is, emphasizing that strings must be valid UTF-8).
Issue Priority
Priority: 3 (nice-to-have improvement)
Issue Components