apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

[Bug]: Direct Runner doesn't use coder registered in registry? #29908

Open hjtran opened 9 months ago

hjtran commented 9 months ago

What happened?

I'm trying to write a coder for an unpicklable object, but when I register it with the coder registry, the direct runner still tries to pickle it. I've created an example in Beam Playground.

Not sure if I'm just missing something trivial here.

Issue Priority

Priority: 3 (minor)

Issue Components

hjtran commented 9 months ago

Might this be related to #18490?

hjtran commented 9 months ago

#18490 was a red herring. The issue isn't exactly with the Python direct runner either. I think the issue is that apache_beam.transforms.util.ReshufflePerKey uses Any type hints, and data typed as Any is encoded with the PickleCoder rather than with the coder registered for the element's type in the coder registry.

tvalentyn commented 8 months ago

Have you tried setting with_output_types / with_input_types explicitly after Create or on Reshuffle?

hjtran commented 8 months ago

Yes, that indeed works. I think the real problem is that when this happens, it's hard to tell why, especially if you expect the coder registered in the registry to always be respected.

I have a limited fix that I haven't gotten around to posting yet. It narrows the type definitions in ReshufflePerKey for global windows, which fixes part of the issue.