apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

[Bug]: sdk-go anonymous DoFn function sometimes not working #29209

Open jhw0604 opened 10 months ago

jhw0604 commented 10 months ago

What happened?

Anonymous DoFn functions sometimes do not work.

On Windows, when the program is started with go run ., anonymous DoFn functions do not work.

When a compiled exe file is started as filename, they also do not work, but starting it as filename.exe works.

This is because in go/pkg/core/util/symtab/symtab.go, in the symbolData function, the call pf, err := pe.NewFile(f) on line 92 fails without returning an error.

So how about something like this:

func TryParDo(s Scope, dofn any, col PCollection, opts ...Option) ([]PCollection, error) {
    RegisterDoFn(dofn) // automatically register the DoFn
    ...
}

and call beam.Init() before the beam.Run function.

With this, anonymous DoFn functions would work, and RegisterDoFn would no longer be needed, even for named functions.

The problem is that this is not compatible with existing code because of when Init is called, so I think it would already be useful if just this problem were solved.

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

lostluck commented 10 months ago

First comment: This might be an issue with Beam Go on Windows (without using WSL). To my knowledge, we just haven't done any Beam Go development on Windows, and thus haven't implemented support for handling its specific behaviors.

Basically, without registration, we do a magic lookup in the binary's DWARF symbol table for the function name. From your output, we don't appear to handle looking for the EXE at all.

That would happen here:

https://github.com/apache/beam/blob/03c811e296e96ae682f53b877e9bc4c2820b36e7/sdks/go/pkg/beam/core/runtime/symbols.go#L56
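As a rough illustration of the Windows behavior reported above (starting the binary as filename fails, filename.exe works), here is a minimal sketch of normalizing an executable path before opening it for symbol lookup. The helper name withExeSuffix is hypothetical and is not part of the Beam SDK; the sketch also assumes the lookup fails because the path it receives lacks the .exe suffix, which the thread does not confirm.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// withExeSuffix is a hypothetical helper, not a Beam API.
// For a Windows target it appends ".exe" when the path does not already
// end in that extension (compared case-insensitively). For any other
// target OS the path is returned unchanged.
func withExeSuffix(path, goos string) string {
	if goos != "windows" {
		return path
	}
	if strings.EqualFold(filepath.Ext(path), ".exe") {
		return path
	}
	return path + ".exe"
}

func main() {
	fmt.Println(withExeSuffix("pipeline", "windows"))     // pipeline.exe
	fmt.Println(withExeSuffix("pipeline.exe", "windows")) // pipeline.exe
	fmt.Println(withExeSuffix("pipeline", "linux"))       // pipeline
}
```

In real code the target OS would come from runtime.GOOS, and the standard library's os.Executable may be a more reliable source for the path of the running binary than the command-line name.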


Second:

Essentially, due to the current structure of the SDK, we have beam.Init() called as early as possible in the binary. The point of this is to reduce the amount of code executed on workers that is only needed for pipeline construction. So any registrations and such must happen before that point. The simplest approach, and the one we recommend, is to do them at package init time.

One can have automatic construction-time registration, which largely means that the pipeline graph is constructed (and all that code is executed) on every worker, or one can allow for arbitrary construction-time work. Trying to satisfy both becomes awkward because there end up being two different ways of doing things, which makes the SDK (even) harder to learn.

That said, I'm skeptical that the concept of beam.Init was the right approach generally for Beam Go, especially with the larger focus on declarative Beam via YAML and similar. It wouldn't be my first choice for a next version of the Go SDK. (And indeed it's not: I have an increasingly complete research prototype Go SDK that's trying out the "don't register, just rebuild" approach... It's not yet ready for sharing, though.)