Open gonzojive opened 2 years ago
Hey @gonzojive thanks for the suggestion - broader use of generics is absolutely something we want to see more of!
This doesn't do everything you're talking about here, but I did add generic registration functions which will be part of the next release (2.40). The docs aren't merged yet (here if you want a sneak peak), but it should allow you to register your DoFn, register all associated types, and do some performance optimization using reflection with a single call.
This is just a start, and we should absolutely keep pushing usage of generics forward to improve the experience here.
I like the general thrust of this suggestion! There's plenty to do to use generics with beam (see Beam Java's probable over-abundance of Generics). Currently we have the goal to carefully add generics without breaking existing code.
First as Danny points out is the generic registrations, which make it easier to improve performance by removing per-call execution time reflection.
We'd also like to introduce a "real" KV type, and make sure that's understood by the whole framework. No more splitting KVs (except to support older code). Similarly, generic iterator and emitters types would let the Go compiler do more work for inlining and efficient calls. While we're probably going to need to design our own "emitter" type or interface, it feels inevitable that Iterators/Iterables will have an official Go solution by Go 1.20, so we'd want to wait for that to become locked down instead of inventing our own interface/type.
Making the "front end" generic with typed PCollections is difficult once one gets to features like Side Inputs. I don't see fully compile time checked side inputs in the near term.
Either way, it's important to note that while the SDK's core will take time to include generic awareness, user DoFns and transforms will be able to take advantage of them much sooner.
All that said, something like the code in this example, could be added to a package under ...pkg/beam/x/.. so we can allow users to experiment with them before we fully commit to a given wrapper/approach.
Thanks for sharing your thinking.
One issue I punted on is the usage of any
in T any
below:
func Create[T any](scope beam.Scope, obj ...T) Collection[T] {
asInterfaces := lo.Map(obj, func(elem T, _ int) any { return elem })
untyped := beam.Create(scope, asInterfaces...)
return Collection[T]{untyped}
}
It seems like the signature should be more like this to ensure things that can be encoded are being passed around:
func Create[T Serializable](scope beam.Scope, obj ...T) Collection[T] { /* ... */ }
Agreed, but unfortunately "Serializable" is not a real concept WRT Go, let alone as generics.
As it stands, primitive types, and the Exported fields of structs (so, exactly like JSON's restrictions) are efficiently encoded using Beam schemas. Interfaces, Functions, Channels are not encodable because there's no good way of doing so, without the registration lookup tables (largely handled going forward with the register package Danny mentioned).
Anything is serializable with enough prepwork though, via lookup tables or enums, or pre-known keys. It's simply difficult to make things work on distributed workers with only runtime knowledge otherwise.
Aside from deferring to the standard proto.Marshaller for protocol buffer messages (determined via the proto interfaces), users can also register their own arbitrary coder functions (which also gets around the "exported fields" restrictions). You've already filed a task for us to improve the documentation for that escape hatch (#21930).
Just a note WRT the register
package. With Go 1.21, type inference is improved sufficiently that one no longer needs to explicitly list out ProcessElement types in the register.DoFnNxM calls. Registrations are still required, but that should make things simpler.
To slightly contribute to the wrapper library, this function adds a bit more runtime type safety for converting beam.PCollection
s:
// CollectionOf returns a typed PCollection converted from the given
// beam.PCollection. It exists for documenting purposes. If T doesn't match
// with what is actually in pc, the function panics.
func CollectionOf[T any](pc beam.PCollection) PCollection[T] {
var z T
if reflect.TypeOf(z) != pc.Type().Type() {
panic(fmt.Sprintf(
"CollectionOf given unexpected PCollection type: expected %v, got %v",
reflect.TypeOf(z), pc.Type().Type()))
}
return PCollection[T](pc)
}
Any progress on this mattern? I would love to contribute!
@luillyfe No, but also Yes!
No in the sense of attempting to do so within the current SDK framework. It either involves having to maintain separate but compatible layers, and mixing and matching with existing DoFns.
The best contribution would be to attempt to get a real generic KV in, as mentioned in https://github.com/apache/beam/issues/21929#issuecomment-1159121423. This would mean ensuring that
If someone goes through all of that, which are table stakes for the feature, then we can get it in. Note, this doesn't have to be in one giant PR, we just can't have a beam.KV type exist in the beam user surface package without it being useful. We don't want to break people.
So that's the No case.
What I mean by Yes is that I've been exploring the idea of a Generic forward Beam Go SDK in my own personal repo. I gave a talk about it at Beam Summit this year, but the recording isn't up yet in the playlist (titled Beam SDKs don't have to look the same
).
You can see a bit of it here: https://github.com/lostluck/experimental/blob/master/altbeams/allinone2/allinone2.go#L137
An older benchmark run: https://github.com/lostluck/experimental/blob/master/altbeams/allinone2/beam/beam_test.go#L129
It's not quite ready yet for others to contribute to, but maybe I can finally get it out and basically batch useful. The remaining things to do are a BlobIO/FileIO, a TextIO that uses that, windowing (though they are passed around properly), initial Documentation and examples, and of course moving it out of my experimental repo into it's own stand alone spot. But I do have SplittableDoFns working, and have a strong opinion on the affordances allowed.
I mention this here since handling the KVs as trickier than I expected, and influenced much of the coder/type handling. It's possible for us to integrate some of these ideas into the existing SDK, but it'll be tricky.
I didn't really set out to actually make a replacement SDK, but to explore ideas and see what would work, but having a low per-element overhead and zero allocations on the hot path is compelling by itself.
What would you like to happen?
Generics make beam code more readable.
Type registration is somewhat annoying because multiple type registrations are required for a single generic DoFn[T, U, ...], one for each concrete type required for a pipeline.
Here's a little library I wrote to play around with this. It works but doesn't cover the full API and isn't super thoughtful/consistent:
Issue Priority
Priority: 2
Issue Component
Component: sdk-go