apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

Rust SDK #21089

Open damccorm opened 2 years ago

damccorm commented 2 years ago

It would be great to have Rust SDK in order to create very high-performant yet safe pipelines.

Imported from Jira BEAM-12658. Original Jira may contain additional context.

gauravchak commented 1 year ago

Thanks @bvolpato for pointing me to this issue (https://github.com/apache/beam/issues/21089). This would be extremely useful for us at Discord. We use Rust heavily, and we use Beam (via Python) for data processing. A Rust SDK would be very beneficial for reducing training-serving skew.

brucearctor commented 1 year ago

@gauravchak, any... contributions welcome! Do advise if you need some resources to get started, for example a related talk: https://www.youtube.com/watch?v=VsGQ2LFeTHY

nivaldoh commented 1 year ago

Hi, I would like to express interest in working on the Rust SDK. I'll create an incubator fork soon.

nivaldoh commented 1 year ago

.take-issue

nivaldoh commented 1 year ago

Work is underway here. Progress may be slow, and early code will look quite rough. I'll be really happy to receive any feedback or collaboration opportunities.

dofinn commented 1 year ago

@nivaldoh I've been looking for a reason to learn Rust. Happy to take on any housekeeping work that would otherwise slow you down.

nivaldoh commented 1 year ago

@dofinn I really appreciate the offer. Currently we have a few TODOs with improvement ideas, which I'd be happy to describe in more detail. If you're interested, we could also coordinate effort on the larger tasks (which I'm planning to organize in the main README file). Feel free to open a PR directly in the fork or reach out to me by email (nivaldo.humbertoo@gmail.com) if you'd like.

esadler-hbo commented 1 year ago

@nivaldoh thanks for doing this!

I will be an early adopter when you are ready for that.

nivaldoh commented 1 year ago

@esadler-hbo thanks for the support!

There's still a lot of work to be done, and I try to keep an updated roadmap here if you or anyone else is interested in a quick overview of the current state of the implementation.

In particular, the user API is starting to take shape, and the snippet below (inspired by the new patterns set by the TypeScript SDK) is now functional end-to-end:

let runner = DirectRunner::new();

// Impulse won't be exposed to the user, but it serves as a mock transform for now
let transform = Impulse::new();

runner.run(|root| root.apply(transform)).await;

To anyone interested, any early input on this format would be highly appreciated, as this is the current foundation that I'll be using for everything else. Any other contributions (including early code reviews, even if partial) would be awesome as well.
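For readers unfamiliar with this closure-based style, here is a minimal, synchronous sketch of how such an API could be shaped. It is not the fork's actual implementation: `PValue`, the `PTransform` trait, its `name` and `expand` methods, and the recorded list of steps are all assumptions made for illustration, and unlike the snippet above this toy version is not async.

```rust
// Hypothetical sketch: none of these definitions are taken from the fork.

/// Stand-in for a node in the pipeline graph (roughly, a PCollection).
#[derive(Debug, Clone, PartialEq)]
pub struct PValue {
    /// Names of the transforms applied so far, in order.
    pub steps: Vec<String>,
}

/// Something that can be applied to a PValue to produce a new PValue.
pub trait PTransform {
    fn name(&self) -> String;

    /// Default expansion: record this transform's name on a copy of the input.
    fn expand(&self, input: &PValue) -> PValue {
        let mut steps = input.steps.clone();
        steps.push(self.name());
        PValue { steps }
    }
}

impl PValue {
    pub fn apply<T: PTransform>(&self, transform: T) -> PValue {
        transform.expand(self)
    }
}

/// Mock transform, mirroring the role Impulse plays in the snippet above.
pub struct Impulse;

impl Impulse {
    pub fn new() -> Self {
        Impulse
    }
}

impl PTransform for Impulse {
    fn name(&self) -> String {
        "Impulse".to_string()
    }
}

pub struct DirectRunner;

impl DirectRunner {
    pub fn new() -> Self {
        DirectRunner
    }

    /// Hands an empty root to the user's closure and returns whatever
    /// the closure built. A real runner would then execute the graph.
    pub fn run<F>(&self, build: F) -> PValue
    where
        F: FnOnce(PValue) -> PValue,
    {
        let root = PValue { steps: Vec::new() };
        build(root)
    }
}
```

With these definitions, `DirectRunner::new().run(|root| root.apply(Impulse::new()))` returns a `PValue` whose `steps` contains the single entry `"Impulse"`. Passing the root into a closure keeps pipeline construction scoped to the runner that will execute it.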

TommyCpp commented 1 year ago

I'd also like to help! I have some experience with Rust but am pretty new to Beam and to large-scale data processing frameworks.

TommyCpp commented 1 year ago

@nivaldoh I think reading the doc will only get me so far and I should probably start working on some implementation and see how it goes. Is there any coder/transform or other stuff you want my help with?

robertwb commented 1 year ago

I just saw this. There's actually an effort to build a Rust SDK this week from the Dataflow team. What we have is at https://github.com/kennknowles/beam/tree/rust/sdks/rust ; it would be great to combine efforts, though @nivaldoh's fork looks much further along.

brucearctor commented 1 year ago

Awesome! 100% the right move to combine efforts.

brucearctor commented 1 year ago

@robertwb and @kennknowles -- I'm glad you're looking into this. I have advised Nivaldo on strategies, with an eye on getting this to be something useful enough to warrant being merged into the proper project as another sdk. Your experience/knowledge attending to this will go a long way!

Maybe you two can dig a little into https://github.com/nivaldoh/beam/tree/rust_sdk and @nivaldoh can look at https://github.com/kennknowles/beam/tree/rust/sdks/rust -- suggesting that we wind up working on only one or the other.

I wonder if it'd be easier to keep @kennknowles 's eyes on progress if developed in his repo? But, if the other is much further along, might it be better to jump into that [ @nivaldoh -- I assume you can give @kennknowles , @robertwb , and others relevant merge/commit permissions in your repo -- in things Beam they know their stuff and should absolutely be trusted ]. Else, it might then depend on the merge/migration path to get relevant bits into https://github.com/kennknowles/beam/tree/rust/sdks/rust, which I also imagine @nivaldoh might be open to taking on, as that could help solidify understanding and implementation.

I'll try to have a look to compare the repos over the weekend or sometime next week.

@nivaldoh - please advise on your thoughts/inclinations [ I've been sorta speaking for you, based on my read of you, motivation, inclinations when we've connected ]

robertwb commented 1 year ago

IMHO, @nivaldoh's repo is further along, and better structured, so I think it makes sense to start there. In the next day or two we'll probably be pushing willy-nilly to the one at kennknowles, in the spirit of the hackathon to explore ideas, but next week I suggest we start creating pull requests to https://github.com/nivaldoh/beam/tree/rust_sdk to carry anything over that has value (and isn't already in the latter) and continue there.

brucearctor commented 1 year ago

Sounds like a plan!

nivaldoh commented 1 year ago

@TommyCpp Besides some of the smaller TODOs spread out around the codebase, adding new coders such as DoubleCoder (mirroring the TypeScript SDK) could be a great option, since their structure is a bit more organized at the moment and they can be reliably tested in isolation. However, considering the current discussions, the Dataflow team is going to implement a lot of things in an upcoming hackathon and will likely introduce different approaches, so it might be better to wait until then.
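As a rough idea of what such a coder could look like, here is a self-contained sketch. The `Coder` trait and its method signatures are hypothetical and not taken from the fork. The byte layout (8 bytes, big-endian IEEE 754) is assumed to match Beam's standard double coder and should be checked against the other SDKs before relying on it.

```rust
use std::convert::TryInto;

/// Hypothetical coder trait; the fork's real trait may look different.
pub trait Coder {
    type Item;

    /// Appends the encoded form of `value` to `out`.
    fn encode(&self, value: &Self::Item, out: &mut Vec<u8>);

    /// Decodes one value from the start of `bytes`.
    fn decode(&self, bytes: &[u8]) -> Result<Self::Item, String>;
}

pub struct DoubleCoder;

impl Coder for DoubleCoder {
    type Item = f64;

    fn encode(&self, value: &f64, out: &mut Vec<u8>) {
        // Assumption: 8 bytes, big-endian.
        out.extend_from_slice(&value.to_be_bytes());
    }

    fn decode(&self, bytes: &[u8]) -> Result<f64, String> {
        // Take exactly the first 8 bytes, or report how many were available.
        let raw: [u8; 8] = bytes
            .get(..8)
            .ok_or_else(|| format!("expected 8 bytes, got {}", bytes.len()))?
            .try_into()
            .map_err(|_| "expected a slice of length 8".to_string())?;
        Ok(f64::from_be_bytes(raw))
    }
}
```

Because the coder has no dependencies on the rest of the pipeline, a round-trip test (encode a value, decode the bytes, compare) is enough to exercise it in isolation.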

@brucearctor I agree with all your points. Additionally, @robertwb and @kennknowles need no introduction for me, so I've already sent both of them invites for collaborator access on my repo. What I really intend is to grant them owner permissions but I'm not completely familiar with this sort of thing on GitHub, so please let me know if any access is still missing after this (as well as anyone else who might require access).

I'm quite happy that we might be able to use what I've done so far as the base repo for the initial stages of the Rust SDK, but if it turns out to be a better idea to move the code there into https://github.com/kennknowles/beam/tree/rust/sdks/rust instead as @brucearctor mentioned, I'd be more than happy to make any adaptations necessary so that we may continue from there.

I'll also keep an eye on the progress there over the next few days to see if there's anything that could be changed in advance inside my repo. There are plenty of minor (such as the current module structure forcing me to import certain libraries in more than one Cargo file) and not so minor (such as downcasting coders from Any and using their literal TypeId as a key) things that need to be restructured soon over here, so I'll be looking for ideas to improve them and speed up the merge as well.
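For readers who have not seen the pattern, the sketch below shows what storing coders as `Any` and looking them up by `TypeId` generally looks like in Rust. It illustrates the pattern only: `CoderRegistry`, `VarIntCoder` and `BytesCoder` as written here are hypothetical stand-ins, not the code in my repo.

```rust
use std::any::{Any, TypeId};
use std::collections::HashMap;

/// Hypothetical stand-in coder types.
#[derive(Debug, PartialEq)]
pub struct VarIntCoder;

#[derive(Debug, PartialEq)]
pub struct BytesCoder;

/// Type-erased registry: each coder is stored as `Box<dyn Any>` under
/// the TypeId of its concrete type.
pub struct CoderRegistry {
    coders: HashMap<TypeId, Box<dyn Any>>,
}

impl CoderRegistry {
    pub fn new() -> Self {
        CoderRegistry {
            coders: HashMap::new(),
        }
    }

    pub fn register<C: Any>(&mut self, coder: C) {
        self.coders.insert(TypeId::of::<C>(), Box::new(coder));
    }

    /// Returns None if no coder of type `C` was registered.
    /// The downcast is only checked at run time.
    pub fn get<C: Any>(&self) -> Option<&C> {
        self.coders
            .get(&TypeId::of::<C>())
            .and_then(|boxed| boxed.downcast_ref::<C>())
    }
}
```

The drawback is visible in `get`: asking for a coder that was never registered compiles fine and only yields `None` when the program runs.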

sjvanrossum commented 1 year ago

@nivaldoh As promised on the dev thread, I've just opened a PR at https://github.com/nivaldoh/beam/pull/20 with some worker code changes as well as a container and boot script based on the existing SDK containers. I assumed that Rust pipelines would typically be statically compiled like Go pipelines, so the boot script only looks for a single artifact file at the moment.

The binaries must match between the launcher and worker if we were to use serde_traitobject to serialize the DoFns; I've got some additional changes coming up to provide some scaffolding for that. The user binary needs to be able to switch between pipeline construction and pipeline execution mode, so there's an init function, much like the one the Go SDK requires, that has to run soon after the binary is started. That init function needs to be in a different place, but that would require restructuring the crates a bit, I think.

Happy to sync on that at some point. I think most of the framework code could live in an apache-beam crate, and optional features could live in separate crates, e.g. apache-beam-io-gcp/aws/azure. The worker code I had started on uses a concurrent cache, so that we don't need to lock on the worker to interact with the caches and so that we can expire entries in the cache like the Java SDK does.

Looking forward to continuing to work on this with you!
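A minimal sketch of the mode switch described above, assuming the worker is started with some marker on the command line. The `--worker` flag, the `Mode` enum, and `init` as written here are hypothetical; the Go SDK's mechanism and the code in the PR may differ.

```rust
/// What the user binary should do after start-up.
#[derive(Debug, PartialEq)]
pub enum Mode {
    /// Build the pipeline graph and submit it to a runner.
    Construct,
    /// Act as an SDK worker and execute the pipeline.
    Execute,
}

/// Decides the mode from the given arguments.
/// Assumption: a hypothetical `--worker` flag marks execution mode.
pub fn detect_mode<I, S>(args: I) -> Mode
where
    I: IntoIterator<Item = S>,
    S: AsRef<str>,
{
    if args.into_iter().any(|arg| arg.as_ref() == "--worker") {
        Mode::Execute
    } else {
        Mode::Construct
    }
}

/// Hypothetical init hook, meant to be called first thing in `main`
/// so that the same binary can serve as launcher and worker.
pub fn init() -> Mode {
    // Skip the program name, then inspect the remaining arguments.
    detect_mode(std::env::args().skip(1))
}
```

A user's `main` would call `init()` and branch on the result, so a single statically compiled artifact covers both roles.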

nivaldoh commented 1 year ago

@sjvanrossum Thanks a lot, the PR is now merged.

I had other issues with the current module/crate structure as I mentioned in my previous comment, and with the point you brought up about the binaries I think this is a good time to change that. I think the crate structure you proposed makes a lot of sense, so that's what I'll be aiming for.

I'll start looking into this in a more general manner to join all current modules into a single crate, but please let me know if this would disrupt or heavily overlap with the upcoming changes you mentioned.

Looking forward to continuing to work with you as well!

Miuler commented 1 year ago

I'm interested in helping. I'm still new to Rust, but I'm already making my first contributions to the Java SDK. I wanted to do the same in Go, but seeing that you're getting started in Rust, I'd like to join you.

laysakura commented 1 year ago

@nivaldoh I'm also interested in using and contributing to https://github.com/nivaldoh/beam/tree/rust_sdk/sdks/rust.

I am the author of SpringQL, an in-memory, single-node stream processor written in Rust. I'd like to support Beam as a programming model for a newer version of SpringQL.

I will try to integrate https://github.com/nivaldoh/beam/tree/rust_sdk/sdks/rust with SpringQL and make the necessary changes to both repositories.

laysakura commented 1 year ago

Unfortunately, it seems that @nivaldoh's repository is inactive as of February 1st, 2023. There are 5 pull requests that have not been reviewed or merged.

image

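To address this issue, I have created a fork of the repository. In my fork, I have:

  • manually merged a topic branch from @robertwb
  • (wip) stopped using Any and instead used generics for the input and output parameters of PTransform
  • made many other refactorings to make the code more Rust-like

I welcome any contributions to this repository.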

brucearctor commented 1 year ago

@laysakura -- thanks for keeping this moving!

brucearctor commented 1 year ago

@nivaldoh - thanks for getting it started, and please continue to collaborate as makes sense

dahlbaek commented 1 year ago

I'm interested in helping. I have some experience with Rust and the Beam Python SDK, along with previous experience with big data frameworks like Spark and Scalding.

It would be awesome to have some guidance on how/where to get started contributing 🤔 Should one just grep for TODOs in the fork by @laysakura and submit PRs for review? Or maintain one's own fork and submit PRs to the fork by @nivaldoh?

brucearctor commented 1 year ago

@dahlbaek -- officially your questions are outside the scope of Beam project governance, since this is happening outside of the organization/official-repos.

To try to help keep things moving --> based on recent lack of activity from @nivaldoh , it seems more likely that development will occur in @laysakura 's fork. Probably TODOs there, and/or maybe @laysakura will add some GH Issues or have other suggestions for concrete things that are bite-size enough for individuals to take on. In general, I believe the SDK will come together, so I would also imagine there is no shortage of things that could be accomplished around improved testing, automation, etc [ not to mention bug work, feature development, and more ]. I imagine PRs would be welcomed by @laysakura ... but there could always be conversations in issues in https://github.com/laysakura/beam/tree/rust_sdk ...

All: I'm far from much of a Rust developer, but am happy to do what I can to ensure smooth collaboration and that we can eventually get this merged as a proper Beam Rust SDK!

laysakura commented 1 year ago

I'm happy to receive help from @dahlbaek. I'll start the conversation on https://github.com/laysakura/beam/issues/1.

@brucearctor, I appreciate your assistance. We will report our progress here. I also hope to collaborate with @nivaldoh again. @nivaldoh, if you become interested in the Rust SDK again, I would be happy to hear from you.

Miuler commented 1 year ago

Unfortunately, it seems that @nivaldoh's repository is inactive as of February 1st, 2023. There are 5 pull requests that have not been reviewed or merged.

image

To address this issue, I have created a fork of the repository. In my fork, I have:

  • manually merged a topic branch from @robertwb
  • (wip) stopped using Any and instead used generics for the input and output parameters of PTransform
  • made many other refactorings to make the code more Rust-like

I welcome any contributions to this repository.

OK, so I understand that it is all new, starting from the main project? Is there nothing from @nivaldoh's branch?

Miuler commented 1 year ago

What is the most fluid conversation channel? Telegram? matrix/element? discord? slack?

laysakura commented 1 year ago

@Miuler

OK, so I understand that it is all new, starting from the main project? Is there nothing from @nivaldoh's branch?

laysakura/beam's rust_sdk branch is a fork from nivaldoh/beam's rust_sdk. I manually merged the following PRs created in nivaldoh/beam.

I'm sorry for not merging your https://github.com/nivaldoh/beam/pull/24 because https://github.com/nivaldoh/beam/pull/25 might make FIXMEs in https://github.com/nivaldoh/beam/pull/24 unnecessary (also described in https://github.com/nivaldoh/beam/pull/25).

You can check the git history in laysakura/beam yourself.

What is the most fluid conversation channel? Telegram? matrix/element? discord? slack?

https://github.com/laysakura/beam/issues/1 is.

robertwb commented 1 year ago

Thank you @laysakura for rebooting this effort!

laysakura commented 1 year ago

@robertwb Thank you for creating the basic mechanisms of Beam, such as ParDo and GBK.

@dahlbaek and I are now working on creating a more Rust-like, statically-typed pipeline based on your work. We would also appreciate your contribution to laysakura/beam based on your extensive experience with the TypeScript SDK.
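As an illustration of what "statically typed" can mean here, the sketch below gives a transform explicit input and output type parameters, so that mismatched steps fail at compile time rather than through a failed downcast at run time. The `PTransform` trait as written, `MapFn`, and `then` are hypothetical and operate on plain `Vec`s; they are not the API of laysakura/beam.

```rust
/// Hypothetical statically typed transform: `In` and `Out` are fixed
/// at compile time instead of being erased behind `Any`.
pub trait PTransform<In, Out> {
    fn expand(&self, input: Vec<In>) -> Vec<Out>;
}

/// Element-wise transform wrapping a plain function or closure.
pub struct MapFn<F>(pub F);

impl<In, Out, F> PTransform<In, Out> for MapFn<F>
where
    F: Fn(In) -> Out,
{
    fn expand(&self, input: Vec<In>) -> Vec<Out> {
        input.into_iter().map(|x| (self.0)(x)).collect()
    }
}

/// Chains two transforms. The compiler checks that the output type of
/// `first` (B) is the input type of `second`.
pub fn then<A, B, C, T1, T2>(first: &T1, second: &T2, input: Vec<A>) -> Vec<C>
where
    T1: PTransform<A, B>,
    T2: PTransform<B, C>,
{
    second.expand(first.expand(input))
}
```

For example, chaining `MapFn(|x: i32| x + 1)` with `MapFn(|x: i32| x.to_string())` type-checks, while swapping the two would be rejected by the compiler because a `String` output cannot feed an `i32` input.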

sjvanrossum commented 1 year ago

Oh, I seem to have missed some traffic on this issue. I received collaborator access to @nivaldoh's fork yesterday, but I'll move development over to @laysakura's fork. :)

I've got some work in progress for data channels on DataSource and DataSink and I'm currently drafting a change to serialization as mentioned on https://github.com/nivaldoh/beam/pull/22.

laysakura commented 1 year ago

@sjvanrossum Thank you so much! I sent an invitation to add you as a collaborator of laysakura/beam.

laysakura commented 1 year ago

I think it would be helpful to create design documents in order to align our goals and understanding. To start, I have written an initial version of a document titled "Custom Coders for the Beam Rust SDK".

A portion of the proposal outlined in the document has already been implemented and tested, as can be seen here: https://github.com/laysakura/beam/pull/30/.

I would especially appreciate it if @sjvanrossum and @dahlbaek, who have recently collaborated with me on the laysakura/beam repository, could take a look and provide any comments or suggestions.

Of course, I welcome feedback from anyone.

brucearctor commented 1 year ago

@laysakura and all: a little note to see whether interest has dwindled, or just other priorities, etc. It seems this had been taking shape and would be great to eventually get it in a state where we can get this merged into Beam.

laysakura commented 1 year ago

@brucearctor For my part, I am still interested but cannot prioritize Beam-related work at my company for a while 😞 It would be great if others could lead the Rust SDK's development.

brucearctor commented 1 year ago

@brucearctor For my part, I am still interested but cannot prioritize Beam-related work at my company for a while 😞 It would be great if others could lead the Rust SDK's development.

Makes sense -- our ability to devote work time to various efforts does change over time.

Sounds like a call for any/all who are interested to consider stepping in and helping/contributing!

dahlbaek commented 1 year ago

From my side I'm still interested in contributing, but I do it on my own time, and haven't had much to spare lately.

brucearctor commented 1 year ago

From my side I'm still interested in contributing, but I do it on my own time, and haven't had much to spare lately.

@dahlbaek : Totally understandable, and one of the nice things about Open Source!

sjvanrossum commented 9 months ago

@brucearctor I have Coder/DoFn serialization and authoring functionality in the works that I've only been able to make progress on intermittently, but I'm happy to support contributors if they wish to contribute PRs (not sure if my remote branch is fully up to date, but I'll take a peek). My situation matches that of the other contributors; I've unfortunately had to prioritize my core role over this for the past few months.

brucearctor commented 8 months ago

Just found this --> https://github.com/swiftdiaries/beam-rust. I haven't dug deep, so it's unclear how much of what's inside might be usable here.

sjvanrossum commented 7 months ago

Seems like it didn't progress beyond "Hello, world!" unfortunately.