MaxHalford / maxhalford.github.io

🏡 Personal website
https://maxhalford.github.io
MIT License

blog/future-of-river/ #26

Open utterances-bot opened 1 year ago

utterances-bot commented 1 year ago

The future of River • Max Halford

When I see tweets like this one, I’m both happy because people are aware of River, but also irked because it’s really difficult to make production-grade open source software. We just had a developer meeting a week ago. We planned what we will work on during the first half of 2023. I thought it would be worthwhile to give a high-level view of how we envision River’s future. If not to be comprehensive, at least to reassure potential users that River is alive and kicking 🤺

https://maxhalford.github.io/blog/future-of-river/

cyrilou242 commented 1 year ago

Good to know the project is alive and kicking! Your source code and the interface you designed are a great inspiration for my work.

About performance, I think it's not only about raw speed, but also about the ecosystem. In batch, it's easy to do the training step in Python with xgboost/lightgbm, then export to C/Java for serving. In online systems this tends to be harder, given that the process that trains and the one that predicts are (often) the same. From what I've seen (I might be biased), a lot of long-lived processes like servers and distributed data processing workflows run on the JVM (think Hadoop, Flink, Spark, Apache Beam, and a lot of monolith/microservice servers) or in C/Go. Also, Python is far behind these languages in terms of observability and debuggability in production. So it's not clear to me how light-river (with a focus on Python bindings, I guess?) will integrate into the big data ecosystem. It's less sexy, but have you ever considered building tools for Java? I like how the River interface is clear and easy to integrate into larger systems, and I would love to see a light-river on the JVM. There is existing work, but it tends to mix ML, compute, and visualization frameworks, or rely on a single execution platform, e.g. https://moa.cms.waikato.ac.nz/. The space is quite empty, with far less competition than in the Python ecosystem.

MaxHalford commented 1 year ago

Hey @cyrilou242! You make a great point. You are probably right: the Java ecosystem takes up a large chunk of the stream processing ecosystem. And as you say, Python is lagging behind in that space. There are two reasons why we want to do things this way though.

The first is that the people I know and myself are used to Python and Rust. We simply do not have the time to change our habits and skillset. I know this is a poor answer, but in practice it means a lot. Especially considering most of the people who work on River/Beaver are doing so in their spare time. There's a "fun" aspect which we just can't ignore.

The second reason is my belief that compute is shifting towards databases, and moving away from the programming language and its runtime. If you look at Beaver, most of the stuff is done in Redpanda/Materialize. Python is only used to predict with and update a model. With light-river, if we are able to compile models to WASM, we can move all the inference/training logic to the database. We'll see if this pays off, but I have high hopes.
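
To make the "predict with and update a model" step concrete, here is a minimal sketch in plain Python. It is not River's or Beaver's actual code: the `RunningMeanRegressor` class and the `serve` helper are hypothetical, and the `predict_one`/`learn_one` method names are simply borrowed from River's one-sample-at-a-time convention. The point is only that the same process predicts on a sample and then learns from it.

```python
from dataclasses import dataclass


@dataclass
class RunningMeanRegressor:
    """Hypothetical toy online model: predicts the running mean of past targets."""

    n: int = 0
    mean: float = 0.0

    def predict_one(self, x: dict) -> float:
        # Features are ignored in this toy model; only the running mean is used.
        return self.mean

    def learn_one(self, x: dict, y: float) -> None:
        # Incremental mean update, so no past samples need to be stored.
        self.n += 1
        self.mean += (y - self.mean) / self.n


def serve(model, stream):
    """Predict on each sample first, then update the model with its label."""
    predictions = []
    for x, y in stream:
        predictions.append(model.predict_one(x))
        model.learn_one(x, y)
    return predictions


# Illustrative data only: (features, target) pairs arriving one at a time.
stream = [({"t": 0}, 2.0), ({"t": 1}, 4.0), ({"t": 2}, 6.0)]
model = RunningMeanRegressor()
predictions = serve(model, stream)
```

Because the model state is just two numbers updated in place, this kind of logic is small enough that moving it next to the data is at least plausible, which is the bet described above.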

cyrilou242 commented 1 year ago

Hey @MaxHalford,

The first is that the people I know and myself are used to Python and Rust. We simply do not have the time to change our habits and skillset. I know this is a poor answer, but in practice it means a lot.

Oh no, that's a totally legit answer!

Interesting take on WASM. I'll follow this. Thanks again for working on this nice lib!