
blog/ogd-in-sql/ #29

Open utterances-bot opened 1 year ago

utterances-bot commented 1 year ago

Online gradient descent written in SQL • Max Halford

Introduction Modern MLOps is complex because it involves too many components. You need a message bus, a stream processing engine, an API, a model store, a feature store, a monitoring service, etc. Sadly, containerisation software and the unbundling trend have encouraged an appetite for complexity. I believe MLOps shouldn’t be this complex. For instance, MLOps can be made simpler by bundling the logic into your database. In this post, I want to push this idea, and actually implement a machine learning algorithm within a relational database, using SQL.

https://maxhalford.github.io/blog/ogd-in-sql/

mgajda commented 1 year ago

Please provide performance timings!

It would be great to know whether we lose anything by doing this in SQL, or whether it's actually faster than Python.

MaxHalford commented 1 year ago

Hey @mgajda! Actually, the current implementation seems a bit slow: it takes ~1s to process ~5000 rows. I suspect this is because each recursion step performs a join. I'm sure there are faster ways to write my query, but I wanted to focus on readability.
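
For concreteness, here's a stripped-down sketch of the join pattern I mean: a single weight, squared loss, a fixed 0.01 learning rate, and inline toy data, none of which match the post exactly, but the join shape is the same:

```sql
-- Minimal online gradient descent as a recursive CTE (DuckDB syntax).
-- Each recursive step joins the running state against the stream,
-- which is the join I suspect is slowing things down.
WITH RECURSIVE stream AS (
    SELECT ROW_NUMBER() OVER () AS step, x, y
    FROM (VALUES (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)) AS points(x, y)
),
state AS (
    -- Base case: the initial weight, before any row has been seen.
    SELECT CAST(0 AS BIGINT) AS step, CAST(0.0 AS DOUBLE) AS w
    UNION ALL
    -- Recursive case: one gradient step per row of the stream.
    SELECT stream.step,
           state.w - 0.01 * 2 * (state.w * stream.x - stream.y) * stream.x
    FROM state
    JOIN stream ON stream.step = state.step + 1
)
SELECT * FROM state ORDER BY step;
```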

Also, I can't justify why, but my gut feeling is that the database should be an order of magnitude faster than Python, provided the implementation were improved.

Note I've put the code in a notebook, see here. Feel free to run it yourself, it's pretty much self-contained thanks to DuckDB :)

pstef commented 1 year ago

I think `WITH PARTITION` is a typo and you meant `WITH RECURSIVE`.

lmacedo-pivotal commented 1 year ago

Have you heard of Apache MADlib? You should check it out; it's exactly what you are proposing!

Teggy commented 1 year ago

Hi Max,

Interesting write-up, really! Nice to see how the unpivoted data representation helps your case here.

Regarding the performance of the recursive CTE for online gradient descent: as you suspected in your response to @mgajda, DuckDB repeatedly performs the join between `state` and `stream`, and in fact the query recomputes `stream` (and thus `X` and `y`) in every iteration, because DuckDB inlines the definition of `stream` (and of `X` and `y`) into the recursive CTE. 🙈

Here in Tübingen we're currently working on an optional `MATERIALIZED` modifier for CTEs in DuckDB, which evaluates a CTE once and holds on to its result for repeated scanning (similar functionality has been available in PostgreSQL for some time now). Using that modified variant of DuckDB and materializing your `stream` CTE brings the runtime down to 0.2s (on my MacBook Pro M1).
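
As a toy illustration (a running sum rather than gradient descent, and assuming the DuckDB surface syntax ends up matching PostgreSQL's), the modifier is just an annotation on the CTE:

```sql
-- Sketch: stream is evaluated once and its result is rescanned at
-- every recursion step, instead of being inlined and recomputed.
WITH RECURSIVE stream AS MATERIALIZED (
    SELECT ROW_NUMBER() OVER () AS step, v
    FROM (VALUES (10), (20), (30)) AS t(v)
),
running AS (
    SELECT CAST(0 AS BIGINT) AS step, 0 AS total
    UNION ALL
    SELECT stream.step, running.total + stream.v
    FROM running
    JOIN stream ON stream.step = running.step + 1
)
SELECT * FROM running ORDER BY step;
```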

A PR for DuckDB is in preparation.

MaxHalford commented 1 year ago

Thanks @Teggy for digging into this. Good to know about `MATERIALIZED`. I'm impressed by what you folks in Tübingen are doing.

Coincidentally, I just found out about you after watching your tutorial on connected components, which is what my next blog post will be about. Thank you for that.