feast-dev / feast

The Open Source Feature Store for Machine Learning
https://feast.dev
Apache License 2.0

Add `deltalake` backend #3865

Open ion-elgreco opened 9 months ago

ion-elgreco commented 9 months ago

Is your feature request related to a problem? Please describe. I am considering using Feast, but the main back-end we use, deltalake, is not supported. Deltalake is the only data lake implementation that has read and write support without a JVM. This makes it fairly easy to build large data lakes with only Python, but without a feature store wrapper there is no easy way to make features accessible.

Describe the solution you'd like Add deltalake as an officially supported back-end.

Describe alternatives you've considered There aren't really any.

sudohainguyen commented 9 months ago

As I understand it, you want to query a feature table stored in delta format. Spark and Trino can help with that, and Feast supports both of them.

ion-elgreco commented 9 months ago

No, I would like to do this without a JVM application. The delta-rs Python bindings (deltalake) can be used to achieve this: https://github.com/delta-io/delta-rs

sudohainguyen commented 9 months ago

Cool. We need some changes to extend FileSource so it can read delta tables. Do you mind contributing?

ion-elgreco commented 9 months ago

Sure, if you can give me some pointers : )

tokoko commented 9 months ago

@ion-elgreco Let me try to give you a quick rundown of what the integration might look like. First of all, the concept closest to a backend in Feast is an OfflineStore, but offline store implementations don't just specify the sources and how they should be read; they also implement additional logic on top of that (a point-in-time join between the entity dataframe and the feature tables). That's why it's unlikely that we can have a deltalake offline store implementation, as there's no way to specify data transformations with deltalake. The closest thing to what you're looking for is probably a polars implementation (it's using delta-rs if I'm not mistaken, right?) or something like duckdb that can be extended to use delta-rs for working with delta tables (I already have a draft PR that adds duckdb minus delta, #3822).
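To make the "additional logic" concrete, here is a minimal, dependency-free sketch of a point-in-time join. It is not Feast's implementation; the function name and the column names (`driver_id`, `event_timestamp`, `conv_rate`) are assumptions chosen for illustration. For each entity row it picks the most recent feature row at or before the entity's timestamp, so no future data leaks into training.

```python
from bisect import bisect_right
from datetime import datetime


def point_in_time_join(entity_rows, feature_rows,
                       key="driver_id", ts="event_timestamp", feature="conv_rate"):
    """Attach to each entity row the latest feature value at or before its timestamp.

    Hypothetical helper for illustration only, not part of Feast.
    """
    # Group feature rows per entity key, ordered by event time.
    history = {}
    for row in sorted(feature_rows, key=lambda r: r[ts]):
        history.setdefault(row[key], []).append(row)

    joined = []
    for entity in entity_rows:
        rows = history.get(entity[key], [])
        times = [r[ts] for r in rows]
        # Index of the last feature row whose timestamp is <= the entity timestamp.
        idx = bisect_right(times, entity[ts]) - 1
        value = rows[idx][feature] if idx >= 0 else None
        joined.append({**entity, feature: value})
    return joined


features = [
    {"driver_id": 1, "event_timestamp": datetime(2024, 1, 1), "conv_rate": 0.1},
    {"driver_id": 1, "event_timestamp": datetime(2024, 1, 3), "conv_rate": 0.3},
    {"driver_id": 2, "event_timestamp": datetime(2024, 1, 2), "conv_rate": 0.7},
]
entities = [
    {"driver_id": 1, "event_timestamp": datetime(2024, 1, 2)},
    {"driver_id": 1, "event_timestamp": datetime(2024, 1, 4)},
    {"driver_id": 2, "event_timestamp": datetime(2024, 1, 1)},
]
joined = point_in_time_join(entities, features)
```

A real offline store would push this logic down to its query engine rather than loop in Python, which is why a storage-only library is not enough on its own.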

Feast has another concept called DataSource. This is how you specify the sources that offline stores will have to read later on. The implementation you might be interested in is FileSource, as @sudohainguyen pointed out. It allows users to specify a file format, but currently only the parquet format is supported. So the first logical step should be to extend FileSource to allow users to specify delta as a file format. Once we have that, we can teach various offline store implementations (JVM-based or otherwise) how to read them.
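A rough sketch of that split, assuming a source object that only records a path and a format, with each reader deciding how to load it. `FileFormat`, `SketchFileSource`, `READERS`, and `load` are hypothetical names and not Feast's actual classes; the only third-party calls used are `pyarrow.parquet.read_table` and `deltalake.DeltaTable(...).to_pyarrow_table()`, imported lazily so the sketch itself loads without them.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict


class FileFormat(Enum):
    """Hypothetical format tag; Feast's real format classes may look different."""
    PARQUET = "parquet"
    DELTA = "delta"


@dataclass(frozen=True)
class SketchFileSource:
    """Hypothetical stand-in for a FileSource: where the data is and how it is stored."""
    path: str
    file_format: FileFormat = FileFormat.PARQUET


def read_parquet(path: str) -> Any:
    # Lazy import: pyarrow is only needed when a parquet source is actually read.
    import pyarrow.parquet as pq
    return pq.read_table(path)


def read_delta(path: str) -> Any:
    # Lazy import: deltalake is the delta-rs Python binding, so no JVM is involved.
    from deltalake import DeltaTable
    return DeltaTable(path).to_pyarrow_table()


# Each offline store could keep its own mapping from format to reader.
READERS: Dict[FileFormat, Callable[[str], Any]] = {
    FileFormat.PARQUET: read_parquet,
    FileFormat.DELTA: read_delta,
}


def load(source: SketchFileSource) -> Any:
    """Dispatch to the reader registered for the source's file format."""
    return READERS[source.file_format](source.path)
```

With this shape, adding delta support is a new enum member plus one reader per offline store, and existing parquet sources keep working because parquet stays the default.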

ion-elgreco commented 9 months ago

@tokoko gotcha, that helps! Since I mainly use Polars, I will look into adding that as an offline store and then add delta as an additional FileSource format, using deltalake as a dependency.

Yup, Polars uses deltalake to read and write.

tokoko commented 9 months ago

Glad to be able to help. One more pointer that may help you out, but note that this is my preferred direction, which I'm trying to push (without much luck as of yet :) ). Despite your preference for polars, you should probably still check out the duckdb PR I linked above. The actual offline store implementation is written using ibis rather than duckdb directly. As ibis has a fairly good polars backend, you could easily reuse the same ibis implementation. In that case, the polars implementation might be just a single-line code change (probably not, but something close to that).

sudohainguyen commented 9 months ago

Great explanation @tokoko! Looking forward to seeing the changes.