Feast s3 support - GitHub Issues

absingh-coursera commented 1 month ago

Is your feature request related to a problem? Please describe. Hi Team, I am trying to use Feast as an alternative to SageMaker Feature Store, but some constraints have led me to raise this issue:

We currently use Redshift + Databricks as our data warehouse and are slowly moving towards Databricks as the only source of truth.
For certain reasons I can't use Redshift, and I don't want to use RDS: we don't otherwise use it, and maintaining it just for a feature store is not something I want to take on.

Describe the solution you'd like There are two ideal solutions I would like to see

S3 as an offline store, not as a single file but in a more standard layout where Feast maintains partitions of the primary data.

Describe alternatives you've considered

Since there are not many open source feature stores, our best options are Feast or switching to the Databricks Feature Store.

Additional context I have covered most of the context above; if there are any questions, I am happy to answer.

tokoko commented 1 month ago

@absingh-coursera Hey, I think you might be somewhat confusing the concepts of DataSource and OfflineStore here. Offline stores in feast are engines (not storage) by which feast gets the offline datasets, joins them and produces the training set. s3 alone can't be an offline store implementation as you can't do data transformations in s3. There are currently 2 ways you can use s3 to store offline features:

You can use FileSource to point to s3 folders, but FileSource data source type can currently be queried by duckdb and dask offline stores only.
You can use SparkSource, which is a generic data source for SparkOfflineStore. As long as you are able to configure the Spark session to access s3 locations, you can accomplish the same thing with it. If you're trying to use Feast in Databricks, this is probably the best way to go for you.
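
A minimal sketch of the second option, under stated assumptions: the bucket, prefix, source name, and timestamp column below are hypothetical. SparkSource and its path, file_format, and timestamp_field arguments come from Feast's Spark offline store; the s3a:// scheme is an assumption and depends on how the Spark session is configured.

```python
# Keyword arguments for a Feast SparkSource backed by Parquet files in S3.
# Bucket, prefix, source name, and column name are hypothetical placeholders.
DRIVER_STATS_SOURCE_KWARGS = {
    "name": "driver_stats_source",
    "path": "s3a://example-bucket/feast/driver_stats/",
    "file_format": "parquet",
    "timestamp_field": "event_timestamp",
}


def build_driver_stats_source():
    """Create the SparkSource.

    Feast is imported lazily so this module can be loaded on machines
    where Feast's Spark extras are not installed.
    """
    from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import (
        SparkSource,
    )

    return SparkSource(**DRIVER_STATS_SOURCE_KWARGS)
```

The returned source would then be passed as the source of a feature view. Pointing path at a folder rather than a single file lets Spark read every Parquet file under that prefix.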

absingh-coursera commented 1 month ago

@tokoko so this would be the entire flow -

I build a feature view with SparkOfflineSource and point it towards a folder containing partitioned parquet files.
For offline data ingestion I just dump additional files in the same s3 location
For online data ingestion, I materialize after the above step; for the online store I am using AWS DynamoDB.

This seems pretty clear, couple of questions -

So how does Feast know that new data has been ingested into the offline store? When I call materialize, it goes to the same folder and checks for the latest update, right?
Since I want to maintain individual folders for individual feature views, will it affect Feast's feature gathering while building the training set during offline feature retrieval?

tokoko commented 1 month ago

I build a feature view with SparkOfflineSource and point it towards a folder containing partitioned parquet files.

Yes, today you need to use SparkSource for this. We plan to add FileSource support to spark offline store as well in the future. It will behave identically, with the only difference being that with FileSource you will no longer be bound to spark offline store only. You can have some feast processes running with spark on databricks and some other processes elsewhere with duckdb or dask.

So how does Feast know that new data has been ingested into the offline store? When I call materialize, it goes to the same folder and checks for the latest update, right?

When you use materialize, you are the one who provides the lower and upper bounds on the event_timestamp column that are used to pull the dataset from the offline store. If you use incremental materialization, Feast stores the last upper bound in the registry and uses it as the lower bound for the next run. (docs)
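
The bookkeeping described above can be sketched as a toy model. This is not Feast's implementation: the class name, the first-run lookback, and the half-open interval are assumptions made for illustration. It shows that rows are selected by event_timestamp, so newly dropped files are picked up only if their timestamps fall inside the requested window.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional, Tuple


class MaterializationWindowTracker:
    """Toy model of incremental-materialization bookkeeping (not Feast's code)."""

    def __init__(self, first_run_lookback: timedelta) -> None:
        # Assumption: how far back the very first run reaches.
        self.first_run_lookback = first_run_lookback
        # Stands in for the "last upper bound" kept in the registry.
        self.last_end: Optional[datetime] = None

    def next_window(self, end: datetime) -> Tuple[datetime, datetime]:
        """Return (start, end) for the next run and remember `end`."""
        if self.last_end is None:
            start = end - self.first_run_lookback
        else:
            start = self.last_end
        if start >= end:
            raise ValueError("end must be later than the previous upper bound")
        self.last_end = end
        return start, end


def rows_in_window(rows: List[Dict], start: datetime, end: datetime) -> List[Dict]:
    """Select rows by event_timestamp; file arrival time plays no role.

    The half-open interval (start, end] is a choice made for this sketch.
    """
    return [r for r in rows if start < r["event_timestamp"] <= end]


tracker = MaterializationWindowTracker(first_run_lookback=timedelta(days=1))
day1 = datetime(2024, 8, 1, tzinfo=timezone.utc)
day2 = datetime(2024, 8, 2, tzinfo=timezone.utc)
first = tracker.next_window(day1)   # starts one lookback before day1
second = tracker.next_window(day2)  # starts where the first run ended
```

In Feast itself, the two modes correspond to FeatureStore.materialize(start_date, end_date), where the caller supplies both bounds, and FeatureStore.materialize_incremental(end_date), where the lower bound comes from the registry.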

Since I want to maintain individual folders for individual feature views, will it affect Feast's feature gathering while building the training set during offline feature retrieval?

Not sure I get the question. Each table behind a feature view needs to be in a separate "folder" in s3, of course.
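
One way to keep that layout honest is a small helper that derives one prefix per feature view and rejects overlapping folders. This is a plain-Python sketch, not part of Feast; the bucket, base prefix, feature view names, and both function names are hypothetical.

```python
from typing import Dict, Iterable


def source_paths(base: str, feature_views: Iterable[str]) -> Dict[str, str]:
    """Map each feature view name to its own folder under `base`."""
    root = base.rstrip("/")
    return {name: f"{root}/{name}/" for name in feature_views}


def check_separate_folders(paths: Dict[str, str]) -> None:
    """Raise ValueError if two feature views share or nest folders."""
    # Normalize to a trailing slash so "driver/" is not treated as a
    # prefix of "driver_stats/".
    items = [(name, path.rstrip("/") + "/") for name, path in paths.items()]
    for i, (name_a, path_a) in enumerate(items):
        for name_b, path_b in items[i + 1:]:
            if path_a.startswith(path_b) or path_b.startswith(path_a):
                raise ValueError(
                    f"{name_a} and {name_b} overlap: {path_a} vs {path_b}"
                )


# Hypothetical bucket and feature view names.
PATHS = source_paths(
    "s3a://example-bucket/feast", ["driver_stats", "customer_profile"]
)
check_separate_folders(PATHS)
```

Each resulting path would be used as the path of that feature view's own data source, so the data for one feature view never mixes with another's.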

absingh-coursera commented 1 month ago

thanks @tokoko, this makes it much clearer.

tokoko commented 1 month ago

you're welcome. I'll go ahead and close this then.

feast-dev / feast

Feast s3 support #4397