feast-dev / feast

The Open Source Feature Store for Machine Learning
https://feast.dev
Apache License 2.0

Feast s3 support #4397

Closed absingh-coursera closed 1 month ago

absingh-coursera commented 1 month ago

Is your feature request related to a problem? Please describe. Hi Team, I am trying to use Feast as an alternative to SageMaker Feature Store, but there are some constraints due to which I am raising this issue.

Describe the solution you'd like There are two ideal solutions I would like to see

Describe alternatives you've considered

Additional context I have covered most of the context above. If there are any questions, I am happy to answer.

tokoko commented 1 month ago

@absingh-coursera Hey, I think you might be somewhat confusing the concepts of DataSource and OfflineStore here. Offline stores in feast are engines (not storage) by which feast gets the offline datasets, joins them and produces the training set. s3 alone can't be an offline store implementation as you can't do data transformations in s3. There are currently 2 ways you can use s3 to store offline features:

absingh-coursera commented 1 month ago

@tokoko so this would be the entire flow -

This seems pretty clear, couple of questions -

tokoko commented 1 month ago

I build a feature view with SparkOfflineSource and point it towards a folder containing partitioned parquet files.

Yes, today you need to use SparkSource for this. We plan to add FileSource support to spark offline store as well in the future. It will behave identically, with the only difference being that with FileSource you will no longer be bound to spark offline store only. You can have some feast processes running with spark on databricks and some other processes elsewhere with duckdb or dask.
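For illustration, a feature view backed by a SparkSource that reads a folder of parquet files might be declared roughly as below. This is a minimal sketch, not taken from the thread: the bucket path, the `driver` entity and the `conv_rate` field are hypothetical, and it assumes Feast is installed with the Spark extra. The URI scheme (`s3://` vs `s3a://`) depends on how your Spark cluster is configured.

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field
from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import (
    SparkSource,
)
from feast.types import Float32

# Hypothetical entity; the join key must match a column in the parquet files.
driver = Entity(name="driver", join_keys=["driver_id"])

# Hypothetical S3 folder holding the (partitioned) parquet files for this table.
driver_stats_source = SparkSource(
    name="driver_stats_source",
    path="s3://my-bucket/features/driver_stats/",
    file_format="parquet",
    timestamp_field="event_timestamp",
)

# One feature view per table, i.e. per folder.
driver_stats_fv = FeatureView(
    name="driver_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[Field(name="conv_rate", dtype=Float32)],
    source=driver_stats_source,
)
```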

So how does Feast know that new data has been ingested into the offline store? When I call materialize, it goes to the same folder and checks for the latest update, right?

When you use materialize, you are the one who provides the lower and upper bounds on the event_timestamp column that are used to pull the dataset from the offline store. If you use incremental materialization, Feast stores the last upper bound in the registry and uses it as the starting point for the next run. (docs)
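The bookkeeping described above can be sketched in plain Python. This is a toy model, not Feast's implementation: `ToyRegistry`, the row layout, the seven-day fallback for the first run and the half-open window `(start, end]` are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


class ToyRegistry:
    """Stand-in for the registry: remembers the last materialized upper bound."""

    def __init__(self) -> None:
        self.last_end: Optional[datetime] = None


def materialize(
    rows: List[Row], registry: ToyRegistry, start: datetime, end: datetime
) -> List[Row]:
    """Select rows whose event_timestamp falls in (start, end] and record end."""
    selected = [r for r in rows if start < r["event_timestamp"] <= end]
    registry.last_end = end
    return selected


def materialize_incremental(
    rows: List[Row],
    registry: ToyRegistry,
    end: datetime,
    first_run_lookback: timedelta = timedelta(days=7),  # assumed fallback
) -> List[Row]:
    """Use the stored upper bound of the previous run as the new lower bound."""
    if registry.last_end is not None:
        start = registry.last_end
    else:
        start = end - first_run_lookback
    return materialize(rows, registry, start, end)


t0 = datetime(2024, 8, 1, tzinfo=timezone.utc)
rows = [
    {"driver_id": 1, "event_timestamp": t0 + timedelta(hours=h)} for h in (1, 5, 30)
]
registry = ToyRegistry()

# First run picks up the rows at +1h and +5h; the second run only the row at +30h.
first = materialize_incremental(rows, registry, end=t0 + timedelta(hours=6))
second = materialize_incremental(rows, registry, end=t0 + timedelta(hours=48))
```

In Feast itself the corresponding calls are `FeatureStore.materialize(start_date=..., end_date=...)` and `FeatureStore.materialize_incremental(end_date=...)`. Nothing scans the folder for changes: new files are picked up only if their event timestamps fall inside the requested window.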

Since I want to maintain individual folders for individual feature views, will it affect how Feast gathers features while building the training set during offline feature retrieval?

Not sure I get the question. Each table behind a feature view needs to be in a separate "folder" in s3, of course.

absingh-coursera commented 1 month ago

Thanks @tokoko, this makes it much clearer.

tokoko commented 1 month ago

you're welcome. I'll go ahead and close this then.