Azure / feast-azure

Azure plugins for Feast (FEAture STore)

How to load historical features directly into a Spark dataframe #71

Open VincentPe opened 1 year ago

VincentPe commented 1 year ago

We have been using Feast with a SQL database as the offline store, appending features from a Spark dataframe directly to a SQL table via JDBC. Now, for a recommender, we'd like to build a historical dataset to train models on, which will run to a couple hundred million rows, each a customer with a timestamp. Feast's get_historical_features only accepts a pandas dataframe or a SQL query as the entity_df, so our workaround has been to store the entity df in the SQL database and fetch the features with a query, like so:

# fs is an instantiated feast FeatureStore; the entity df was written
# to the offline store's SQL database beforehand
sql_job = fs.get_historical_features(
    entity_df="SELECT * FROM test_entity_df",
    features=[
        'feature_view1:feature1',
        'feature_view1:feature2',
    ],
)
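
For context, the entity df is appended to SQL from Spark via a plain JDBC write, roughly like this (the connection URL, table name, and credentials below are placeholders, not our actual setup):

# Append the entity dataframe (customer id + event timestamp) to a SQL
# table so it can be referenced by name in get_historical_features.
(
    entity_spark_df.write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
    .option("dbtable", "test_entity_df")
    .option("user", "<user>")
    .option("password", "<password>")
    .mode("append")
    .save()
)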

However, sql_job only exposes to_df, to_arrow, and persist. My question: how can I load the features efficiently into a Spark dataframe for training? One solution would be to store the result of the Feast query in a SQL table and use JDBC again to load it into Spark, but I cannot get persist to work, as the documentation on SavedDatasetStorage is very limited. Please advise.
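
For completeness, this is roughly the shape of what I tried based on the saved-dataset docs; SavedDatasetFileStorage and the path are my guesses, and the MSSQL offline store presumably needs its own SavedDatasetStorage subclass:

from feast.infra.offline_stores.file_source import SavedDatasetFileStorage

# Materialize the retrieval job as a saved dataset (Parquet on disk),
# then load the persisted files straight into Spark, skipping the SQL
# round trip.
fs.create_saved_dataset(
    from_=sql_job,
    name="recommender_training_set",  # hypothetical name
    storage=SavedDatasetFileStorage(path="data/recommender_training_set.parquet"),
)

train_df = spark.read.parquet("data/recommender_training_set.parquet")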

Resources:
https://docs.feast.dev/reference/offline-stores/overview#functionality
https://docs.feast.dev/getting-started/concepts/dataset#creating-a-saved-dataset-from-historical-retrieval