fal-ai / dbt-fal

do more with dbt. dbt-fal helps you run Python alongside dbt, so you can send Slack alerts, detect anomalies and build machine learning models.
https://fal.ai/dbt-fal

[Design Doc] fal-dbt feature store #91

Open turbo1912 opened 2 years ago

turbo1912 commented 2 years ago

What are we building?

A feature store is a data system that facilitates managing data transformations centrally for predictive analysis and ML models in production.

fal-dbt feature store is a feature store implementation consisting of a dbt package and a Python library.

Why are we doing this?

Empower analytics engineers: ML models and analytics operate on the same data. Analytics engineers know this data inside out. They are the ones setting up metrics and ensuring data quality and freshness. Why shouldn’t they be the ones responsible for predictive analysis? With the rise of open-source modelling libraries, most of the work that goes into an ML model happens on the data processing side.

Leverage the Warehouse: Warehouses are secure, scalable, and relatively cheap environments for data transformation. Doing transformations in other environments is at least an order of magnitude more complicated. The warehouse should be part of the ML engineer's toolkit, especially for batch predictions. dbt is the best tool out there for transformations in the warehouse, and the dbt feature store will let ML workflows leverage all the advantages of modern data warehouses.

Strategy

The first building block for the fal feature store is the fal-dbt cli tool. Using the fal-dbt cli, dbt users are able to perform various tasks via Python scripts that run after their dbt workflows.

✅ Milestone 1: Add ability to read feature store config from dbt ymls

✅ Milestone 2: Run create_dataset from the fal dbt python client

✅ Milestone 3: Move feature to online store and provide online store client

Already Possible with fal-dbt cli

✅ Milestone 4: Add ability to ETL data from a fal script

✅ Milestone 5: Model Monitoring

Stretch Goals

⭐️ Milestone: Logged real-time models

Online/Offline Predictions vs Logged Features

There are roughly three types of ML systems in terms of complexity: offline predictions, online predictions with batch features, and online predictions with real-time features. Most of the use cases we have seen follow the same order of frequency, with "online predictions with real-time features" being the least common.

A warehouse can handle all of the feature calculations for offline use cases, and combined with the Firestore reverse ETL we can also handle online predictions with batch features. That leaves "online predictions with real-time features", which is out of scope for the initial implementation; we plan on tackling it with logged features as a stretch goal.

Implementation

Feature Definitions

Feature store configurations are added under model configurations as part of the fal meta tag. Each feature is required to have an entity_id and a timestamp field.

The entity_id and timestamp fields are later used for the point-in-time join of a list of features and a label.
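For illustration, a point-in-time join matches each label row to the most recent feature values at or before the label's timestamp. Below is a minimal sketch of the idea in pandas; the data and merge_asof stand in for the warehouse query the feature store would actually run, so nothing here is the real implementation:

import pandas as pd

# Feature values per entity, valid as of `timestamp`.
features = pd.DataFrame({
    "entity_id": [1, 1, 2],
    "timestamp": pd.to_datetime(["2022-01-01", "2022-01-08", "2022-01-01"]),
    "trip_count_last_week": [3, 5, 1],
})

# Label observations, each taken at its own point in time.
labels = pd.DataFrame({
    "entity_id": [1, 2],
    "timestamp": pd.to_datetime(["2022-01-10", "2022-01-03"]),
    "label": [1, 0],
})

# For each label row, take the latest feature row with
# features.timestamp <= labels.timestamp for the same entity.
dataset = pd.merge_asof(
    labels.sort_values("timestamp"),
    features.sort_values("timestamp"),
    on="timestamp",
    by="entity_id",
    direction="backward",
)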

Optionally, feature definitions can include fal scripts for downstream workflows. For example, the dbt model below includes a make_avaliable_online.py script (link to example): a typical ETL step that moves the latest feature values from the data warehouse to an OLTP database.

## schema.yml
models:
  - name: bike_duration
    columns:
      - name: trip_count_last_week
      - name: trip_duration_last_week
      - name: user_id
      - name: start_date
    meta:
      fal:
        feature_store:
          entity_id: user_id
          timestamp: start_date
        scripts:
          - make_avaliable_online.py
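As a rough illustration, such a script might look like the following. The ref and context variables are the ones fal injects into scripts; the Firestore details (client usage, collection name) are assumptions for the sketch, not a prescribed API:

# make_avaliable_online.py: illustrative sketch of a reverse-ETL fal script.
from google.cloud import firestore

# `ref` and `context` are provided by fal when the script runs.
df = ref(context.current_model.name)

# Keep only the latest feature row per entity.
latest = df.sort_values("start_date").groupby("user_id").tail(1)

# Write one document per entity to an online store (names are made up).
client = firestore.Client()
for row in latest.to_dict(orient="records"):
    client.collection("bike_duration_features").document(str(row["user_id"])).set(row)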

A label is also defined as a feature using the configuration above. The fal-dbt feature store doesn’t have any requirements or assumptions on what constitutes a label.

Create Dataset

A feature store configuration doesn’t have any effect on your infrastructure unless it is used in a dataset calculation. A dataset in fal-dbt feature store is a dataframe that includes all the features and the label for the machine learning model being built.

There are two ways to create a dataset.

Creating a dataset with a dbt macro:

-- dataset_name.sql
SELECT
    *
FROM
    {{ feature_store.create_dataset(
        features = ["total_transactions", "credit_score"],
        label_name = "credit_decision"
    ) }}

This model can later be referenced in a fal script:

df = ref("dataset_name") 

Creating a dataset with Python:

from fal.ml.feature_store import FeatureStore

store = FeatureStore(creds="/../creds.json")  # path to service account

ds = store.create_dataset(
    dataset_name="dataset_name",
    features=["total_transactions", "credit_score"],
    label="credit_decision",
)

df = ds.get_pandas_dataframe()
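From there the dataframe can be fed straight into any modelling library; for example, with scikit-learn (the column selection simply mirrors the feature and label names above):

from sklearn.linear_model import LogisticRegression

# Features and label come directly out of the dataset dataframe.
X = df[["total_transactions", "credit_score"]]
y = df["credit_decision"]

model = LogisticRegression().fit(X, y)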

Python Client

from dataclasses import dataclass
from typing import List, Tuple

# Proposed interface; method bodies elided.
class FeatureStore:

    def create_dataset(self, dataset_name: str, features: List[str], label: str): ...

    def get_dataset(self, dataset_name: str): ...

@dataclass
class OnlineClient:
    client_config: ClientConfig

    def get_feature_vector(self, dbt_model: str, feature_name: str): ...

    def get_feature_vectors(self, feature_list: List[Tuple[str, str]]): ...
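For illustration, serving-time usage of the online client could look like the sketch below; the ClientConfig contents and the return shapes are assumptions, not a settled API:

# Hypothetical usage of the online client at serving time.
client = OnlineClient(client_config=ClientConfig(project="my-project"))

# Latest value of a single feature of a dbt model.
vector = client.get_feature_vector("bike_duration", "trip_count_last_week")

# Several (dbt_model, feature_name) pairs at once.
vectors = client.get_feature_vectors([
    ("bike_duration", "trip_count_last_week"),
    ("bike_duration", "trip_duration_last_week"),
])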

Scheduling

Scheduling is usually an afterthought in existing feature store implementations. It is left to the users to handle using tools like Airflow. fal-dbt feature store’s close integration with dbt offloads scheduling responsibilities to the dbt scheduler.

Incremental Calculations

dbt's incremental materializations make sure feature calculations are not wasteful: features can be computed incrementally, and they stay fresh if scheduled properly with the dbt scheduler. In the fal-dbt feature store there are no lazy feature calculations; all features are assumed to be fresh.
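For example, a feature model can stay cheap to recompute using dbt's standard incremental materialization (a generic sketch; the model name, column names, and the stg_trips source are assumptions):

-- trip_features.sql: sketch of an incrementally calculated feature model
{{ config(materialized='incremental') }}

SELECT
    user_id,
    start_date,
    trip_count_last_week,
    trip_duration_last_week
FROM {{ ref('stg_trips') }}

{% if is_incremental() %}
-- only process rows that arrived since the last run
WHERE start_date > (SELECT MAX(start_date) FROM {{ this }})
{% endif %}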

Stretch Goals

Logged Features

We have talked about this before, but we never had a clear design for how to achieve it. It fits very well with the "do the simple thing first" tenet mentioned above. Logged features achieve real-time transformations by transforming the data in the application code and then storing the transformed version in the data warehouse for training. This lets the transformation logic live in just one place (the application code) instead of being duplicated in both the warehouse and the application. Not only does it live in the application code, it is also written with the web stack, where applying business logic is easier with the help of an ORM or similar.

This sounds almost too good to be true, and problems start to emerge when the transformation code changes over time. Once a change is made in the application code, the existing training data still has the shape of the older transformation. The model has to be retrained, but the older data also needs to be back-filled, just once, to apply the new transformation. This is not ideal, but it is better than maintaining two code bases.

How can we build tools to make this easier?

sungchun12 commented 2 years ago

Love love love this. Let me know how I can help :)

turbo1912 commented 2 years ago

👀 https://github.com/fal-ai/dbt_feature_store 👀