Open hzding621 opened 7 months ago
def runModelInference(join: Join, inputs: Map[String, AnyRef]): Future[Map[String, AnyRef]]
This should instead be a batch / multi method:
def runModelInference(join: Join, inputs: Seq[Map[String, AnyRef]]): Future[Seq[Map[String, AnyRef]]]
CHIP-9: Support Model-based Transformations in Join & Chaining
Problem Statement
Model inference is an important primitive transform function that ML practitioners use when creating feature pipelines. The most popular example is embeddings as ML features, where the output of an embedding model (usually a DNN model) is used as input features for a downstream model. Once trained, the embedding model can be treated as a special case of the general row-to-row transformation function, and can be plugged in anywhere in a feature pipeline.
This CHIP will add support for model-based transformations in Chronon via an extension to the Join API. Combined with the existing chaining support for joins, this will allow Chronon users to build complex feature pipelines that include model-based transformations. It covers both offline backfills and online serving, just like regular joins.
Requirements
Non-Requirements
Join API Changes
We call this the Model Transform API. It extends the current Chronon join with a new model_transform section that handles model inference after the raw feature data has been generated. Model transform takes place after the execution of join logic completes, and acts as another round of transformation that calls into an external model inference engine (either batch or online) to retrieve the model inference output.
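The ordering described above can be illustrated with a minimal sketch. It is not Chronon code: `ModelTransformFlowSketch`, `Row`, `runJoinWithModelTransform`, and both function parameters are hypothetical names chosen for illustration, and the external inference engine is modeled as a plain function over rows.

```scala
object ModelTransformFlowSketch {
  // A row is a map from column name to value, mirroring the
  // Map[String, AnyRef] shape used elsewhere in this proposal.
  type Row = Map[String, AnyRef]

  // Hypothetical two-stage pipeline:
  //   1. joinLogic produces the raw feature rows (the regular join).
  //   2. inferenceEngine stands in for the external model inference
  //      engine; its output is what the caller receives.
  def runJoinWithModelTransform(
      joinLogic: Seq[Row] => Seq[Row],
      inferenceEngine: Seq[Row] => Seq[Row]
  )(requests: Seq[Row]): Seq[Row] = {
    val rawFeatures = joinLogic(requests)
    inferenceEngine(rawFeatures)
  }
}
```

The point of the sketch is only the sequencing: the engine never sees the original requests, only the rows that the join logic has already produced.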
Model
Core to the Model Transform API is the Model definition. A Model contains all parameters required to invoke a model backend to run model inference. Note that the model backend implementation will not live in Chronon open source but in an impl-specific code base. The responsibility of Chronon here is to handle the end-to-end orchestration from raw data into final features. It also serves as the central repository for feature metadata.
model_backend: this refers to a specific model backend that will be in the Impl. This is similar to the name of an ExternalSourceHandler in the ExternalPart API.
model_backend_params: this is used to store any params needed by the model backend to locate the model instance and know how to run inference against it.
Model Transform
We will introduce ModelTransform as another section in the Join definition. During orchestration, this step runs after derivations and its output becomes the new join output. ModelTransform contains the core Model definition, as well as some additional join-level parameters for mappings and formatting:
Model Backend APIs
The model backend will need to implement the following APIs, which Chronon will invoke during orchestration.
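As a rough illustration of what a backend implementation might look like, the sketch below uses the batch runModelInference signature proposed in this thread. Everything else is an assumption made for the example: the `ModelBackend` trait name, the simplified `Model` and `Join` case classes (stand-ins that carry only the model_backend and model_backend_params fields), the toy `ScalingBackend`, and its `factor` parameter.

```scala
import scala.concurrent.{ExecutionContext, Future}

object ModelBackendSketch {
  // Hypothetical, simplified stand-in for the Model definition.
  // Field names follow the proposal; the types are assumed.
  final case class Model(modelBackend: String, modelBackendParams: Map[String, String])

  // Hypothetical, minimal stand-in for a Join that carries a model transform.
  final case class Join(name: String, model: Model)

  trait ModelBackend {
    // Batch form: expected to return one output map per input map, in order.
    def runModelInference(
        join: Join,
        inputs: Seq[Map[String, AnyRef]]
    ): Future[Seq[Map[String, AnyRef]]]
  }

  // Toy backend for illustration only: "inference" multiplies the numeric
  // input column "x" by a factor read from the model backend params.
  class ScalingBackend(ec: ExecutionContext) extends ModelBackend {
    override def runModelInference(
        join: Join,
        inputs: Seq[Map[String, AnyRef]]
    ): Future[Seq[Map[String, AnyRef]]] = Future {
      val factor = join.model.modelBackendParams.getOrElse("factor", "1.0").toDouble
      inputs.map { row =>
        val x = row("x").asInstanceOf[java.lang.Number].doubleValue()
        Map[String, AnyRef]("score" -> java.lang.Double.valueOf(x * factor))
      }
    }(ec)
  }
}
```

Taking a whole batch per call lets a backend amortize per-request overhead (for example, one network round trip for many rows), which is the motivation for preferring the Seq-based signature over the single-row one.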
Orchestration Topology
Metadata Registration and Validation
Join Backfill
Join Fetching
Orchestration Details
Join Level Operations
Analyzer
Join Backfill
Fetcher
fetchJoin that:
Model Metadata Upload
Group By Level Operations (for Chaining)
The operations below relate to chaining, where the output of a Join with Model Transform is used as a JoinSource in a downstream GroupBy, which can be either a batch GroupBy or a streaming GroupBy.
Group By Upload
Group By Streaming