[RMP] Iterative Recommender System Updating

NVIDIA-Merlin / Merlin

NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.

Apache License 2.0

New Functionality

Models

Iterative updating of existing model weights by fine-tuning on a new, smaller dataset
Ability to add new embeddings to the model incrementally

Transformers4Rec

N/A for now

NVTabular

All existing ops need to be modified to capture statistics necessary for iterative updating
All existing ops need to be modified to allow for iterative updating based on a new dataset and the aforementioned statistics
A subset of existing ops needs to operate in a streaming fashion so that they can capture and update their statistics dynamically
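As a rough illustration of what "capturing statistics for iterative updating" could mean for a normalization-style op, here is a minimal sketch in plain Python. `RunningMoments` is a hypothetical helper, not an NVTabular class; it keeps count, mean, and sum of squared deviations so that a new batch (or statistics fitted separately on a new dataset) can be folded in without revisiting old data.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningMoments:
    """Count, mean, and sum of squared deviations (M2) for one numeric column."""

    count: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, values):
        """Fold new values into the statistics one at a time (Welford's update)."""
        for x in values:
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)
        return self

    def merge(self, other):
        """Combine with statistics that were computed independently on other data."""
        total = self.count + other.count
        if total == 0:
            return RunningMoments()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / total
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / total
        return RunningMoments(total, mean, m2)

    @property
    def std(self):
        # Population standard deviation; 0.0 when no data has been seen.
        return math.sqrt(self.m2 / self.count) if self.count else 0.0


# Statistics from the original fit, then from a newer, smaller dataset.
old_stats = RunningMoments().update([1.0, 2.0, 3.0, 4.0])
new_stats = RunningMoments().update([5.0, 6.0, 7.0, 8.0])
merged = old_stats.merge(new_stats)
```

The same state supports both modes mentioned above: `update` for streaming and `merge` for combining a saved fit with a fit on new data.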

Systems

Streaming ingestion of user interactions into the ops workflow to keep statistics up to date
Triggering of training loops based on the collection of new data
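One simple reading of "trigger based on the collection of new data" is a row-count threshold. The sketch below is only an assumption about how such a trigger might look; `RetrainTrigger` and `min_new_rows` are hypothetical names, not part of Merlin Systems.

```python
from dataclasses import dataclass


@dataclass
class RetrainTrigger:
    """Signals a training run once enough new interaction rows have accumulated."""

    min_new_rows: int
    pending_rows: int = 0

    def record(self, n_rows):
        """Register newly ingested rows; return True when retraining should start."""
        self.pending_rows += n_rows
        if self.pending_rows >= self.min_new_rows:
            self.pending_rows = 0  # the new training run consumes the backlog
            return True
        return False


trigger = RetrainTrigger(min_new_rows=1000)
decisions = [trigger.record(n) for n in (400, 400, 400, 100)]
```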

Starting Point:

In several recent competitions, the research/kgmon team used this technique to keep the model more up to date. This should provide us with a template for how to update the model on the training side.

The NVTabular work is likely to be a major effort, since every op needs updating, but once one or two ops are complete, most of the rest should be able to follow the same template. The exception will be Categorify, which will require a total rewrite.
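One property an incrementally updatable categorical encoder would plausibly need is stable IDs: categories seen in earlier fits keep their index so that trained embedding rows stay aligned, and new categories are appended. The sketch below illustrates only that property. `IncrementalVocab` is hypothetical and is not how Categorify works today; reserving index 0 for unknown values is an assumption made for the example.

```python
class IncrementalVocab:
    """Append-only category-to-index mapping; index 0 is reserved for unknowns."""

    def __init__(self):
        self.index = {}

    def update(self, values):
        """Add unseen categories at the end without renumbering existing ones."""
        for value in values:
            if value not in self.index:
                self.index[value] = len(self.index) + 1
        return self


vocab = IncrementalVocab().update(["shoes", "hats", "shoes"])
ids_after_first_fit = dict(vocab.index)

# A later, smaller dataset introduces one new category.
vocab.update(["hats", "scarves"])
ids_after_second_fit = dict(vocab.index)
```

An encoder that re-sorts categories by frequency on every fit would not have this property, which is one way to see why an incremental version needs a different design.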

Having Systems handle streaming data to keep NVT statistics up to date is entirely new functionality and needs to be scoped.

This roadmap ticket seems ambitious and larger than it needs to be for the goal stated in the title, since I don't think regularly retraining models requires online updates. It's not that the full scope as described here isn't desirable; it just doesn't all have to be part of the same feature/release/ticket in order to have value for our customers.

If it were me, I'd split it like this, so that the first roadmap issue has immediate customer value and every subsequent issue provides upgraded functionality with incremental value:

1) The dirt simple approach: Use the same NVT feature engineering Workflow graph for every iteration, but fit it to a new batch of training data. Transform the training data with the updated Workflow and train a new model on it from scratch. Deploy the new workflow and model(s) by exporting with the existing Systems ensemble graph and swapping in the newly saved artifacts. (No NVT or Models changes required, minor Systems changes only.)
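To make the shape of that loop concrete, here is a minimal sketch using only the standard library. `fit_workflow` and `train_model` are hypothetical stand-ins for fitting the NVT Workflow and training a model from scratch, and the numbered version directories are an assumption made for illustration, not a description of how Merlin Systems lays out its exports.

```python
import json
import tempfile
from pathlib import Path


def fit_workflow(batch):
    # Stand-in for refitting the same Workflow graph on a new batch.
    return {"mean": sum(batch) / len(batch)}


def train_model(workflow, batch):
    # Stand-in for training a fresh model on the transformed batch.
    centered = [x - workflow["mean"] for x in batch]
    return {"bias": sum(centered) / len(centered)}


def run_iteration(batch, fit_workflow, train_model, repo):
    """Refit, retrain from scratch, and export into the next version directory."""
    repo.mkdir(parents=True, exist_ok=True)
    workflow = fit_workflow(batch)
    model = train_model(workflow, batch)
    versions = [int(p.name) for p in repo.iterdir() if p.is_dir() and p.name.isdigit()]
    target = repo / str(max(versions, default=0) + 1)
    target.mkdir()
    # Workflow and model are saved together so the deployed pair always matches.
    (target / "workflow.json").write_text(json.dumps(workflow))
    (target / "model.json").write_text(json.dumps(model))
    return target


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "ensemble"
    first = run_iteration([1.0, 2.0, 3.0], fit_workflow, train_model, repo)
    second = run_iteration([10.0, 20.0], fit_workflow, train_model, repo)
    exported_versions = sorted(p.name for p in repo.iterdir())
    latest_workflow = json.loads((second / "workflow.json").read_text())
```

Nothing in the loop depends on the previous iteration's state, which is what keeps this option free of NVT and Models changes.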

2) Scheduling: Automate the above using MLOps tooling (e.g. Airflow, Metaflow, etc.) so that it can be triggered on a regular basis. (Systems changes only)

3) Fine-tuning: Extend Merlin Models to make it possible to continue training from the previous model, expanding embedding tables as required. (Merlin Models changes only)
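For the "expanding embedding tables" part, the core operation is to grow the table while keeping the trained rows untouched. Below is a minimal NumPy sketch under that assumption; `expand_embeddings` is a hypothetical helper rather than a Merlin Models API, and the small random initialization for new rows is just one possible choice.

```python
import numpy as np


def expand_embeddings(old_table, new_num_rows, rng, scale=0.01):
    """Return a larger table whose first rows are the previously trained ones."""
    old_rows, dim = old_table.shape
    if new_num_rows < old_rows:
        raise ValueError("embedding tables can only grow")
    new_table = np.empty((new_num_rows, dim), dtype=old_table.dtype)
    new_table[:old_rows] = old_table
    # Rows for newly seen categories start near zero and are learned in fine-tuning.
    new_table[old_rows:] = rng.normal(0.0, scale, size=(new_num_rows - old_rows, dim))
    return new_table


rng = np.random.default_rng(0)
trained = np.arange(6, dtype=np.float32).reshape(3, 2)
expanded = expand_embeddings(trained, 5, rng)
```

This only works if category IDs from the previous fit are preserved, so that row `i` still refers to the same item after the update.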

4) Streaming feature updates: Provide feature store integrations that allow streaming updates of e.g. interacted items from serving logs. As long as any NVT workflows and ML/DL models in the serving ensemble match each other, it doesn't matter if they're completely up to date w.r.t. the item catalog, since they'll encode any new items they don't know about in a user's interacted items as unknown. (No updates to NVT Workflows or Systems ensembles required, new functionality in Systems only)
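The "encode as unknown" behavior can be pictured as follows. This is a toy sketch, not the actual serving path: `encode_history` and the vocabulary contents are hypothetical, and using 0 as the unknown index is an assumption.

```python
def encode_history(item_ids, vocab, unknown_id=0):
    """Map a streamed interaction history through the deployed (frozen) vocabulary."""
    return [vocab.get(item, unknown_id) for item in item_ids]


# Vocabulary as of the last workflow fit; the model was trained against these IDs.
deployed_vocab = {"item_a": 1, "item_b": 2}

# Fresh history from the feature store, including an item added after the last fit.
history = ["item_b", "item_new", "item_a"]
encoded = encode_history(history, deployed_vocab)
```

The model still receives valid indices, so serving keeps working; the new item simply carries no item-specific signal until the next refit.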

Someday/maybe:

5) Streaming updates to NVT Workflow statistics. Assuming you're refitting workflows, retraining models, and deploying the new versions on a regular basis, I'm honestly not sure you need this and it sounds like more trouble than it's worth.

[RMP] Iterative Recommender System Updating #672
