NVIDIA-Merlin / Merlin

NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.
Apache License 2.0

[RMP] Iterative Recommender System Updating #672

Open EvenOldridge opened 1 year ago

EvenOldridge commented 1 year ago

Problem:

Many production recommenders aren't trained from a fixed dataset. Instead, they are updated iteratively, using the existing model as the basis and training on the most recent data. Supporting this requires a number of changes across the Merlin ecosystem.

Goal:

Provide support for iteratively training recommender systems.

New Functionality:

Constraints:

Starting Point:

In several recent competitions the research/kgmon team used this technique to keep the model more up to date. This should give us a template for how to update the model on the training side.

The NVTabular work is likely to be a major effort, since every op needs updating, but once one or two ops are complete most of the others should be able to follow that template. The exception will be Categorify, which will require a total rewrite.

Handling streaming data in Systems to keep NVT stats up to date is entirely new functionality and needs to be scoped.

karlhigley commented 1 year ago

This roadmap ticket seems ambitious and larger than the goal stated in the title requires, since I don't think regularly retraining models requires online updates. It's not that the full scope as described here isn't desirable; it just doesn't all have to be part of the same feature/release/ticket in order to have value for our customers.

If it were me, I'd split it like this, so that the first roadmap issue has immediate customer value and every subsequent issue provides upgraded functionality with incremental value:

1) The dirt simple approach: Use the same NVT feature engineering Workflow graph for every iteration, but fit it to a new batch of training data. Transform the training data with the updated Workflow and train a new model on it from scratch. Deploy the new workflow and model(s) by exporting with the existing Systems ensemble graph and swapping out the saved artifacts. (No NVT or Models changes required, minor Systems changes only.)

2) Scheduling: Automate the above using MLOps tooling (e.g. Airflow, Metaflow) so that it can be triggered on a regular basis. (Systems changes only)

3) Fine-tuning: Extend Merlin Models to make it possible to continue training from the previous model, expanding embedding tables as required. (Merlin Models changes only)

4) Streaming feature updates: Provide feature store integrations that allow streaming updates of e.g. interacted items from serving logs. As long as the NVT workflows and ML/DL models in the serving ensemble match each other, it doesn't matter if they're completely up to date w.r.t. the item catalog, since they'll encode any new items they don't know about in a user's interacted items as unknown. (No updates to NVT Workflows or Systems ensembles required, new functionality in Systems only)
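
To make items 3 and 4 a bit more concrete, here is a minimal plain-Python sketch of the two ideas: ids the vocabulary has never seen encode as "unknown" at serving time, and on the next iteration the vocabulary and the embedding table are extended while the rows already trained are kept. Everything in it (`Vocab`, `grow_embeddings`, reserving index 0 for unknown, the initialisation scale) is a hypothetical illustration and not the NVTabular Categorify or Merlin Models API.

```python
from __future__ import annotations

import random

UNKNOWN = 0  # assumption: index 0 is reserved for ids the vocabulary has never seen


class Vocab:
    """Maps raw item ids to contiguous indices; unseen ids map to UNKNOWN."""

    def __init__(self) -> None:
        self.index: dict[str, int] = {}

    def extend(self, ids) -> int:
        """Append unseen ids after the existing ones; return how many were added."""
        added = 0
        for raw in ids:
            if raw not in self.index:
                self.index[raw] = len(self.index) + 1  # existing indices never move
                added += 1
        return added

    def encode(self, ids) -> list[int]:
        """Encode ids with the current vocabulary, falling back to UNKNOWN."""
        return [self.index.get(raw, UNKNOWN) for raw in ids]


def grow_embeddings(table, new_rows, dim, rng):
    """Return a larger table: trained rows are copied, new rows are freshly initialised."""
    fresh = [[rng.gauss(0.0, 0.01) for _ in range(dim)] for _ in range(new_rows)]
    return [row[:] for row in table] + fresh


rng = random.Random(0)
dim = 4

# First iteration: three known items plus the reserved unknown row.
vocab = Vocab()
vocab.extend(["a", "b", "c"])
table = grow_embeddings([[0.0] * dim], len(vocab.index), dim, rng)

# Serving before the next retrain: item "d" is new, so it encodes as unknown.
before = vocab.encode(["a", "d"])

# Next iteration: extend the vocabulary, grow the table, then continue training.
old_table = table
added = vocab.extend(["c", "d", "e"])
table = grow_embeddings(old_table, added, dim, rng)
after = vocab.encode(["a", "d"])
```

The point of appending rather than re-indexing is that every previously learned embedding row stays valid, so training can resume from the previous checkpoint instead of starting over.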

Someday/maybe:

5) Streaming updates to NVT Workflow statistics. Assuming you're refitting workflows, retraining models, and deploying the new versions on a regular basis, I'm honestly not sure you need this and it sounds like more trouble than it's worth.
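
For scoping purposes, this is roughly what item 5 would mean for a single normalisation-style statistic: keep a small summary (count, mean, sum of squared deviations) and merge the summary of each new batch into it, without revisiting old data. The sketch below uses the standard pairwise merge formula for mean and variance; `RunningStats` is a hypothetical name, and nothing here reflects how NVTabular ops store their statistics today.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    @classmethod
    def from_batch(cls, values):
        """Summarise one batch of numeric values."""
        values = list(values)
        n = len(values)
        if n == 0:
            return cls()
        mean = sum(values) / n
        return cls(n, mean, sum((v - mean) ** 2 for v in values))

    def merge(self, other):
        """Combine two summaries without access to the raw data behind them."""
        n = self.count + other.count
        if n == 0:
            return RunningStats()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / n
        return RunningStats(n, mean, m2)

    @property
    def std(self):
        """Population standard deviation of everything merged so far."""
        return math.sqrt(self.m2 / self.count) if self.count else 0.0


old = RunningStats.from_batch([1.0, 2.0, 3.0])   # statistics from the last fit
new = RunningStats.from_batch([4.0, 5.0])        # statistics from a streamed batch
merged = old.merge(new)

# Reference: statistics computed over all of the data at once.
full = RunningStats.from_batch([1.0, 2.0, 3.0, 4.0, 5.0])
```

Simple moments like these merge cleanly; statistics such as category vocabularies or quantiles are harder to update in place, which is part of why this item may be more trouble than it's worth.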