NVIDIA-Merlin / Merlin

NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.
Apache License 2.0

[RMP] Add PyTorch backend in Merlin Models #893

Open marcromeyn opened 1 year ago

marcromeyn commented 1 year ago

Problem:

We are currently in a situation where some customers use Merlin Models and some use T4Rec to train models. The APIs of these two tools have diverged quite dramatically, and some features (like extracting embeddings out of models) are only supported in Merlin Models. Both tools require some work in order to have easy-to-use APIs.

On the Merlin Models side, we are in an in-between state where (because of time pressure) there are a bunch of V1 and V2 classes. We would like to migrate all our users to the V2 classes (while removing V2 from the name) and deprecate the old classes.

On the T4Rec side, we would like to keep using this project for session-based models in PyTorch because of the traction it has gained. The idea would be to break out the core model-building parts (the block API) in favor of the PyTorch backend of Merlin Models. This roadmap-level ticket focuses on this new PyTorch backend; integration into T4Rec is left for later. The first major deliverable of this backend is the creation of retrieval models, because we typically frame session-based models as retrieval models.
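For context, a minimal sketch of what framing a session-based model as a retrieval model can look like: a session encoder produces a "query" embedding that is scored against an item-embedding table by dot product. All names here are illustrative only, not the Merlin Models API.

```python
import torch
import torch.nn as nn

# Hypothetical two-tower sketch: the session tower encodes an item-id sequence,
# and candidate items are scored by dot product against the item-embedding table.
class TwoTowerRetrieval(nn.Module):
    def __init__(self, num_items: int, dim: int = 64):
        super().__init__()
        self.item_embeddings = nn.Embedding(num_items, dim)
        self.session_encoder = nn.GRU(dim, dim, batch_first=True)

    def forward(self, item_id_sequences: torch.Tensor) -> torch.Tensor:
        # item_id_sequences: (batch, seq_len) of item ids
        seq = self.item_embeddings(item_id_sequences)
        _, hidden = self.session_encoder(seq)
        query = hidden[-1]                             # (batch, dim) session embedding
        # Score every candidate item for each session
        return query @ self.item_embeddings.weight.T   # (batch, num_items)


model = TwoTowerRetrieval(num_items=1000)
scores = model(torch.randint(0, 1000, (8, 20)))
print(scores.shape)  # torch.Size([8, 1000])
```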

Goal:

Reach feature parity and rough API parity between the TF and PyTorch backends in Merlin Models. This roadmap ticket focuses on PyTorch; a future roadmap ticket will focus on TF.

New Functionality

Constraints:

Starting Point:

In order to properly plan out the work, a dev branch was created to answer various design questions around creating retrieval models in PyTorch. This has led to a rough MVP that contains all the major pieces, and has also given us a better idea of how to break things down to turn the MVP into a fully fledged product.

We are planning to have people work in parallel on four major parts: inputs, outputs, models & masking.

Implement base-classes of block-API in PyTorch

People: @marcromeyn

Currently, the block API in T4Rec uses a design similar to Keras to allow modules to lazily initialize their variables. We would like to deprecate this in favor of a native way to achieve the same thing that was launched recently in PyTorch.
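As a point of reference, a minimal sketch of native lazy initialization using PyTorch's built-in `nn.LazyLinear` (the `MLPBlock` name is hypothetical, not the Merlin Models API):

```python
import torch
import torch.nn as nn

# A block whose inner layer infers its input size on the first forward pass,
# so the module can be constructed without knowing feature dimensions upfront.
class MLPBlock(nn.Module):
    def __init__(self, out_features: int):
        super().__init__()
        # LazyLinear defers weight creation until the first batch is seen
        self.linear = nn.LazyLinear(out_features)
        self.activation = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.activation(self.linear(x))


block = MLPBlock(out_features=32)
batch = torch.randn(8, 128)     # input dim (128) only known at call time
print(block(batch).shape)       # torch.Size([8, 32])
```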

Masking

People: @sararb, @gabrielspmoreira & @marcromeyn

This work depends on answering the design question of how to handle ragged tensors.
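To make the ragged-tensor question concrete, a hypothetical sketch (not the chosen design) of one common layout: variable-length sequences stored as a flat values tensor plus row offsets, padded into a dense batch with a boolean mask that downstream masking logic could consume.

```python
import torch

# Ragged layout: all item ids concatenated, plus offsets marking row boundaries.
values = torch.tensor([1, 2, 3, 4, 5, 6])   # all items, concatenated
offsets = torch.tensor([0, 2, 6])           # row i spans offsets[i]:offsets[i+1]

lengths = offsets[1:] - offsets[:-1]         # tensor([2, 4])
max_len = int(lengths.max())

padded = torch.zeros(len(lengths), max_len, dtype=values.dtype)
mask = torch.zeros(len(lengths), max_len, dtype=torch.bool)
for row, (start, end) in enumerate(zip(offsets[:-1].tolist(), offsets[1:].tolist())):
    padded[row, : end - start] = values[start:end]
    mask[row, : end - start] = True

print(padded)   # tensor([[1, 2, 0, 0], [3, 4, 5, 6]])
print(mask)     # padding mask usable by masking schemes downstream
```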

Tasks: TODO

Input-blocks

People: @marcromeyn

PyTorch

Starting point: MVP

Output-blocks

People: @edknv & @marcromeyn

Models

People: @edknv & @marcromeyn

Starting point: MVP

One of the leading questions in the initial experimentation phase was whether we could leverage PyTorch Lightning for a high-level training API (similar to how we use Keras on the TF side). We are confident that PyTorch Lightning is the right path forward.
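For illustration only, a minimal sketch of how a plain `torch.nn.Module` might be wrapped for training with PyTorch Lightning; the class and variable names are assumptions, not the final Merlin Models API.

```python
import torch
import pytorch_lightning as pl

class LightningWrapper(pl.LightningModule):
    """Illustrative wrapper: train any torch.nn.Module with Lightning's Trainer."""

    def __init__(self, model: torch.nn.Module, learning_rate: float = 1e-3):
        super().__init__()
        self.model = model
        self.learning_rate = learning_rate
        self.loss_fn = torch.nn.BCEWithLogitsLoss()

    def training_step(self, batch, batch_idx):
        features, targets = batch
        logits = self.model(features)
        loss = self.loss_fn(logits.squeeze(-1), targets.float())
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.learning_rate)


# Hypothetical usage, assuming `model` and `train_loader` are defined:
# trainer = pl.Trainer(max_epochs=1, accelerator="gpu", devices=1)
# trainer.fit(LightningWrapper(model), train_dataloaders=train_loader)
```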

Documentation

viswa-nvidia commented 1 year ago

@marcromeyn, please create the tasks for PyTorch and the corresponding tickets so that we can assign them

EvenOldridge commented 1 year ago

@marcromeyn @gabrielspmoreira can you work to split this up into: Ranking, Retrieval, and Session-based

marcromeyn commented 1 year ago

Ranking ticket is here: https://github.com/NVIDIA-Merlin/Merlin/issues/1044