[RMP] Support pre-trained vector embeddings as input features into a model via the dataloader

karlhigley commented 2 years ago

### Tasks
- [ ] Add a draft title or issue reference here

Problem:

Customers need a way to load embeddings that have been pretrained or trained from separate models into the model. See https://github.com/NVIDIA-Merlin/Merlin/issues/471

Goal:

Enable dataloading of separate embedding tables without having to add these embeddings into the interaction data during training. For serving those embeddings need to be provided in the request to the model. The feature must be ueseable in production setting

Constraints:

[ ] External embedding tables may not fit on GPU.
[ ] Non-trainable embeddings
[ ] Fits in CPU memory, Larger than CPU memory is left for potential future work
[ ] Not generating the embedding on the fly (future work)

Supporting pre-trained vector embeddings as features would provide baseline support for multi-modal use cases that rely on outside models to generate image/text embeddings.

NVTabular

[x] https://github.com/NVIDIA-Merlin/NVTabular/pull/1692
[x] Update the c++ versions of categorify serving to match the new functionality
[x] https://github.com/NVIDIA-Merlin/NVTabular/issues/1748
[x] #972
[x] #971
[ ] Feed pre-trained embeddings to NVTabular Is this part of this RMP ticket?

Core

[x] NVIDIA-Merlin/core#238

Dataloader

[x] NVIDIA-Merlin/dataloader#31
[x] NVIDIA-Merlin/dataloader#32
[x] NVIDIA-Merlin/dataloader#34
[ ] Modify the padding operator to only allow padding values of 0 (in conjunction with the changes to categorify)

Transformers4Rec

These features under T4R will not be in scope for this RMP ticket. The development will happen in Models. PR implementing pre-trained support in T4Rec: https://github.com/NVIDIA-Merlin/Transformers4Rec/pull/690

[x] NVIDIA-Merlin/Transformers4Rec#682
[x] NVIDIA-Merlin/Transformers4Rec#683
[x] NVIDIA-Merlin/Transformers4Rec#684
[x] NVIDIA-Merlin/Transformers4Rec#685
[x] NVIDIA-Merlin/Transformers4Rec#500
[x] https://github.com/NVIDIA-Merlin/Transformers4Rec/issues/485

Models (TF API)

PR #1083 implementing pre-trained support in MM

[x] NVIDIA-Merlin/models#1071
[x] NVIDIA-Merlin/models#1068
[x] NVIDIA-Merlin/models#1073
[x] NVIDIA-Merlin/models#1070
[x] NVIDIA-Merlin/models#1072

Merlin Systems

[x] NVIDIA-Merlin/systems#210

Examples

[x] NVIDIA-Merlin/Transformers4Rec#501
[x] NVIDIA-Merlin/models#788
[x] NVIDIA-Merlin/Merlin#886
[x] https://github.com/NVIDIA-Merlin/NVTabular/pull/1827 ( This ticket closes all NVT work )

Documentation

[x] #1000
[ ] #1001

radekosmulski commented 2 years ago

Ok, this issue now makes much more sense to me 🙂 I created a PR NVIDIA-Merlin/models#508 but I think this is just a tiny step on this. Not sure what would be the logical next step here.

I certainly need to continue to bring myself up to speed with Merlin Models, I still only have a very narrow understanding of all the components and how they fit together, but regardless, I wonder what the next steps on this could be? @karlhigley, if you could offer a suggestion, that would be greatly appreciated 🙂 This is my first run-in with an RMP issue

karlhigley commented 2 years ago

I'm honestly not entirely sure either! I captured this issue because I heard you were already working on it, but it's mostly a placeholder for a discussion on the scope of what we'd want to do and where that falls in terms of our team priorities. I don't think we've had that conversation yet, and I'm not entirely sure how/where it would happen either (given time zones etc.)

karlhigley commented 2 years ago

I put your face on it less to signal that you're responsible for the whole thing (I don't think you are), and more to signal that you'd be the person who is already doing relevant work and probably would have worthwhile thoughts about what we ought to be able to do with pre-trained embeddings.

radekosmulski commented 2 years ago

Thank you very much @karlhigley for these thoughts, they are very helpful! 🙂 Makes a lot of sense.

JUst wanted to reference NVIDIA-Merlin/models#508 -- we now have a use case for using pretrained embeddings, but don't have a good way of freezing them I believe. Would be very good to have this option as this is what likely most users would want.

rnyak commented 2 years ago

@EvenOldridge @karlhigley we now have an example for using pre-trained embeddings in MMs, and have a way of freezing them. fyi.

EvenOldridge commented 2 years ago

https://github.com/NVIDIA-Merlin/Merlin/issues/471 has details on the customer request side.

rnyak commented 2 years ago

NVIDIA-Merlin/Merlin#471 has details on the customer request side.

@EvenOldridge yes we need this for TF4Rec. And I created this ticket https://github.com/NVIDIA-Merlin/Transformers4Rec/issues/475 for that.

karlhigley commented 2 years ago

@EvenOldridge If I'm understanding correctly, it sounds like the underlying customer request involves the dataloaders, the T4R library itself, and Merlin Systems (but not NVT.) Would it make sense to scope this issue more tightly to the customer request and punt additional features to a subsequent issue?

karlhigley commented 2 years ago

It also sounds like the customer request necessarily involves having PyTorch serving for T4R worked out. Assuming that the (known-to-be-slow) Python serving isn't sufficient, sounds like we'll need to work out the issues with Torchscript serving.

rhdong commented 2 years ago

To my best knowledge, TensorFlow has a warm start mechanism as a similar function. I think they have a meaningful design; maybe we can be inspired by it: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/warm_starting_util.py#L419 I know some end-users are using these APIs for pre-training, and the regular expression can give the user more convenience.

bschifferer commented 2 years ago

ToDo: How to integrate pre-trained embedding in schema file (tagging) and is used in architecture definition

karlhigley commented 2 years ago

How to integrate pre-trained embedding in schema file (tagging)

Adding Tags.EMBEDDING as a "prefab" tag in the Merlin Core schema implementation seems like it could make sense 👍🏻

gabrielspmoreira commented 1 year ago

When the embedding table are not huge and fit GPU memory, the new PretrainedEmbeddingsInitializer ( https://github.com/NVIDIA-Merlin/Transformers4Rec/pull/572 ) can be used to initialize the embedding matrix with pre-trained embeddings and set them to trainable or not.

bschifferer commented 1 year ago

I am not sure, if the main ticket is uptodate. In some meetings, we say, that the feature is almost done but there are many Tickets not checked (finished). I looked into pre-trained embedding functionality of the dataloader and tried to provide a simple example for a minimal definition of done. That doesn't mean, that this simple example represents the definition of done - that's how I imagine to use this feature.

I did only looked at the TensorFlow side and haven't tested the PyTorch side (assuming it works the same)?

Open ToDos (from my point of view):

BUG: Get an key error when combining target with pre-trained embeddings : KeyError: 'target'
BUG: Sequence Features are not embedded correctly
FEATURE: Convert input columns to emb_ids ( nvt.ops.LambdaOp(lambda x: x.map(emb1_map)) ) - this is similar to a request with have for GTC Recommender. I am not sure, if we want to do this in NVTabular OR if we apply this mapping in the dataloader
FEATURE: Merlin Models needs to use pre-trained embedding in the model architecture and use it for training. This should work for ranking models, retrieval models and session-based models. (For special architectures, such as DLRM, it should throw a meaningful error, if the pre-trained embedding do not fit)
FEATURE: Transformers4Rec needs to use pre-trained embedding in the model architecture and use it for training.
~FEATURE: Schema Object needs to represent the pre-trained embedding functionality, that MM and Transformers4Rec knows that this feature is a pre-trained embedding (more below)~ -> does already exist with dataloader.output_schema
[FEATURE: (Not sure if it does already exist) - provide the embeddings during serving]

I will explain more my assumptions and proposed open ToDos:

My assumption is that the user have a downstream process to generate embeddings (np_emb1 and np_emb2). I am not sure, if we can assume that the IDs in the dataset are matching the order of the numpy arrays. I assume there will be mapping tables to convert them (emb1_map and emb1_map). Either in NVT or dataloader, we should provide the functionality to map the input data to the IDs of the pre-trained embeddings.
~MM and Transformers4Rec defines the neural network architecture. They rely on the schema object. The current usability to set pre-trained embeddings in the dataloader as transforms do not modify the schema object. Therefore, MM and Transformers4Rec cannot know that they expect pre-trained embeddings. We need to modify the schema object to make this change aware. PROPOSAL (see code comments): We add the information to the schema object (e.g. schema['emb_id_1'].add(PreTrain(np_emb1, lookup_key='emb_id_1', embedding_name='emb_id_1'). It would be great, if we do not need to repeat the information in the dataloader (however, we cannot store the numpy object in the schema, so I guess, we need at least to provide the numpy object to the dataloader).~

BUGs:

If you uncomment #>> nvt.ops.AddMetadata(tags=[Tags.BINARY_CLASSIFICATION, Tags.TARGET]), next(iter(dataloader)) will fail

import os

os.environ["CUDA_VISIBLE_DEVICES"]="1"

import glob

from merlin.io import Dataset
from merlin.loader.tensorflow import Loader
from merlin.schema import Tags
from merlin.schema.tags import Tags

import numpy as np
import pandas as pd

import nvtabular as nvt
import merlin.models.tf as mm

import cudf

from merlin.dataloader.ops.embeddings import (  # noqa
    EmbeddingOperator,
    MmapNumpyEmbedding,
    NumpyEmbeddingOperator,
)

### Input
np_emb1 = np.random.rand(1000,10)
np_emb2 = np.random.rand(1000,20)
emb1_map = {
    10: 0,
    11: 1,
    12: 2,
    13: 3
}
emb2_map = {
    'a': 0,
    'b': 1,
    'c': 2,
    'd': 3
}
df = cudf.DataFrame({
    'emb_id_1': [10, 12, 11, 12, 11, 13],
    'emb_id_2': ['a', 'd', 'c', 'a', 'd', 'b'],
    'cat1': [1,5,6,3,5,7],
    'cat2': ['a', 'a', 'd', 'e', 'f', 'g'],
    'target': [0,1,1,0,1,0]
})

# NVTabular Workflow
emb1 = ['emb_id_1'] >> nvt.ops.LambdaOp(lambda x: x.map(emb1_map)) >> nvt.ops.AddTags([Tags.CATEGORICAL])
emb2 = ['emb_id_2'] >> nvt.ops.LambdaOp(lambda x: x.map(emb2_map)) >> nvt.ops.AddTags([Tags.CATEGORICAL])
cats = ['cat1', 'cat2'] >> nvt.ops.Categorify()
target = ['target'] #>> nvt.ops.AddMetadata(tags=[Tags.BINARY_CLASSIFICATION, Tags.TARGET])

features = emb1+emb2+cats+target
workflow = nvt.Workflow(features)

ds = Dataset(df)
workflow.fit(ds)
ds_transformed = workflow.transform(ds)
ds_transformed.compute()

data_loader = Loader(
    ds_transformed,
    batch_size=2,
    transforms=[
        NumpyEmbeddingOperator(
            np_emb1,
            lookup_key='emb_id_1',
            embedding_name='emb_id_1'
        ), 
        NumpyEmbeddingOperator(
            np_emb2, 
            lookup_key='emb_id_2',
            embedding_name='emb_id_2'
        )
    ],
    shuffle=False,
)
next(iter(dataloader))
model = mm.Model.from_block(
    mm.MLPBlock([64, 32]),
    data_loader.output_schema, 
    prediction_tasks=mm.BinaryOutput('target')
)
model.compile()
model.fit(data_loader)

Session-Based Bug: I do not know, if session-based is in the scope (given that Transformers4Rec is mentioned, I guess yes?). Although there are only 2x examples, the emb tensors is [6, 10] - it does not keep the sequential structure. I do not know what the representation is, but I think we might need to convert it to values and offsets (and the offsets are missing)?

emb = np.random.rand(1000,10)
df = cudf.DataFrame({
    'idx': [0,1,2,3,4,5,6,7,8,9],
    'id1': [[0, 1], [1,2,3,4],[2],[3],[4],[5],[6],[8],[9],[10]]
})

dataset = Dataset(df)
schema = dataset.schema
for col_name in ['id1']:
    schema[col_name] = schema[col_name].with_tags(Tags.CATEGORICAL)
dataset.schema = schema
embeddings_np = emb
data_loader = Loader(
    dataset,
    batch_size=2,
    transforms=[NumpyEmbeddingOperator(
        embeddings_np, 
        lookup_key='id1',
        embedding_name='emb'
    )],
    shuffle=False,
)
next(iter(data_loader))

viswa-nvidia commented 1 year ago

@sararb to update this ticket

NVIDIA-Merlin / Merlin