Ok, this issue now makes much more sense to me. I created a PR NVIDIA-Merlin/models#508, but I think this is just a tiny step on this. I'm not sure what the logical next step here would be.

I certainly need to continue bringing myself up to speed with Merlin Models; I still have only a narrow understanding of all the components and how they fit together. Regardless, I wonder what the next steps on this could be? @karlhigley, if you could offer a suggestion, that would be greatly appreciated. This is my first run-in with an RMP issue.
I'm honestly not entirely sure either! I captured this issue because I heard you were already working on it, but it's mostly a placeholder for a discussion on the scope of what we'd want to do and where that falls in terms of our team priorities. I don't think we've had that conversation yet, and I'm not entirely sure how/where it would happen either (given time zones etc.)
I put your face on it less to signal that you're responsible for the whole thing (I don't think you are), and more to signal that you'd be the person who is already doing relevant work and probably would have worthwhile thoughts about what we ought to be able to do with pre-trained embeddings.
Thank you very much @karlhigley for these thoughts, they are very helpful! Makes a lot of sense.
Just wanted to reference NVIDIA-Merlin/models#508 -- we now have a use case for using pre-trained embeddings, but I believe we don't have a good way of freezing them. It would be very good to have this option, since it is likely what most users will want.
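For reference, "freezing" in plain TF/Keras usually just means marking the embedding layer as non-trainable; a minimal generic sketch of the concept (not the Merlin Models API, and the table below is made up):

import numpy as np
import tensorflow as tf

# hypothetical pre-trained table: 1000 ids x 16 dims
pretrained = np.random.rand(1000, 16).astype("float32")
frozen_embedding = tf.keras.layers.Embedding(
    input_dim=pretrained.shape[0],
    output_dim=pretrained.shape[1],
    embeddings_initializer=tf.keras.initializers.Constant(pretrained),
    trainable=False,  # weights are excluded from gradient updates
)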
@EvenOldridge @karlhigley we now have an example for using pre-trained embeddings in MMs, and have a way of freezing them. fyi.
https://github.com/NVIDIA-Merlin/Merlin/issues/471 has details on the customer request side.
@EvenOldridge yes, we need this for T4Rec, and I created this ticket for it: https://github.com/NVIDIA-Merlin/Transformers4Rec/issues/475
@EvenOldridge If I'm understanding correctly, it sounds like the underlying customer request involves the dataloaders, the T4R library itself, and Merlin Systems (but not NVT.) Would it make sense to scope this issue more tightly to the customer request and punt additional features to a subsequent issue?
It also sounds like the customer request necessarily involves having PyTorch serving for T4R worked out. Assuming that the (known-to-be-slow) Python serving isn't sufficient, sounds like we'll need to work out the issues with Torchscript serving.
To the best of my knowledge, TensorFlow has a warm-start mechanism that serves a similar function. I think it has a meaningful design; maybe we can take inspiration from it: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/warm_starting_util.py#L419
I know some end users are using these APIs for pre-training, and the regular-expression matching of variable names gives users extra flexibility.
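For reference, a minimal sketch of how that warm-start API is typically called (TF1-style graph code; the checkpoint path is a placeholder), where the regular expression selects which variables get initialized from the pre-trained checkpoint:

import tensorflow as tf

# Called during graph construction, before variables are initialized;
# vars_to_warm_start is a regex over variable names, so only the embedding
# variables are restored from the pre-trained checkpoint here.
tf.compat.v1.train.warm_start(
    ckpt_to_initialize_from="/path/to/pretrained/checkpoint",  # placeholder
    vars_to_warm_start=".*embedding.*",
)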
ToDo: How to integrate pre-trained embeddings into the schema file (tagging) and use them in the architecture definition.
How to integrate pre-trained embeddings into the schema file (tagging): adding Tags.EMBEDDING as a "prefab" tag in the Merlin Core schema implementation seems like it could make sense.
When the embedding tables are not huge and fit in GPU memory, the new PretrainedEmbeddingsInitializer (https://github.com/NVIDIA-Merlin/Transformers4Rec/pull/572) can be used to initialize the embedding matrix with pre-trained embeddings and set them to trainable or not.
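A rough sketch of how I would expect that initializer to be used, assuming it wraps a weight matrix plus a trainable flag and is passed per column when building the input block; the class and argument names may differ from the merged API:

import numpy as np
import transformers4rec.torch as tr

# hypothetical [cardinality, dim] pre-trained table for the item id column
pretrained_item_embeddings = np.random.rand(1000, 64)
item_init = tr.PretrainedEmbeddingsInitializer(
    pretrained_item_embeddings,
    trainable=False,  # keep the pre-trained weights frozen
)
inputs = tr.TabularSequenceFeatures.from_schema(
    schema,  # assumed to be a Merlin Schema loaded elsewhere
    max_sequence_length=20,
    embeddings_initializers={"item_id": item_init},
)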
I am not sure whether the main ticket is up to date. In some meetings we say the feature is almost done, but there are many tickets that are not checked off (finished). I looked into the pre-trained embedding functionality of the dataloader and tried to provide a simple example as a minimal definition of done. That doesn't mean this simple example represents the definition of done; it is just how I imagine using this feature. I only looked at the TensorFlow side and haven't tested the PyTorch side (assuming it works the same).
Open ToDos (from my point of view):
- Using a target column together with pre-trained embeddings fails: KeyError: 'target'
- Mapping the input data to the IDs of the pre-trained embedding table (e.g. nvt.ops.LambdaOp(lambda x: x.map(emb1_map))) - this is similar to a request we have for the GTC Recommender. I am not sure whether we want to do this in NVTabular or apply this mapping in the dataloader.
- dataloader.output_schema
I will explain my assumptions and the proposed open ToDos in more detail:
- The pre-trained embeddings are provided as numpy arrays (np_emb1 and np_emb2). I am not sure whether we can assume that the IDs in the dataset match the order of the numpy arrays. I assume there will be mapping tables to convert them (emb1_map and emb2_map). Either in NVT or in the dataloader, we should provide the functionality to map the input data to the IDs of the pre-trained embeddings.
- The dataloader transforms do not modify the schema object. Therefore, Merlin Models and Transformers4Rec cannot know that they should expect pre-trained embeddings. We need to make this change visible in the schema object. PROPOSAL (see code comments and the sketch below): we add the information to the schema object (e.g. schema['emb_id_1'].add(PreTrain(np_emb1, lookup_key='emb_id_1', embedding_name='emb_id_1'))). It would be great if we did not need to repeat the information in the dataloader (however, we cannot store the numpy object in the schema, so I guess we need at least to provide the numpy object to the dataloader).
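A sketch of that proposal, using only the tag and property mechanisms that already exist in merlin-core (the property names are made up, and the variables refer to the full example below); the numpy table itself would still be handed to the dataloader operator:

# "embedding" is a stand-in tag; ds_transformed and np_emb1 come from the example below
schema = ds_transformed.schema
schema["emb_id_1"] = schema["emb_id_1"].with_tags(["embedding"]).with_properties(
    {"embedding_name": "emb_id_1", "embedding_dim": 10}  # hypothetical property names
)
ds_transformed.schema = schema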
BUGs:
- If the target column gets its tags via >> nvt.ops.AddMetadata(tags=[Tags.BINARY_CLASSIFICATION, Tags.TARGET]) (currently commented out in the workflow below), next(iter(data_loader)) will fail.

import os
os.environ["CUDA_VISIBLE_DEVICES"]="1"
import glob
from merlin.io import Dataset
from merlin.loader.tensorflow import Loader
from merlin.schema import Tags
import numpy as np
import pandas as pd
import nvtabular as nvt
import merlin.models.tf as mm
import cudf
from merlin.dataloader.ops.embeddings import ( # noqa
EmbeddingOperator,
MmapNumpyEmbedding,
NumpyEmbeddingOperator,
)
### Input
np_emb1 = np.random.rand(1000,10)
np_emb2 = np.random.rand(1000,20)
emb1_map = {
    10: 0,
    11: 1,
    12: 2,
    13: 3
}
emb2_map = {
    'a': 0,
    'b': 1,
    'c': 2,
    'd': 3
}
df = cudf.DataFrame({
    'emb_id_1': [10, 12, 11, 12, 11, 13],
    'emb_id_2': ['a', 'd', 'c', 'a', 'd', 'b'],
    'cat1': [1, 5, 6, 3, 5, 7],
    'cat2': ['a', 'a', 'd', 'e', 'f', 'g'],
    'target': [0, 1, 1, 0, 1, 0]
})
# NVTabular Workflow
emb1 = ['emb_id_1'] >> nvt.ops.LambdaOp(lambda x: x.map(emb1_map)) >> nvt.ops.AddTags([Tags.CATEGORICAL])
emb2 = ['emb_id_2'] >> nvt.ops.LambdaOp(lambda x: x.map(emb2_map)) >> nvt.ops.AddTags([Tags.CATEGORICAL])
cats = ['cat1', 'cat2'] >> nvt.ops.Categorify()
target = ['target'] #>> nvt.ops.AddMetadata(tags=[Tags.BINARY_CLASSIFICATION, Tags.TARGET])
features = emb1+emb2+cats+target
workflow = nvt.Workflow(features)
ds = Dataset(df)
workflow.fit(ds)
ds_transformed = workflow.transform(ds)
ds_transformed.compute()
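# Each NumpyEmbeddingOperator looks up the batch's lookup_key ids in the given
# numpy table and adds the embedding vectors to the batch under embedding_name.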
data_loader = Loader(
    ds_transformed,
    batch_size=2,
    transforms=[
        NumpyEmbeddingOperator(
            np_emb1,
            lookup_key='emb_id_1',
            embedding_name='emb_id_1'
        ),
        NumpyEmbeddingOperator(
            np_emb2,
            lookup_key='emb_id_2',
            embedding_name='emb_id_2'
        )
    ],
    shuffle=False,
)
next(iter(data_loader))
model = mm.Model.from_block(
    mm.MLPBlock([64, 32]),
    data_loader.output_schema,
    prediction_tasks=mm.BinaryOutput('target')
)
model.compile()
model.fit(data_loader)
Session-Based Bug: I do not know whether session-based is in scope (given that Transformers4Rec is mentioned, I guess yes?). Although the batch contains only 2 examples, the emb tensor is [6, 10] - it does not keep the sequential structure. I do not know what the representation should be, but I think we might need to convert it to values and offsets (and the offsets are missing)? See the sketch after the code below.
emb = np.random.rand(1000,10)
df = cudf.DataFrame({
    'idx': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    'id1': [[0, 1], [1, 2, 3, 4], [2], [3], [4], [5], [6], [8], [9], [10]]
})
dataset = Dataset(df)
schema = dataset.schema
for col_name in ['id1']:
    schema[col_name] = schema[col_name].with_tags(Tags.CATEGORICAL)
dataset.schema = schema
embeddings_np = emb
data_loader = Loader(
    dataset,
    batch_size=2,
    transforms=[NumpyEmbeddingOperator(
        embeddings_np,
        lookup_key='id1',
        embedding_name='emb'
    )],
    shuffle=False,
)
next(iter(data_loader))
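A sketch of the values/offsets representation I have in mind, for the first batch of the list column id1 (rows [0, 1] and [1, 2, 3, 4]) and the emb table from the code above; the looked-up embeddings arrive flattened, so offsets would be needed to recover the session boundaries:

import numpy as np

values = np.array([0, 1, 1, 2, 3, 4])      # flattened item ids of the 2-row batch
offsets = np.array([0, 2, 6])              # row i spans values[offsets[i]:offsets[i+1]]
emb_values = emb[values]                    # shape [6, 10], matching the observed tensor
first_session = emb_values[offsets[0]:offsets[1]]   # [2, 10]
second_session = emb_values[offsets[1]:offsets[2]]  # [4, 10]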
@sararb to update this ticket
Problem:
Customers need a way to load embeddings that have been pretrained or trained from separate models into the model. See https://github.com/NVIDIA-Merlin/Merlin/issues/471
Goal:
Enable dataloading of separate embedding tables without having to add these embeddings into the interaction data during training. For serving, those embeddings need to be provided in the request to the model. The feature must be usable in a production setting.
Constraints:
Supporting pre-trained vector embeddings as features would provide baseline support for multi-modal use cases that rely on outside models to generate image/text embeddings.
NVTabular
Core
Dataloader
Transformers4Rec
These features under T4Rec will not be in scope for this RMP ticket; the development will happen in Models. Related PR implementing pre-trained support in T4Rec: https://github.com/NVIDIA-Merlin/Transformers4Rec/pull/690
Models (TF API)
PR #1083 implementing pre-trained support in MM
Merlin Systems
Examples
Documentation