NVIDIA-Merlin / models

Merlin Models is a collection of deep learning recommender system model reference implementations
https://nvidia-merlin.github.io/models/main/index.html
Apache License 2.0

[QST] Performance differences between `encode()` vs `__call__()` on tf Encoder block in CPU #1213

Open lecardozo opened 1 year ago

lecardozo commented 1 year ago

❓ Questions & Help

What is the preferred way of generating predictions from a trained Encoder from a TwoTowerModelV2? There seem to be at least two ways of doing that, with apparently huge performance differences.

Details

After training a TwoTowerModelV2, I noticed a huge performance difference between calling each tower's encode() method (e.g. model.query_encoder.encode()) and calling the tower directly (model.query_encoder()), on a single node with CPU only.

Setup

import pandas as pd
import nvtabular as nvt
from merlin.schema import Tags

# Encoder
query_encoder = trained_two_tower_model.query_encoder

# Raw Features
features = pd.DataFrame(...)

# Transformed features with nvt.Workflow
query_preprocessor = workflow.get_subworkflow("query_preprocessor")
data = nvt.Dataset(features, schema=self._user_schema)
transformed_data = query_preprocessor.transform(data)

Calling encode()

This takes more than 1 hour for 434,457 rows. Resource usage metrics show that the CPU is idle most of the time, which is quite unexpected.

outputs = query_encoder.encode(transformed_data, batch_size=1024, index=Tags.USER_ID).compute()

I tried increasing the number of partitions of the transformed dataset and setting .compute(scheduler='processes') to benefit from Dask's parallelism, but it didn't work (it failed with serialization issues).
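For reference, the attempt looked roughly like this (the repartition call and the npartitions value are filled in for illustration; this is not a recommendation):

# Repartition the transformed dataset and use the multiprocessing scheduler
transformed_data = transformed_data.repartition(npartitions=8)

outputs = query_encoder.encode(
    transformed_data, batch_size=1024, index=Tags.USER_ID
).compute(scheduler="processes")  # fails with serialization errors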

Calling __call__() with Loader

This takes ~30 seconds for the same 434,457 rows. As my data fits into memory, this ended up being the clear winner.

import numpy as np
import merlin.models.tf as mm

outputs = []
for inputs, _ in mm.Loader(transformed_data, batch_size=1024, shuffle=False):
    outputs.append(query_encoder(inputs))

output = np.concatenate(outputs)

Is this difference expected or am I doing something wrong?

rnyak commented 1 year ago

@lecardozo you can check out the "Generate top-K recommendations" section in this example notebook, which shows how to generate top-K recommendations for a given batch. You can then loop over the batches and concatenate the outputs, roughly as sketched below.
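(The names candidate_features and user_dataset below are placeholders, and the exact output structure of the top-k encoder call may differ; see the notebook for the authoritative version.)

import numpy as np
import merlin.models.tf as mm

# Convert the trained retrieval model into a top-K encoder over the candidate set
topk_model = trained_two_tower_model.to_top_k_encoder(candidate_features, k=10)

all_scores, all_ids = [], []
for batch, _ in mm.Loader(user_dataset, batch_size=1024, shuffle=False):
    scores, ids = topk_model(batch)  # assumed to unpack into (scores, candidate ids)
    all_scores.append(np.asarray(scores))
    all_ids.append(np.asarray(ids))

scores = np.concatenate(all_scores)
ids = np.concatenate(all_ids)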

lecardozo commented 1 year ago

Thanks for the answer @rnyak!

Sorry, I think I wasn't clear before. I'm looking specifically for a way of generating embeddings for queries/candidates independently, instead of generating recommendations. The idea is to index the candidate embeddings in an external vector search engine and use ANN search for retrieval later.
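For concreteness, the indexing step I have in mind looks roughly like this (FAISS is just one possible engine here, and the random arrays are stand-ins for the real query/candidate embeddings):

import faiss
import numpy as np

# Stand-ins for the candidate and query embedding matrices (float32, shape (n, dim))
candidate_vectors = np.random.rand(1000, 64).astype("float32")
query_vectors = np.random.rand(5, 64).astype("float32")

# Exact inner-product index; swap for e.g. faiss.IndexHNSWFlat for true ANN search
index = faiss.IndexFlatIP(candidate_vectors.shape[1])
index.add(candidate_vectors)

# Retrieve the top 10 candidates per query
scores, positions = index.search(query_vectors, 10)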

rnyak commented 1 year ago

@lecardozo the same notebook shows how to generate candidate and query embeddings.

from merlin.io import Dataset
from merlin.models.utils.dataset import unique_rows_by_features

queries = model.query_embeddings(Dataset(user_features, schema=schema.select_by_tag(Tags.USER)), 
                                 batch_size=1024, index=Tags.USER_ID)
query_embs_df = queries.compute(scheduler="synchronous").reset_index()

item_features = (
    unique_rows_by_features(train, Tags.ITEM, Tags.ITEM_ID).compute().reset_index(drop=True)
)
item_embs = model.candidate_embeddings(Dataset(item_features, schema=schema.select_by_tag(Tags.ITEM)), 
                                       batch_size=1024, index=Tags.ITEM_ID)
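and then, following the same pattern as the query embeddings:

item_embs_df = item_embs.compute(scheduler="synchronous").reset_index()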

hope that helps.

lecardozo commented 1 year ago

That was my first try, as I followed along the whole notebook. Since these methods are just thin wrappers around Encoder.encode(), we end up with the same performance issues I mentioned before (which is what made me look at the source code of these methods in the first place).
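For now I'm sticking with the Loader-based loop. Roughly, as a helper that also keeps the user-id index that encode(..., index=Tags.USER_ID) would return (the helper name and the schema lookup are just illustrative, not library API):

import numpy as np
import pandas as pd
import merlin.models.tf as mm
from merlin.schema import Tags

def encode_with_loader(encoder, dataset, schema, batch_size=1024):
    # Column tagged as USER_ID in the schema, used as the output index
    user_id_col = schema.select_by_tag(Tags.USER_ID).column_names[0]
    ids, outputs = [], []
    for inputs, _ in mm.Loader(dataset, batch_size=batch_size, shuffle=False):
        ids.append(np.asarray(inputs[user_id_col]).reshape(-1))
        outputs.append(encoder(inputs).numpy())
    return pd.DataFrame(np.concatenate(outputs), index=np.concatenate(ids))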