Closed shahjaidev closed 2 years ago
Hi @shahjaidev, thank you for your question!
The proposed architecture in our Transformers4Rec paper uses the weight-tying technique to link the input item embeddings and the outputs returned by the NextItemPredictionTask. With this technique, the model reuses the item embedding table as the weights of the output layer during training, so that the user sequence representation and the item embeddings lie in a compatible vector space. During inference you can therefore export the item embedding table to build the ANN index, but you also need to export the Transformer block to generate the user's representation.
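To make the shared-space point concrete, here is a minimal NumPy sketch of what weight tying implies at scoring time. The sizes and random tensors are hypothetical stand-ins; in the real model the embedding table is learned and the user representation comes out of the Transformer block.

```python
import numpy as np

rng = np.random.default_rng(0)
num_items, d_model = 1000, 64  # hypothetical sizes for illustration

# Item embedding table learned during training
item_embeddings = rng.standard_normal((num_items, d_model))

# With weight tying, the output layer's weight matrix IS the embedding table,
# so the next-item logits are dot products between the user representation
# (the Transformer block's output) and every item embedding.
def next_item_scores(user_repr):
    return item_embeddings @ user_repr

user_repr = rng.standard_normal(d_model)  # stand-in for the Transformer output
scores = next_item_scores(user_repr)
assert scores.shape == (num_items,)
```

Because scoring reduces to a dot product against the embedding table, an (approximate) maximum-inner-product search over that table recovers the model's top recommendations, which is exactly why the table can be exported to an ANN index.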
A typical pipeline would be:
During inference:
For more details about how to deploy a recommender system, you can check our example notebooks here; in particular, we showcase how to set up Faiss for ANN.
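The retrieval step can be sketched as follows. This is a pure-NumPy exact inner-product search used as a stand-in for the ANN index; with Faiss, `IndexFlatIP` (or an approximate index) performs the same inner-product search at scale. The sizes and random data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
num_items, d_model = 1000, 64  # hypothetical sizes

# Exported item embedding table (the tied output-layer weights)
item_embeddings = rng.standard_normal((num_items, d_model)).astype("float32")

# User representation produced by the exported Transformer block at request time
user_repr = rng.standard_normal(d_model).astype("float32")

# Exact maximum-inner-product search; a Faiss IndexFlatIP built over
# item_embeddings would return the same top-k neighbors.
k = 10
scores = item_embeddings @ user_repr
top_k_ids = np.argsort(-scores)[:k]
assert top_k_ids.shape == (k,)
```

Note that only the item side is precomputed: the query vector still has to come from a forward pass through the Transformer block over the user's current session.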
Please let us know if that answers your question.
@shahjaidev closing this issue for now. You can reopen it if you have further questions.
❓ Questions & Help
Details
If Transformers4Rec is used at training time (offline), does this necessitate that the model must do real time inference when deployed?
To put the question a bit more concretely: say I train a transformer on a dataset consisting of sequences of user item clicks (same session).
To do next item prediction at deployment, is it necessary to run inference on the trained transformer ?
The alternative (computationally cheaper) approach would be to use the trained transformer to pre-compute item embeddings (specifically, context vectors) for every item id. At inference time, simply run ANN given the current item's embedding. These item embeddings would be generated by passing 1-id sequences through the transformer and taking the context vector. Is this a reasonable idea?
I’m a bit hesitant to convince myself of the second approach because embeddings that come from a transformer are by nature context dependent and the whole premise of using the transformer was so that the current items context vector could attend to the previously observed items.
Would the context vectors resulting from 1-element sequences be any more powerful than simply running CBOW Word2Vec on clicked item sequences?
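The hesitation above can be checked directly: for a length-1 sequence, self-attention collapses to a fixed transform of the single item's embedding, so no contextual information is added. A minimal single-head sketch (with hypothetical random projection matrices standing in for learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# Hypothetical learned projections of one self-attention head
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def self_attention(X):
    """Single-head self-attention over a sequence X of shape (seq_len, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    logits = Q @ K.T / np.sqrt(d)
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

item_embedding = rng.standard_normal((1, d))  # a length-1 "sequence"
out = self_attention(item_embedding)
# With a single position, the softmax weight is exactly 1, so the output is
# just a linear transform of the item's own embedding -- no context attended.
assert np.allclose(out, item_embedding @ Wv)
```

In other words, a 1-id "context vector" is a deterministic function of the item embedding alone, so it carries no more sequential information than a static item embedding would.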