NVIDIA-Merlin / Transformers4Rec

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation and works with PyTorch.
https://nvidia-merlin.github.io/Transformers4Rec/main
Apache License 2.0

Evaluation metrics don't match with Recall on Predicts. #786

Open · korchi opened this issue 1 month ago

korchi commented 1 month ago

Discussed in https://github.com/NVIDIA-Merlin/Transformers4Rec/discussions/736

Originally posted by **korchi** August 3, 2023

Hi. First of all, thank you for the great tool you are developing. However, I have been puzzled for a week about how I can replicate the `trainer.evaluate()` results by running inference with the model. My initial idea was to truncate every session by one (removing the last item_id), call `trainer.predict(truncated_sessions)`, and then compute `recall(last_item_ids, predictions[:20])`. However, I am getting a different recall metric.

The only way I managed to "replicate" the `evaluate()` results was by (1) providing the non-truncated inputs to `trainer.predict()` and (2) changing `-1` into `-2` in https://github.com/NVIDIA-Merlin/Transformers4Rec/blob/348c9636399535c566d20e8ebff2b7aa0775f136/transformers4rec/torch/model/prediction_task.py#L460. I am puzzled why, but this was the only way I could ensure that the `x` in https://github.com/NVIDIA-Merlin/Transformers4Rec/blob/348c9636399535c566d20e8ebff2b7aa0775f136/transformers4rec/torch/model/prediction_task.py#L464 (during inference) is the same as the `x` in https://github.com/NVIDIA-Merlin/Transformers4Rec/blob/348c9636399535c566d20e8ebff2b7aa0775f136/transformers4rec/torch/model/prediction_task.py#L444 (during evaluation).

Is it because `trainer.evaluate()` shifts the inputs to the left by one position? Or what am I doing incorrectly? Could anyone give me some insight into how to do this "correctly", please? Thanks a lot.
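For reference, a minimal sketch of the recall@20 check described above. This is not the library's own metric code; it assumes `scores` holds the per-item logits obtained from `trainer.predict()` for each truncated session (one row per session), and `truncated_sessions` / `last_item_ids` are the hypothetical names from the post.

```python
# Minimal recall@k sketch, assuming `scores` has shape (n_sessions, n_items)
# and `last_item_ids` holds the held-out ground-truth item id per session.
import torch


def recall_at_k(scores: torch.Tensor, last_item_ids: torch.Tensor, k: int = 20) -> float:
    # Indices of the k highest-scoring items per session.
    topk = scores.topk(k, dim=-1).indices                      # (n_sessions, k)
    # A session is a hit if its held-out item appears among the top-k items.
    hits = (topk == last_item_ids.unsqueeze(-1)).any(dim=-1)   # (n_sessions,)
    return hits.float().mean().item()


# Hypothetical usage, assuming the prediction scores have already been collected:
# scores = torch.as_tensor(trainer.predict(truncated_sessions).predictions)
# print(recall_at_k(scores, last_item_ids, k=20))
```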
rnyak commented 1 month ago

@korchi Say your original input sequence is the item-id list [1, 2, 3, 4, 5]. If you want to compare `trainer.evaluate()` and `trainer.predict()`, the input sequences you feed to the model should be different. For predict, if your target item is the last item, which is 5, then you should feed [1, 2, 3, 4] as the input. The model will then use the first four entries to predict the fifth one, and you can compare that prediction with your ground-truth item id (5 in this example):

[1, 2, 3, 4] --> [1, 2, 3, 4, predicted_item_id]

Compare `predicted_item_id` with the ground truth.

Whereas for `trainer.evaluate()` you feed the entire sequence [1, 2, 3, 4, 5], and the evaluate function does its job and generates a predicted item id for the last item in the sequence by using the items before it.
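A minimal sketch of that input split, using plain Python lists. How the truncated sequences are written back into the dataset passed to `trainer.predict()` (padding, schema, file format) is an assumption about the surrounding pipeline and is left out here.

```python
# Build the two inputs rnyak describes: truncated sessions for predict,
# full sessions for evaluate. Sessions are plain lists of item ids.
sessions = [
    [1, 2, 3, 4, 5],
    [7, 8, 9],
]

# For trainer.predict(): drop the last item and keep it as the ground truth.
predict_inputs = [s[:-1] for s in sessions]   # [[1, 2, 3, 4], [7, 8]]
ground_truth = [s[-1] for s in sessions]      # [5, 9]

# For trainer.evaluate(): pass the full sessions; the evaluation loop itself
# holds out the last item and scores the model's prediction against it.
evaluate_inputs = sessions
```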

Please also see this example here.