NVIDIA-Merlin / Transformers4Rec

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation and works with PyTorch.
https://nvidia-merlin.github.io/Transformers4Rec/main
Apache License 2.0

Evaluation metrics don't match with Recall on Predicts. #786

Open · korchi opened this issue 1 month ago

korchi commented 1 month ago

Discussed in https://github.com/NVIDIA-Merlin/Transformers4Rec/discussions/736

Originally posted by **korchi** August 3, 2023

Hi. First of all, thank you for the great tool you are developing. However, I have been puzzled for a week about how I can replicate the `trainer.evaluate()` results by running inference with the model. My initial idea was to truncate every session by one (removing the last item_id), call `trainer.predict(truncated_sessions)`, and then compute `recall(last_item_ids, predictions[:20])`. However, I am getting a different recall metric.

The only way I managed to "replicate" the `evaluate()` results was by (1) providing the non-truncated inputs to `trainer.predict()` and (2) changing `-1` into `-2` in https://github.com/NVIDIA-Merlin/Transformers4Rec/blob/348c9636399535c566d20e8ebff2b7aa0775f136/transformers4rec/torch/model/prediction_task.py#L460. I am puzzled why, but this was the only way I could ensure that the `x` in https://github.com/NVIDIA-Merlin/Transformers4Rec/blob/348c9636399535c566d20e8ebff2b7aa0775f136/transformers4rec/torch/model/prediction_task.py#L464 (during inference) is the same as the `x` in https://github.com/NVIDIA-Merlin/Transformers4Rec/blob/348c9636399535c566d20e8ebff2b7aa0775f136/transformers4rec/torch/model/prediction_task.py#L444 (during evaluation).

Is it because `trainer.evaluate()` shifts the inputs to the left by one position? Or what am I doing incorrectly? Could anyone give me some insight into how to do this "correctly", please? Thanks a lot.
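For reference, a minimal sketch of the recall@20 check described above. This is not the library's own metric code; it assumes `scores` holds the per-item logits obtained from `trainer.predict()` for each truncated session (one row per session), and `truncated_sessions` / `last_item_ids` are the hypothetical names from the post.

```python
# Minimal recall@k sketch, assuming `scores` has shape (n_sessions, n_items)
# and `last_item_ids` holds the held-out ground-truth item id per session.
import torch


def recall_at_k(scores: torch.Tensor, last_item_ids: torch.Tensor, k: int = 20) -> float:
    # Indices of the k highest-scoring items per session.
    topk = scores.topk(k, dim=-1).indices                      # (n_sessions, k)
    # A session is a hit if its held-out item appears among the top-k items.
    hits = (topk == last_item_ids.unsqueeze(-1)).any(dim=-1)   # (n_sessions,)
    return hits.float().mean().item()


# Hypothetical usage, assuming the prediction scores have already been collected:
# scores = torch.as_tensor(trainer.predict(truncated_sessions).predictions)
# print(recall_at_k(scores, last_item_ids, k=20))
```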
rnyak commented 1 month ago

@korchi Say your original input sequence is the item-id list [1, 2, 3, 4, 5]. If you want to compare `trainer.evaluate()` and `trainer.predict()`, the input sequences you feed to the model should be different. For predict, if your target item is the last item, which is 5, then you should feed [1, 2, 3, 4] as the input. The model will then use the first four entries to predict the fifth one, and you can compare that prediction with your ground-truth item id (5 in this example):

[1, 2, 3, 4] --> [1, 2, 3, 4, predicted_item_id]

Compare `predicted_item_id` with the ground truth.

Whereas for `trainer.evaluate()` you feed the entire sequence [1, 2, 3, 4, 5], and the evaluate function does its job and generates a predicted item id for the last item in the sequence by using the items before it.
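A minimal sketch of that input split, using plain Python lists. How the truncated sequences are written back into the dataset passed to `trainer.predict()` (padding, schema, file format) is an assumption about the surrounding pipeline and is left out here.

```python
# Build the two inputs rnyak describes: truncated sessions for predict,
# full sessions for evaluate. Sessions are plain lists of item ids.
sessions = [
    [1, 2, 3, 4, 5],
    [7, 8, 9],
]

# For trainer.predict(): drop the last item and keep it as the ground truth.
predict_inputs = [s[:-1] for s in sessions]   # [[1, 2, 3, 4], [7, 8]]
ground_truth = [s[-1] for s in sessions]      # [5, 9]

# For trainer.evaluate(): pass the full sessions; the evaluation loop itself
# holds out the last item and scores the model's prediction against it.
evaluate_inputs = sessions
```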

Please also see this example here.