[🐛BUG] Wrong (maybe) calculation of precision topk metric in sequential models.

Describe the bug

I noticed that my model (SASRec) has a very good recall@10 but a poor precision@10: recall@10 is 70% and precision@10 is 7%. I could not believe this, because when I evaluated by hand, the model also performed very well on precision.

Then I tried to figure out exactly how these metrics are calculated, and as far as I can tell there may be a bug.

I will try to explain. The first thing I noticed is here:

# recbole/evaluator/collector.py
if self.register.need("rec.topk"):
    _, topk_idx = torch.topk(
        scores_tensor, max(self.topk), dim=-1
    )  # n_users x k
    pos_matrix = torch.zeros_like(scores_tensor, dtype=torch.int)
    pos_matrix[positive_u, positive_i] = 1
    pos_len_list = pos_matrix.sum(dim=1, keepdim=True)
    pos_idx = torch.gather(pos_matrix, dim=1, index=topk_idx)
    result = torch.cat((pos_idx, pos_len_list), dim=1)
    self.data_struct.update_tensor("rec.topk", result)

Here I can see that the shape of result is (batch_size, max(topk) + 1): max(topk) hit flags per row, plus one last column holding the number of positives. But with the sequential dataloader, each row will always have at most one positive item, because positive_u is just a torch.arange over the batch:

# recbole/data/dataloader/general_dataloader.py
interaction = self._dataset[index]
transformed_interaction = self.transform(self._dataset, interaction)
inter_num = len(transformed_interaction)
positive_u = torch.arange(inter_num)  # this is how positive_u is calculated for the sequential dataloader
positive_i = transformed_interaction[self.iid_field]
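
To make the effect concrete, here is a minimal, dependency-free sketch that mirrors the "rec.topk" logic quoted above in plain Python. The function name `collect_topk`, the toy scores, and the target items are made up for illustration and are not part of RecBole; the only thing taken from the code above is that positive_u is a plain range over the batch.

```python
def collect_topk(scores, positive_u, positive_i, k):
    """Plain-Python mirror of the "rec.topk" branch quoted above (illustrative only).

    Returns one row per user: k hit flags followed by the number of positives.
    """
    n_items = len(scores[0])
    # pos_matrix[u][i] = 1 marks item i as a positive for row u
    pos_matrix = [[0] * n_items for _ in scores]
    for u, i in zip(positive_u, positive_i):
        pos_matrix[u][i] = 1

    result = []
    for row_scores, row_pos in zip(scores, pos_matrix):
        # indices of the k highest scores, playing the role of torch.topk
        topk_idx = sorted(range(n_items), key=lambda j: row_scores[j], reverse=True)[:k]
        hits = [row_pos[j] for j in topk_idx]  # plays the role of torch.gather
        result.append(hits + [sum(row_pos)])   # plays the role of torch.cat with pos_len_list
    return result


# Toy batch: 3 sequences, 5 candidate items, made-up scores.
scores = [
    [0.1, 0.9, 0.3, 0.2, 0.5],
    [0.8, 0.1, 0.7, 0.2, 0.3],
    [0.2, 0.3, 0.1, 0.9, 0.4],
]
positive_u = list(range(len(scores)))  # like torch.arange(inter_num) in the sequential case
positive_i = [1, 2, 0]                 # one held-out target item per sequence

result = collect_topk(scores, positive_u, positive_i, k=3)
for row in result:
    print(row)
```

Because every row index appears exactly once in positive_u, the last column (the number of positives) is 1 for every row, and the hit flags in a row can never sum to more than 1.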

All in all, this leads to poor precision results: precision@k is the number of true positives in the top k divided by k, and here that number is always at most 1, whereas in a top-10 calculation it should be able to reach 10.
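
The arithmetic can be checked with a small sketch. It assumes the usual definitions (precision@k = hits / k, recall@k = hits / number of positives) and a made-up batch of 10 users with one positive each; the helper names are hypothetical, not RecBole functions. With exactly one positive per row, precision@k is simply recall@k divided by k, which matches the 70% and 7% reported above.

```python
def precision_at_k(hits, k):
    """Fraction of the top-k slots that are hits."""
    return sum(hits[:k]) / k


def recall_at_k(hits, pos_len, k):
    """Fraction of the user's positives found in the top k."""
    return sum(hits[:k]) / pos_len


k = 10
# Made-up batch: 10 users, one positive each; 7 users have it in their top 10.
rows = [[1] + [0] * 9 for _ in range(7)] + [[0] * 10 for _ in range(3)]

mean_precision = sum(precision_at_k(h, k) for h in rows) / len(rows)
mean_recall = sum(recall_at_k(h, 1, k) for h in rows) / len(rows)

print(round(mean_recall, 2), round(mean_precision, 2))
```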

I think that for sequential recommendation it would be better to predict the top-K items by appending K mask tokens to the end of the sequence and evaluating on the last K interactions. Right now evaluation is actually done on one item at a time (mask one item, evaluate on one positive). When we evaluate on only one positive, precision@k for k > 1 can never reach 100%: if we predict 2 items but the user has only one positive, precision@2 is at most 50%.
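
A rough sketch of the evaluation side of this idea, under stated assumptions: `split_last_k` is a hypothetical helper (not a RecBole API), the item ids are made up, and the model side (appending mask tokens) is left out entirely. It only compares the precision of the same perfect prediction when scored against one held-out item versus the last K held-out items.

```python
def split_last_k(sequence, k):
    """Hold out the last k interactions as targets (hypothetical helper)."""
    return sequence[:-k], sequence[-k:]


def precision_at_k(recommended, targets, k):
    """Fraction of the top-k recommended items that are in the target set."""
    target_set = set(targets)
    return sum(item in target_set for item in recommended[:k]) / k


sequence = [11, 42, 7, 19, 3, 25]  # made-up item ids, oldest first
history, targets = split_last_k(sequence, k=2)

perfect = [3, 25]  # a prediction that recovers both held-out items

# Scored against a single positive (only the very last item).
single = precision_at_k(perfect, targets[-1:], k=2)
# Scored against the last K interactions.
multi = precision_at_k(perfect, targets, k=2)

print(single, multi)
```

With a single positive the perfect prediction is capped at 1/2, while against the last two interactions it can reach 1.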

Thank you for your work, and sorry if I misunderstood something in your code and made wrong assumptions.

RUCAIBox / RecBole, issue #1967