NVIDIA-Merlin / Transformers4Rec

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation that works with PyTorch.
https://nvidia-merlin.github.io/Transformers4Rec/main
Apache License 2.0

[BUG] XLNET-CLM eval recall metric value does not match with custom np based recall metric value #719

Open rnyak opened 1 year ago

rnyak commented 1 year ago

Bug description

When we train an XLNet model with CLM masking, the model reports its own evaluation metrics (ndcg@k, recall@k, etc.) from the trainer.evaluate() step. If we then apply our own custom metric function in NumPy, like the one below, the metric values do not match; they do match if we use MLM masking instead.

import numpy as np


def recall(predicted_items: np.ndarray, real_items: np.ndarray) -> float:
    # predicted_items: (batch_size, top_k) matrix of predicted item ids
    # real_items: (batch_size,) vector of label item ids (0 = padding)
    bs, top_k = predicted_items.shape
    valid_rows = real_items != 0  # mask out padded rows

    # reshape predictions and labels so the top-k predicted
    # item-ids can be compared with the label id via broadcasting
    real_items = real_items.reshape(bs, 1, -1)
    predicted_items = predicted_items.reshape(bs, 1, top_k)

    num_relevant = real_items.shape[-1]
    predicted_correct_sum = (predicted_items == real_items).sum(-1)
    predicted_correct_sum = predicted_correct_sum[valid_rows]
    recall_per_row = predicted_correct_sum / num_relevant
    return float(np.mean(recall_per_row))

Steps/Code to reproduce bug

coming soon.

Expected behavior

Environment details

Additional context

rnyak commented 1 year ago

If I use the dev branch, I get much higher CLM accuracy metrics (~2.5x higher) than MLM on the end-to-end example with the yoochoose dataset. I think this is not expected.

SPP3000 commented 11 months ago

Has this bug already been fixed in some T4R version? I am currently experiencing similar discrepancies when evaluating NDCG and MRR metrics on my dataset. My question is: is it worth creating a reproducible example, or are you already working on it?

rnyak commented 11 months ago

@SPP3000 can you please provide more details about "I am currently experiencing similar discrepancies"?

What model are you using, and how do you evaluate? Are you using our fit_and_evaluate evaluation function?

SPP3000 commented 11 months ago

@rnyak I just opened a new bug report with all details here.

rnyak commented 11 months ago

@SPP3000 are you seeing the same issue with XLNet MLM? Did you test MLM?

dcy0577 commented 9 months ago

Hello, are there any updates regarding this issue? @rnyak @SPP3000