Open rnyak opened 1 year ago
If I use the dev branch, I get much higher CLM accuracy metrics (~2.5x higher) than MLM in the end-to-end example with the yoochoose dataset. I don't think this is expected.
Is this bug already fixed in some T4R version? I am currently experiencing similar discrepancies when evaluating NDCG and MRR metrics on my dataset. My question is: is it worth creating a reproducible example, or are you already working on it?
@SPP3000 can you please provide more details about "I am currently experiencing similar discrepancies"?
What model are you using, and how do you evaluate? Are you using our `fit_and_evaluate` evaluation function?
@SPP3000 are you seeing the same issue with XLNet MLM? Did you test MLM?
Hello, are there any updates regarding this issue? @rnyak @SPP3000
Bug description
When we train an XLNet model with CLM masking, the model prints out its own evaluation metrics (ndcg@k, recall@k, etc.) from the `trainer.evaluate()` step. If we apply our own custom metric function using numpy (something like below), the metric values do not match; they do match if we use MLM masking instead.

Steps/Code to reproduce bug
coming soon.
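The custom numpy snippet referenced above ("something like below") did not make it into the report. Until a reproducer is posted, here is a minimal sketch of the kind of manual check being described, computing recall@k and binary-relevance NDCG@k from raw next-item logits and comparing them against what `trainer.evaluate()` prints. The array names, shapes, and the assumption that labels are single next-item ids are mine, not the reporter's:

```python
import numpy as np

def recall_at_k(scores, labels, k=10):
    """Fraction of rows whose true next-item id appears in the top-k scores.

    scores: (n_rows, n_items) array of model logits for the next item
    labels: (n_rows,) array of ground-truth next-item ids
    """
    topk = np.argsort(-scores, axis=1)[:, :k]
    return float(np.mean([labels[i] in topk[i] for i in range(len(labels))]))

def ndcg_at_k(scores, labels, k=10):
    """Binary-relevance NDCG@k: gain is 1/log2(rank + 2) if the true id
    is ranked inside the top-k, else 0; averaged over rows."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    gains = []
    for i in range(len(labels)):
        hits = np.where(topk[i] == labels[i])[0]
        gains.append(1.0 / np.log2(hits[0] + 2) if hits.size else 0.0)
    return float(np.mean(gains))
```

With CLM masking these hand-computed values reportedly diverge from the metrics `trainer.evaluate()` prints, while with MLM masking they agree, which is the discrepancy this issue is about.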
Expected behavior
Environment details
Additional context