Hi @feralvam, thank you for your interest in our repo.

The implementation in the `score` function is our standard approach to rescaling, and it is explained in this blogpost.
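For reference, that standard rescaling is just a linear map that sends the empirical baseline to 0 and a perfect score to 1. A minimal sketch (the function name is ours, not part of the library):

```python
def rescale(score: float, baseline: float) -> float:
    """Map scores so the empirical baseline lands at 0 and 1 stays at 1."""
    return (score - baseline) / (1.0 - baseline)

# e.g. an unrescaled F1 of 0.92 with a baseline of 0.85 becomes ~0.47
print(rescale(0.92, 0.85))
```

Each of P, R, and F1 is rescaled with its own baseline value.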
The rescaling in `plot_example` is different. We seek to visualize the similarities to give an intuitive picture that is close to the final output (P, R, F1). However, this is difficult because (1) the baselines are not computed over the raw similarities, and (2) P, R, F1 are computed from the maximum over each row/column of the similarity matrix. Eventually, we decided to use the F1 baseline score for the rescaling. This choice is a less principled compromise.
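Concretely, the visualization path applies the same linear map, but elementwise to the similarity matrix and with the F1 baseline only. A minimal sketch (the matrix values and the baseline are made up for illustration):

```python
import numpy as np

# Hypothetical token-level cosine similarities
# (rows: reference tokens, columns: candidate tokens).
sim = np.array([[0.91, 0.55],
                [0.60, 0.88]])
f1_baseline = 0.85  # assumed baseline value

# Elementwise rescaling used for the heatmap colors.
sim_rescaled = (sim - f1_baseline) / (1 - f1_baseline)
print(sim_rescaled)
```

Note that rescaled similarities for dissimilar token pairs can fall well below 0.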
Hope my reply helps.
Hi, sorry to bring back this topic. I was curious about something: would it make sense to rescale the similarities instead of the (P, R, F1) scores? In the blogpost, you state: "Let BASE be a lower bound for BERTScores that we typically observe in practice." It should be possible to compute a lower bound for the similarities too, no? I was wondering if you had already explored this idea and decided that it was too problematic (for some reason), so that it was better to rescale the actual (P, R, F1) scores instead. Thanks for any insight!
Hi @feralvam, no worries, I am happy to discuss this more.
Yeah, one concern I had is that if we rescaled the similarities using this approach, then the colors (and the rescaled similarities) in the visualization wouldn't correlate well with the evaluation output (P, R, F).
I am not sure if I made myself clear: after applying the current rescaling (i.e., using the F1 baseline), the average of the highest similarities in each row and column roughly matches the magnitude of the output F1.
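To make that concrete, here is a toy check (the matrix values and baseline are assumptions for illustration). Because the rescaling is an increasing linear map, the row/column maxima of the rescaled matrix are exactly the rescaled maxima, so their average lands close to the rescaled F1 whenever P and R are close:

```python
import numpy as np

sim = np.array([[0.91, 0.55],
                [0.60, 0.88]])
b = 0.85  # assumed F1 baseline

recall = sim.max(axis=1).mean()     # best match per reference token
precision = sim.max(axis=0).mean()  # best match per candidate token
f1 = 2 * precision * recall / (precision + recall)
f1_rescaled = (f1 - b) / (1 - b)

sim_rescaled = (sim - b) / (1 - b)
avg_maxima = (sim_rescaled.max(axis=1).mean()
              + sim_rescaled.max(axis=0).mean()) / 2

print(f1_rescaled, avg_maxima)  # here both come out to 0.30
```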
Again, all of this is for visualization purposes. If in your application the similarities themselves are actually important, feel free to fork and implement your own changes.
Thanks for the quick reply. Yes, for my application the actual similarity scores are important, and I was thinking of ways to rescale them. That's why I asked before trying something, in case you had already explored this idea. Thanks!
One final question: which monolingual corpora did you use to compute the baseline scores for each language? I'm currently interested in the one for English. Thanks.
Hi @feralvam, I just sampled 1 million random pairs of sentences from the Common Crawl corpus. The average score stabilizes pretty fast, so I think you can just create your own random corpus.
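If you want to reproduce this, here is a minimal sketch using the public `bert_score.score` API (the corpus file and sample size are assumptions, not the exact procedure behind the released baselines):

```python
import random
from bert_score import score

# Load a pool of sentences from any large monolingual corpus
# (path and format are placeholders).
with open("my_corpus.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

# Pair up randomly chosen, unrelated sentences.
random.seed(0)
n_pairs = 10_000  # far fewer than 1M; the averages stabilize quickly
cands = random.choices(sentences, k=n_pairs)
refs = random.choices(sentences, k=n_pairs)

# Averages of the unrescaled scores over random pairs
# give the (P, R, F1) baselines.
P, R, F1 = score(cands, refs, lang="en", verbose=True)
print(P.mean().item(), R.mean().item(), F1.mean().item())
```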
Hello, I hope you can help me understand some part of the rescaling logic.

In the `score` method, rescaling is done using: https://github.com/Tiiiger/bert_score/blob/3a974d46e484f892fd88ad93eff85f8a6b731ac8/bert_score/scorer.py#L215-L216

This makes sense, since `all_preds` contains the P, R, F scores per row, and this is the information in `self.baseline_vals`.

However, in the `plot_example` method, the rescaling is done using: https://github.com/Tiiiger/bert_score/blob/3a974d46e484f892fd88ad93eff85f8a6b731ac8/bert_score/scorer.py#L259-L260

In this case the rescaling is done over the similarities, using the F values in `self.baseline_vals[2]` (if I'm understanding correctly). Why is it done this way here? Why are the F scores good rescaling values for the "raw" similarity scores?

I understand that rescaling is merely performed to make the scores more interpretable, since P, R, F are calculated before rescaling. However, I was curious about this difference between the implementations. Thank you in advance for your help.
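For readers without the links handy, the two lines being compared amount to something like the following (paraphrased with standalone tensors and assumed shapes; see the links for the exact code at that commit):

```python
import torch

# Assumed shapes for illustration: all_preds is (N, 3) with columns (P, R, F1);
# baseline_vals holds the baseline (P, R, F1) values; sim is one example's
# token-level similarity matrix.
all_preds = torch.tensor([[0.92, 0.90, 0.91]])
baseline_vals = torch.tensor([0.860, 0.850, 0.855])
sim = torch.rand(5, 7)

# score-style rescaling: each metric uses its own baseline.
all_preds = (all_preds - baseline_vals) / (1 - baseline_vals)

# plot_example-style rescaling: the whole matrix uses the F1 baseline only.
f1_base = baseline_vals[2].item()
sim = (sim - f1_base) / (1 - f1_base)
```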