Hi @feralvam, thank you for your interest in our repo.

The implementation in the `score` function is our standard approach to rescaling, and it is explained in this blogpost.
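For reference, that standard rescaling is just a linear map that sends the empirical baseline to 0 and a perfect score to 1. A minimal sketch (the function name is ours, not part of the library):

```python
def rescale(score: float, baseline: float) -> float:
    """Map scores so the empirical baseline lands at 0 and 1 stays at 1."""
    return (score - baseline) / (1.0 - baseline)

# e.g. an unrescaled F1 of 0.92 with a baseline of 0.85 becomes ~0.47
print(rescale(0.92, 0.85))
```

Each of P, R, and F1 is rescaled with its own baseline value.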
The rescaling in `plot_example` is different. We seek to visualize the similarities to give an intuitive picture that is close to the final output (P, R, F1). However, this is difficult because (1) the baselines are not computed over the raw similarities, and (2) P, R, F1 are computed from the maximum over each row/column of the similarity matrix. Eventually, we decided to use the F1 baseline score for the rescaling. This choice is a less principled compromise.
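Concretely, the visualization path applies the same linear map, but elementwise to the similarity matrix and with the F1 baseline only. A minimal sketch (the matrix values and the baseline are made up for illustration):

```python
import numpy as np

# Hypothetical token-level cosine similarities
# (rows: reference tokens, columns: candidate tokens).
sim = np.array([[0.91, 0.55],
                [0.60, 0.88]])
f1_baseline = 0.85  # assumed baseline value

# Elementwise rescaling used for the heatmap colors.
sim_rescaled = (sim - f1_baseline) / (1 - f1_baseline)
print(sim_rescaled)
```

Note that rescaled similarities for dissimilar token pairs can fall well below 0.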
Hope my reply helps.
Hi, sorry to bring back this topic. I was curious about something: would it make sense to rescale the similarities instead of the (P, R, F1) scores? In the blogpost, you state: "Let BASE be a lower bound for BERTScores that we typically observe in practice." It should be possible to compute a lower bound for the similarities too, no? I was wondering if you had already explored this idea and decided that it was too problematic (for some reason), so that it was better to rescale the actual (P, R, F1) scores instead. Thanks for any insight!
Hi @feralvam, no worries, I am happy to discuss this more.
Yeah, one concern I had is that if we rescaled the similarities using this approach, then the colors (and the rescaled similarities) in the visualization wouldn't correlate well with the evaluation output (P, R, F).
I am not sure if I made myself clear: after applying the current rescaling (i.e., using the F1 baseline), the average of the highest similarities in each row and column roughly matches the magnitude of the output F1.
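To make that concrete, here is a toy check (the matrix values and baseline are assumptions for illustration). Because the rescaling is an increasing linear map, the row/column maxima of the rescaled matrix are exactly the rescaled maxima, so their average lands close to the rescaled F1 whenever P and R are close:

```python
import numpy as np

sim = np.array([[0.91, 0.55],
                [0.60, 0.88]])
b = 0.85  # assumed F1 baseline

recall = sim.max(axis=1).mean()     # best match per reference token
precision = sim.max(axis=0).mean()  # best match per candidate token
f1 = 2 * precision * recall / (precision + recall)
f1_rescaled = (f1 - b) / (1 - b)

sim_rescaled = (sim - b) / (1 - b)
avg_maxima = (sim_rescaled.max(axis=1).mean()
              + sim_rescaled.max(axis=0).mean()) / 2

print(f1_rescaled, avg_maxima)  # here both come out to 0.30
```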
Again, all of this is for visualization purposes. If in your application the similarities themselves are actually important, feel free to fork and implement your own changes.
Thanks for the quick reply. Yes, for my application the actual similarity scores are important, and I was thinking of ways to rescale them. That's why I asked before trying something, in case you had already explored this idea. Thanks!
One final question: which monolingual corpora did you use to compute the baseline scores for each language? I'm currently interested in the one for English. Thanks.
Hi @feralvam, I just sampled 1 million random pairs of sentences from the Common Crawl corpus. The average score stabilizes pretty fast, so I think you can just create your own random corpus.
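If you want to reproduce this, here is a minimal sketch using the public `bert_score.score` API (the corpus file and sample size are assumptions, not the exact procedure behind the released baselines):

```python
import random
from bert_score import score

# Load a pool of sentences from any large monolingual corpus
# (path and format are placeholders).
with open("my_corpus.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

# Pair up randomly chosen, unrelated sentences.
random.seed(0)
n_pairs = 10_000  # far fewer than 1M; the averages stabilize quickly
cands = random.choices(sentences, k=n_pairs)
refs = random.choices(sentences, k=n_pairs)

# Averages of the unrescaled scores over random pairs
# give the (P, R, F1) baselines.
P, R, F1 = score(cands, refs, lang="en", verbose=True)
print(P.mean().item(), R.mean().item(), F1.mean().item())
```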
Hello, I hope you can help me understand some part of the rescaling logic.

In the `score` method, rescaling is done using: https://github.com/Tiiiger/bert_score/blob/3a974d46e484f892fd88ad93eff85f8a6b731ac8/bert_score/scorer.py#L215-L216

This makes sense, since `all_preds` contains the P, R, F scores per row, and this is the information in `self.baseline_vals`.

However, in the `plot_example` method, the rescaling is done using: https://github.com/Tiiiger/bert_score/blob/3a974d46e484f892fd88ad93eff85f8a6b731ac8/bert_score/scorer.py#L259-L260

In this case the rescaling is done over the similarities, using the F values in `self.baseline_vals[2]` (if I'm understanding correctly). Why is it done this way here? Why are the F scores good rescaling values for the "raw" similarity scores?

I understand that rescaling is merely performed to make the scores more interpretable, since P, R, F are calculated before rescaling. However, I was curious about this difference between the implementations. Thank you in advance for your help.
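For readers without the links handy, the two lines being compared amount to something like the following (paraphrased with standalone tensors and assumed shapes; see the links for the exact code at that commit):

```python
import torch

# Assumed shapes for illustration: all_preds is (N, 3) with columns (P, R, F1);
# baseline_vals holds the baseline (P, R, F1) values; sim is one example's
# token-level similarity matrix.
all_preds = torch.tensor([[0.92, 0.90, 0.91]])
baseline_vals = torch.tensor([0.860, 0.850, 0.855])
sim = torch.rand(5, 7)

# score-style rescaling: each metric uses its own baseline.
all_preds = (all_preds - baseline_vals) / (1 - baseline_vals)

# plot_example-style rescaling: the whole matrix uses the F1 baseline only.
f1_base = baseline_vals[2].item()
sim = (sim - f1_base) / (1 - f1_base)
```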