PrimerAI / blanc

Human-free quality estimation of document summaries
MIT License

Estime ZeroDivisionError #50

Closed UntotaufUrlaub closed 1 year ago

UntotaufUrlaub commented 1 year ago
from blanc import Estime

Estime(output=['alarms', 'soft', 'coherence'], device="cuda").evaluate_claims(
    """23 March 2017 Last updated at 14:55 GMT""",
    ["""northern ireland secretary martin mcguinness says he's " shocked " by the outcome of the assembly election."""])

leads to

File ".../python3.11/site-packages/blanc/estime.py", line 312, in _evaluate
    cos_raw_avg /= len(embs_summ)
ZeroDivisionError: division by zero

(The doc and summary pair are taken from the aggrefact benchmark as an unfaithful pair.)
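The failing division can be illustrated with a quick token-overlap check (a toy sketch using whitespace tokenization, not blanc's actual BERT tokenizer): the document and summary share no tokens at all, so the set of overlap embeddings is empty and the average in `estime.py` divides by zero.

```python
# Toy illustration: the doc/summary pair above has zero token overlap,
# so len(embs_summ) in blanc's averaging step ends up being 0.
doc = "23 March 2017 Last updated at 14:55 GMT"
summ = ('northern ireland secretary martin mcguinness says '
        'he\'s " shocked " by the outcome of the assembly election.')

doc_tokens = set(doc.lower().split())
summ_tokens = set(summ.lower().split())
overlap = doc_tokens & summ_tokens  # empty set -> division by zero downstream

print(len(overlap))
```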

OlegVasilyev4096 commented 1 year ago

Thanks, will try to fix this today (indeed, this bug would affect extremely unfaithful pairs).

UntotaufUrlaub commented 1 year ago

Thank you very much! :)

OlegVasilyev4096 commented 1 year ago

The update to blanc 0.3.2 should fix this issue. Now, at initialization of Estime, the default include_all_tokens=False is changed to True if 'soft' or 'coherence' is in the 'output' list.

The original 'hard' ESTIME ('alarms') should be obtained with include_all_tokens=False (the default); otherwise it would count all summary tokens nonexistent in the text as 'alarms'. (This makes no essential difference for usual summaries, which rarely use such tokens, but it would make a big difference for summaries using many different words.) By the way, the original 'hard' ESTIME is useless for examples like yours: there is not a single common token between summary and text, hence the resulting score is zero.
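The 0.3.2 behavior described above could be sketched as follows (a hypothetical illustration with made-up names, not blanc's actual internals):

```python
# Sketch of the flag resolution described in the comment above:
# 'soft' and 'coherence' need embeddings for every summary token,
# so include_all_tokens is forced to True when either is requested.
def resolve_include_all_tokens(output, include_all_tokens=False):
    if 'soft' in output or 'coherence' in output:
        return True
    return include_all_tokens

print(resolve_include_all_tokens(['alarms']))                       # False
print(resolve_include_all_tokens(['alarms', 'soft', 'coherence']))  # True
```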

ESTIME-soft (and coherence) should be obtained with include_all_tokens=True - see eq. (2) in https://arxiv.org/pdf/2112.11638.pdf (hard ESTIME is eq. (1)). So self.include_all_tokens is now changed to True if 'soft' or 'coherence' is being calculated. To calculate the original 'alarms' properly, it is necessary to run 'alarms' separately.
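The practical difference between the two settings is which summary tokens get scored at all. A toy illustration (only of the counting base, not of the actual ESTIME math from the paper):

```python
# With include_all_tokens=False only summary tokens that also occur in
# the text are scored; with True every summary token is scored.
def scored_tokens(summ_tokens, text_tokens, include_all_tokens):
    if include_all_tokens:
        return list(summ_tokens)
    text_set = set(text_tokens)
    return [t for t in summ_tokens if t in text_set]

text = "the cat sat on the mat".split()
summ = "the dog sat".split()
print(scored_tokens(summ, text, include_all_tokens=False))  # ['the', 'sat']
print(scored_tokens(summ, text, include_all_tokens=True))   # ['the', 'dog', 'sat']
```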

UntotaufUrlaub commented 1 year ago

Thank you very much for the fix and explanation! Maybe this should be added to the Readme. Does the include_all_tokens parameter even make sense as a parameter, then? Your explanation sounds like each output type has a fixed value for it anyway (or is the parameter only intended to switch the calculation in the 'alarms' case?). If 'soft' and 'coherence' should always be calculated separately from 'alarms', maybe the metric should raise a warning or an error if the user tries to calculate them together. What are your thoughts on this?
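The guard suggested here could look something like this (a hypothetical sketch, not part of blanc's API):

```python
# Hypothetical check: warn when 'alarms' is mixed with measures that
# force include_all_tokens=True, since that changes what 'alarms' means.
import warnings

def check_output_combination(output):
    if 'alarms' in output and ('soft' in output or 'coherence' in output):
        warnings.warn(
            "'alarms' is normally computed with include_all_tokens=False, "
            "but 'soft'/'coherence' force it to True; run 'alarms' separately "
            "to reproduce the original ESTIME."
        )

check_output_combination(['alarms', 'soft', 'coherence'])  # warns
check_output_combination(['alarms'])                       # silent
```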

OlegVasilyev4096 commented 1 year ago

Thanks, all good questions. True, both versions of include_all_tokens could be of interest for 'alarms' (though normal use is include_all_tokens=False). It may be nice to still allow calculating all the options at once, because the calculation of embeddings is the heavier part. But I'd have to adjust the evaluation logic to account (where necessary) both through all tokens and through overlap tokens. Let me first consider issue 52 (coherence) and then get to this rearrangement (and update the readme).

UntotaufUrlaub commented 1 year ago

I agree that

both versions of include_all_tokens could be of interest for 'alarms'

and

It maybe nice to still allow calculating all the options at once

I am not sure whether I understand what you propose to solve the issue. Do you want to change the method in such a way that include_all_tokens only has an effect for ESTIME-alarms, but everything can be calculated at once, meaning the internal calculations diverge in the case of include_all_tokens=False?

If the calculation of the embeddings is the heavier part, it might also be useful to be able to calculate both 'alarms' cases at once. I guess that is only useful in some edge cases, but since I want to evaluate all versions and compare them on a benchmark, it would come in pretty handy (for example, type could be changed to alarms-all, alarms-notall, soft, coherence, or setting include_all_tokens=None could trigger both options). I totally understand if that suggestion is too specialized.
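The interface proposed here could be sketched like this (purely hypothetical names, not blanc's actual API):

```python
# Hypothetical expansion: include_all_tokens=None requests both
# variants of 'alarms' in a single pass over the embeddings.
def expand_requested_measures(output, include_all_tokens):
    measures = []
    for name in output:
        if name == 'alarms' and include_all_tokens is None:
            measures += ['alarms-all', 'alarms-notall']
        else:
            measures.append(name)
    return measures

print(expand_requested_measures(['alarms', 'soft'], None))
print(expand_requested_measures(['alarms', 'soft'], False))
```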

Thank you very much for all your effort!

OlegVasilyev4096 commented 1 year ago

Thanks for clarifying - agree, that is what I intend: to diverge the accounting for all specified cases after the embeddings for all tokens are calculated. I will try to add this soon, at the latest this weekend.

OlegVasilyev4096 commented 1 year ago

Updated in version 0.3.3. See the description at https://github.com/PrimerAI/blanc/tree/master/estime and the comments in Estime(). The measure 'alarms_adjusted' may be a little better than the original 'alarms', though the difference for usual machine-generated summaries should be small. Of course, 'soft' should work more evenly across a much wider range of summary quality. If there are no more problems directly related to the original issue, I will close it after a while. Thanks for using and testing; please open new issues whenever needed.

UntotaufUrlaub commented 1 year ago

Thanks for the update! The new method interface looks pretty nice. The 'alarms_adjusted' option is a new addition that was not provided by the old functionality, isn't it?

OlegVasilyev4096 commented 1 year ago

Yes, it is new (and thanks to your issue). It has a very simple sense: the simple assumption that the non-overlap tokens (hence not tested by 'alarms') have the same probability of error per token as the overlap tokens. I'm curious how ESTIME ('alarms'), ESTIME-adjusted ('alarms_adjusted') and ESTIME-soft ('soft') compare.
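One reading of that assumption is a simple extrapolation of the per-token alarm rate from overlap tokens to all summary tokens (a sketch of the idea only; blanc's actual formula may differ):

```python
# Extrapolate the alarm rate observed on overlap tokens to the full
# summary, assuming the same per-token error probability everywhere.
def alarms_adjusted_sketch(n_alarms, n_overlap_tokens, n_total_tokens):
    if n_overlap_tokens == 0:
        return None  # nothing to extrapolate from (the original bug's case)
    rate = n_alarms / n_overlap_tokens
    return rate * n_total_tokens

print(alarms_adjusted_sketch(2, 8, 10))  # 2.5
```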

UntotaufUrlaub commented 1 year ago

Thanks for the clarification.

You mean performance compared to human ratings, e.g. accuracy or correlation?

OlegVasilyev4096 commented 1 year ago

Yes, my guess is that the performance of 'adjusted' would be slightly better than 'alarms' for summaries from typical generation models that mostly use the same words as the texts. In a wider range of quality or style of generation, I'm not sure. But 'soft' would be better than both across a wider range; again, just my guess.

UntotaufUrlaub commented 1 year ago

I am currently working on some evaluations. If I have time to include the new version, I will post the results here.