Factual consistency of a simple sentence pair is not giving expected result.

Hi, I tried this on a simple sentence pair, and the score does not seem to confirm the factual consistency of the premise and hypothesis. Here is the code and output:

```python
from scale_score.scorer import SCALEScorer

scorer = SCALEScorer(size='small', device='mps')
premise = ["My name is Ibrahim Lincoln"]
hypothesis = ["I am Ibrahim Lincoln"]
scorer.score(premise, [hypothesis])
```

```
100%|██████████| 1/1 [00:00<00:00, 2.03it/s]
[0.053020697087049484]
```

Is there something wrong with the code? Is there a possible explanation for the result above?

Hi! The model size changes how effective the technique is. For the best results with a zero-shot model, you should use either the large or xl size. If speed is a high priority, fine-tuning a Flan-T5-Base model on your dataset can also lead to good results. I will update the README to add this as a note :)

For example, with the sample you've provided here:

```python
from scale_score.scorer import SCALEScorer

scorer = SCALEScorer(size='small', device='cuda')
premise = ["My name is Ibrahim Lincoln"]
hypothesis = ["I am Ibrahim Lincoln"]
scorer.score(premise, [hypothesis])
```

```
[0.12147488445043564]
```

```python
from scale_score.scorer import SCALEScorer

scorer = SCALEScorer(size='base', device='cuda')
premise = ["My name is Ibrahim Lincoln"]
hypothesis = ["I am Ibrahim Lincoln"]
scorer.score(premise, [hypothesis])
```

```
[0.6126947402954102]
```

```python
from scale_score.scorer import SCALEScorer

scorer = SCALEScorer(size='large', device='cuda')
premise = ["My name is Ibrahim Lincoln"]
hypothesis = ["I am Ibrahim Lincoln"]
scorer.score(premise, [hypothesis])
```

```
[0.8836435675621033]
```

```python
from scale_score.scorer import SCALEScorer

# The original post repeated size='large' here; 'xl' is presumably what was intended.
scorer = SCALEScorer(size='xl', device='cuda')
premise = ["My name is Ibrahim Lincoln"]
hypothesis = ["I am Ibrahim Lincoln"]
scorer.score(premise, [hypothesis])
```

```
[0.8012929558753967]
```
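
If you want to run this comparison on your own data, the repeated snippets above can be folded into one loop. This is only a rough sketch: `make_scale_scorer` and `score_across_sizes` are hypothetical helper names, not part of scale-score, and it assumes `score()` returns a one-element list for a single premise, as in the outputs shown in this thread.

```python
from typing import Callable, Dict, List, Sequence


def make_scale_scorer(size: str, device: str = "cuda"):
    """Build a scorer the same way the snippets in this thread do."""
    # Imported lazily so the loop below can be tried without the package installed.
    from scale_score.scorer import SCALEScorer

    return SCALEScorer(size=size, device=device)


def score_across_sizes(
    make_scorer: Callable[[str], object],
    sizes: Sequence[str],
    premise: List[str],
    hypothesis: List[List[str]],
) -> Dict[str, float]:
    """Score the same premise/hypothesis pair once per model size."""
    results: Dict[str, float] = {}
    for size in sizes:
        scorer = make_scorer(size)
        # Assumption: one premise in, one-element list out (as in the thread).
        results[size] = scorer.score(premise, hypothesis)[0]
    return results


# Example call (each size loads its own model, so this can be slow):
# scores = score_across_sizes(
#     make_scale_scorer,
#     ["small", "base", "large"],
#     ["My name is Ibrahim Lincoln"],
#     [["I am Ibrahim Lincoln"]],
# )
# for size, score in scores.items():
#     print(f"{size}: {score:.3f}")
```

Passing the scorer constructor in as an argument keeps the loop independent of the device you use; for example, `lambda s: make_scale_scorer(s, device="mps")` would match the setup in the original question.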
