IINemo / lm-polygraph

Get the uncertainty scores without rerunning the models (for NumSets, Deg, Ecc) #155

Closed caiqizh closed 6 months ago

caiqizh commented 9 months ago

Thank you for providing the code for previously generated text! It has been very helpful, and I've successfully used it for Lexical Similarity analysis. I'm planning to test it with other measures, including NumSets, Degree matrix (Deg), and Eccentricity (Ecc).

I noticed that these measurements require two additional statistics: semantic_matrix_entail and semantic_matrix_contra. According to the original paper, I know that these are calculated using DeBERTa over generated samples. I'm wondering if there are any short code snippets available to compute these matrices and feed them into the estimator function.

Thanks!

rvashurin commented 9 months ago

Hey @caiqizh! We're happy to hear that you've been able to make use of our library!

Currently, the only way to change how these matrices are calculated is to edit the code in this module directly. We plan to make the implemented methods more customizable in the near future, so this should become much simpler soon.

cant-access-rediska0123 commented 9 months ago

As a temporary solution, you can use the following snippet, which uses SemanticMatrixCalculator to calculate these additional statistics. We hope to improve the code for your use case in the near future.

from lm_polygraph.estimators import Eccentricity
from lm_polygraph.stat_calculators import SemanticMatrixCalculator

# Put 5-10 text samples generated by ChatGPT here
samples = [
    'The capital of France is Paris.',
    'Paris is the capital city of France.',
    'The capital of France is Paris.',
    'In France, the capital is Paris.',
    'The capital city of France is Paris.',
]

stats = {
    'blackbox_sample_texts': [samples],
    'deberta_batch_size': 10,
}

nli_calculator = SemanticMatrixCalculator()
stats.update(nli_calculator(stats, None, None, None))
# Now stats should contain 'semantic_matrix_entail' and 'semantic_matrix_contra'

estimator = Eccentricity()
uncertainty = estimator(stats)[0]
print(uncertainty)  # 7.886122580038351e-05
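
For intuition about what an estimator does with 'semantic_matrix_entail', here is a minimal pure-Python sketch of the degree-matrix (Deg) score as described in the original paper: symmetrize the matrix, take row sums as degrees, and report trace(m*I - D) / m^2. The function name degree_uncertainty is hypothetical and this is not the library's implementation; it assumes a single m x m matrix of pairwise entailment probabilities for one prompt.

```python
def degree_uncertainty(entail):
    """Deg-style score from one square matrix of pairwise entailment probabilities.

    Hypothetical helper for illustration only; not part of lm_polygraph.
    """
    m = len(entail)
    # Symmetrize so that similarity does not depend on the direction of entailment.
    w = [[(entail[i][j] + entail[j][i]) / 2 for j in range(m)] for i in range(m)]
    # Degree of a sample = its total similarity to all samples (row sum of W).
    degrees = [sum(row) for row in w]
    # trace(m*I - D) / m^2: 0 when all samples agree, close to 1 when none do.
    return sum(m - d for d in degrees) / (m * m)


# All samples entail each other: no uncertainty.
agree = [[1.0] * 3 for _ in range(3)]
# Each sample entails only itself: high uncertainty.
disagree = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

print(degree_uncertainty(agree))     # 0.0
print(degree_uncertainty(disagree))  # 0.75
```

With identical samples the score is 0, and it grows toward 1 as fewer samples entail one another, which is why precomputed matrices are enough to get a score without rerunning the generation model.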

caiqizh commented 12 hours ago

Thank you for the previous answer! It looks like this no longer works in the latest version. Could you please reopen this issue? Thanks!