GEM-benchmark / GEM-metrics

Automatic metrics for GEM tasks
https://gem-benchmark.com
MIT License

Heavy metrics are not calculated in single-file mode despite using the heavy-metrics option #76

Closed madaan closed 2 years ago

madaan commented 2 years ago

Issue

The --heavy-metrics option has no effect in single-file mode.

Steps to reproduce:

The bug can be reproduced using test data with the following command:

python run_metrics.py -r test_data/single_dataset.refs.json test_data/single_dataset.outs.json --heavy_metrics

Current behavior

The heavy metrics are not computed. The output is:

[I 220112 23:12:18 texts:32] Loading predictions for None
[I 220112 23:12:18 texts:32] Loading references for test_data/single_dataset.refs.json
[I 220112 23:12:18 __init__:124] Computing MSTTR for None...
[I 220112 23:12:18 __init__:124] Computing NGramStats for None...
[I 220112 23:12:18 __init__:150] Computing CHRF for None...
[I 220112 23:12:18 __init__:150] Computing NIST for None...
[I 220112 23:12:18 __init__:150] Computing BLEU for None...
[I 220112 23:12:18 __init__:150] Computing LocalRecall for None...
[I 220112 23:12:18 __init__:150] Computing ROUGE for None...
{
    "predictions_file": null,
    "N": 2,
    "msttr-100": NaN,
    "msttr-100_nopunct": NaN,
    "total_length": 27,
    "mean_pred_length": 13.5,
    "std_pred_length": 0.5,
    "median_pred_length": 13.5,
    "min_pred_length": 13,
    "max_pred_length": 14,
    "distinct-1": 0.6296296296296297,
    "vocab_size-1": 17,
    "unique-1": 9,
    "entropy-1": 3.9582291686698787,
    "distinct-2": 0.8,
    "vocab_size-2": 20,
    "unique-2": 15,
    "entropy-2": 4.243856189774724,
    "cond_entropy-2": 0.22256268772664103,
    "distinct-3": 0.8695652173913043,
    "vocab_size-3": 20,
    "unique-3": 17,
    "entropy-3": 4.262692390839622,
    "cond_entropy-3": 0.010140548890983904,
    "total_length-nopunct": 24,
    "mean_pred_length-nopunct": 12.0,
    "std_pred_length-nopunct": 1.0,
    "median_pred_length-nopunct": 12.0,
    "min_pred_length-nopunct": 11,
    "max_pred_length-nopunct": 13,
    "distinct-1-nopunct": 0.6666666666666666,
    "vocab_size-1-nopunct": 16,
    "unique-1-nopunct": 9,
    "entropy-1-nopunct": 3.886842188131012,
    "distinct-2-nopunct": 0.8181818181818182,
    "vocab_size-2-nopunct": 18,
    "unique-2-nopunct": 14,
    "entropy-2-nopunct": 4.095795255000933,
    "cond_entropy-2-nopunct": 0.18150945892357132,
    "distinct-3-nopunct": 0.9,
    "vocab_size-3-nopunct": 18,
    "unique-3-nopunct": 16,
    "entropy-3-nopunct": 4.1219280948873624,
    "cond_entropy-3-nopunct": 0.012496476250064989,
    "references_file": "test_data/single_dataset.refs.json",
    "chrf": 69.41824174364662,
    "chrf+": 69.96223388406395,
    "chrf++": 66.1348715190809,
    "nist": 3.9977776020129183,
    "bleu": 50.12433,
    "local_recall": {
        "1": 0.0,
        "2": 0.2857142857142857,
        "3": 0.8888888888888888,
        "4": 1.0
    },
    "rouge1": {
        "precision": 0.76389,
        "recall": 0.7487,
        "fmeasure": 0.75211
    },
    "rouge2": {
        "precision": 0.51719,
        "recall": 0.45112,
        "fmeasure": 0.48053
    },
    "rougeL": {
        "precision": 0.69544,
        "recall": 0.71699,
        "fmeasure": 0.70229
    },
    "rougeLsum": {
        "precision": 0.69544,
        "recall": 0.71699,
        "fmeasure": 0.70229
    }
}

Expected behavior

All the heavy metrics should be computed in addition to the lightweight metrics.
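
A quick way to confirm whether the heavy metrics made it into the results is to check the output JSON for their keys. The snippet below is only a sketch: it assumes the heavy metrics would appear as top-level keys, and the key names `bertscore` and `bleurt` are assumptions for illustration, not names taken from this report. The helper `missing_heavy_metrics` is hypothetical and is not part of GEM-metrics.

```python
import json

# Assumed key names: this report does not list the keys the heavy metrics
# would add, so adjust this set to match your installation.
ASSUMED_HEAVY_KEYS = {"bertscore", "bleurt"}


def missing_heavy_metrics(scores_json, expected=ASSUMED_HEAVY_KEYS):
    """Return the expected heavy-metric keys that are absent from the scores."""
    # Python's json module accepts bare NaN values (as in the output above)
    # by default, so the scores can be parsed without preprocessing.
    scores = json.loads(scores_json)
    return sorted(set(expected) - set(scores))


# A shortened excerpt of the output shown in "Current behavior".
sample = '{"predictions_file": null, "N": 2, "msttr-100": NaN, "bleu": 50.12433}'
print(missing_heavy_metrics(sample))
```

An empty list would mean every assumed heavy-metric key is present. With the output from this report, all of the assumed keys come back as missing.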