Add support for ablation evaluation

rbiswasfc commented 1 week ago

This PR adds a script and configs to run ablation evaluations on 4 datasets (MNLI, WiC, BoolQ, EurLEX). BoolQ and WiC fine-tuning starts from the checkpoint obtained from the MNLI training run.
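The checkpoint dependency described above implies an ordering constraint: MNLI has to finish before BoolQ and WiC can start. The sketch below only illustrates that constraint and is not taken from ablation_eval.py. The job names and the `DEPENDS_ON` mapping are hypothetical, and it assumes MNLI and EurLEX have no upstream dependency.

```python
from graphlib import TopologicalSorter

# Hypothetical job graph: each key maps to the set of jobs it depends on.
# BoolQ and WiC start from the MNLI checkpoint; MNLI and EurLEX are
# assumed here to have no dependency of their own.
DEPENDS_ON = {
    "mnli": set(),
    "eurlex": set(),
    "boolq": {"mnli"},
    "wic": {"mnli"},
}


def run_order(depends_on):
    """Return one valid execution order that respects the dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


order = run_order(DEPENDS_ON)
print(order)
```

Any order in which "mnli" precedes both "boolq" and "wic" is valid; the position of "eurlex" is unconstrained under these assumptions.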

Eval for flex bert can be run using python ablation_eval.py yamls/ablations/flex-bert-ablation-eval.yaml, which gives the following scores:

-----------------------------------------------------------------------------------------------------------------
Job                                               | Dataset                  | Metric                     | Score
-----------------------------------------------------------------------------------------------------------------
MNLI(seed=23)                                     | glue_mnli_mismatched     | MulticlassAccuracy         | 84.53
EURLEX(seed=23)                                   | long_context_eurlex      | EurlexMultilabelF1Score    | 71.85
BOOLQ(seed=23)                                    | superglue_boolq          | MulticlassAccuracy         | 78.35
WIC(seed=23)                                      | superglue_wic            | MulticlassAccuracy         | 65.99
-----------------------------------------------------------------------------------------------------------------
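For comparing runs, the printed table can be parsed back into a dictionary. This is a minimal sketch that assumes the output keeps the pipe-separated layout shown above; the `parse_scores` helper is hypothetical and not part of the PR.

```python
# Score rows copied from the table above.
TABLE = """\
MNLI(seed=23)                                     | glue_mnli_mismatched     | MulticlassAccuracy         | 84.53
EURLEX(seed=23)                                   | long_context_eurlex      | EurlexMultilabelF1Score    | 71.85
BOOLQ(seed=23)                                    | superglue_boolq          | MulticlassAccuracy         | 78.35
WIC(seed=23)                                      | superglue_wic            | MulticlassAccuracy         | 65.99
"""


def parse_scores(table: str) -> dict[str, float]:
    """Map dataset name to score, skipping rule and header lines."""
    scores = {}
    for line in table.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) != 4:
            continue  # separator rules have no pipe-delimited cells
        _job, dataset, _metric, score = cells
        try:
            scores[dataset] = float(score)
        except ValueError:
            continue  # header row: last cell is not a number
    return scores


scores = parse_scores(TABLE)
# List datasets from lowest to highest score.
for dataset in sorted(scores, key=scores.get):
    print(f"{dataset}: {scores[dataset]:.2f}")
```

The metrics differ across rows (accuracy versus multilabel F1), so the scores are listed per dataset rather than averaged.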

rbiswasfc commented 4 days ago

This looks good. Can you add a smoketest for ablation_eval.py like we have for glue.py? Then it will be ready to merge.

Thanks for the suggestion, just added the smoketest!

NohTow commented 4 days ago

LGTM. As raised, the results are still a bit low, but this seems OK: the low scores do not come from this PR, as they were already the case in my prior tests.

AnswerDotAI / bert24

Add support for ablation evaluation #75