
Evaluate base models with lm-evaluation-harness. #76

Closed: TheRootOf3 closed this issue 1 week ago

TheRootOf3 commented 3 months ago

Consider the following models:

~Changed to <=2B models due to the single-GPU memory requirement.~
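
For context, a minimal sketch of how a single 0-shot run could be launched through the lm-evaluation-harness Python API; the model identifier, dtype, and task list below are illustrative assumptions, not the exact configuration used for these runs:

```python
# Sketch only: assumes lm-evaluation-harness >= 0.4 and a single GPU.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=google/gemma-2b,dtype=bfloat16",  # example <=2B model
    tasks=["winogrande", "truthfulqa_mc2", "hellaswag",
           "arc_challenge", "mmlu", "logiqa", "gsm8k"],  # task names per v0.4 conventions
    num_fewshot=0,
    batch_size="auto",
)

# Per-task metrics live under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)
```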

TheRootOf3 commented 3 months ago

Evaluation results (all 0-shot):

| Model name | Winogrande (acc) | TruthfulQA (acc) | MMLU (acc) | LogiQA (acc) | HellaSwag (acc) | GSM8K (em) | French Bench (acc) | ARC-c (acc) |
|---|---|---|---|---|---|---|---|---|
| Aya23-8B | 0.631 | 0.377 | 0.503 | 0.252 | 0.560 | 0.422 | 0.446 | 0.416 |
| OLMo-7B | 0.664 | 0.301 | 0.279 | 0.233 | 0.557 | 0.046 | 0.368 | 0.369 |
| Gemma-7B | 0.727 | 0.378 | 0.612 | 0.284 | 0.604 | 0.512 | 0.515 | 0.494 |
| Llama2-7B | 0.693 | 0.321 | 0.413 | 0.257 | 0.571 | 0.136 | 0.409 | 0.434 |
| Llama3-8B | 0.729 | 0.355 | 0.620 | 0.276 | 0.601 | 0.501 | 0.434 | 0.504 |
| OPT-1.3B | 0.595 | 0.312 | 0.249 | n/a | 0.4154 | 0.008 | n/a | 0.234 |
TheRootOf3 commented 3 months ago

Each complete run takes ~10 h for 7/8B-parameter models.

TheRootOf3 commented 2 months ago

TL;DR

| Metric | microsoft__phi-1_5 | facebook__opt-1.3b | google__gemma-2b | allenai__OLMo-1B-hf |
|---|---|---|---|---|
| winogrande_acc | 0.729282 | 0.595896 | 0.651144 | 0.599842 |
| truthfulqa_mc2_acc | 0.408653 | 0.386806 | 0.330601 | 0.32943 |
| hellaswag_acc | 0.479685 | 0.414858 | 0.526987 | 0.469727 |
| gsm8k_exact_match_flexible | 0.319181 | 0.0166793 | 0.176649 | 0.0235027 |
| arc_challenge_acc | 0.446246 | 0.232935 | 0.40529 | 0.286689 |
| mmlu_acc | 0.407136 | 0.250677 | 0.328372 | 0.242344 |
| logiqa_acc | 0.239631 | 0.22427 | 0.236559 | 0.242704 |
| french_bench_acc | 0.299816 | 0.310159 | 0.397667 | 0.322455 |
| piqa_acc | 0.76605 | 0.718716 | 0.769859 | 0.750272 |
| squadv2 (f1) | 19.2253 | 12.2118 | 22.1386 | 15.6277 |
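
These comparison tables look like transposed pandas DataFrames; below is a rough sketch of how the per-model lm-eval output files could be collected into one. The `results/<model>/results.json` layout and the `"acc,none"` metric-key format are assumptions that depend on the harness version and how it was invoked:

```python
# Sketch only: aggregate lm-eval results.json files into a metric-vs-model table.
import json
from pathlib import Path

import pandas as pd

# Hypothetical layout: one results.json per model under results/<model_name>/
result_files = {p.parent.name: p for p in Path("results").glob("*/results.json")}

rows = {}
for model_name, path in result_files.items():
    raw = json.loads(path.read_text())["results"]
    # Newer harness versions key metrics as "<metric>,<filter>", e.g. "acc,none".
    rows[model_name] = {
        f"{task}_{metric.split(',')[0]}": value
        for task, metrics in raw.items()
        for metric, value in metrics.items()
        if not metric.startswith("alias")  # skip the non-numeric alias entry
    }

df = pd.DataFrame(rows)  # metrics as rows, models as columns, as in the tables above
print(df.round(3))
```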

Analysis:

Willmish commented 2 months ago
| Model | Task | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|---|
| Gemma 2B Base | mnli | 1 | none | 0 | acc | 0.3979 | ± 0.0049 |
| Phi 1.5 Base | mnli | 1 | none | 0 | acc | 0.514 | ± 0.005 |
| OPT 1.3B | mnli | 1 | none | 0 | acc | 0.3582 | ± 0.0048 |

@Davidyz

TheRootOf3 commented 2 months ago

Update as of 2024-07-01:

TheRootOf3 commented 1 month ago

Update as of 2024-07-10:

Evaluation results (all 0-shot):

| Model name | Winogrande (acc) | TruthfulQA MC2 (acc) | MMLU (acc) | LogiQA (acc) | HellaSwag (acc) | GSM8K (em) | French Bench (acc) | ARC-c (acc) | MNLI (acc) | PIQA (acc) | SQuADv2 (f1) | ToxiGen (acc) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemma-7B | 0.727 | 0.378 | 0.612 | 0.284 | 0.604 | 0.512 | 0.515 | 0.494 | n/a | n/a | n/a | n/a |
| Llama2-7B | 0.693 | 0.321 | 0.413 | 0.257 | 0.571 | 0.136 | 0.409 | 0.434 | n/a | n/a | n/a | n/a |
| Llama3-8B | 0.732 | 0.439 | 0.620 | 0.273 | 0.602 | 0.506 | 0.435 | 0.498 | 0.478 | 0.797 | 32.39 | 0.430 |
| Gemma2-9B | n/a | n/a | n/a | n/a | | | | | | | | |
Willmish commented 1 month ago

Unlearned Llama3-8B, sequential 16; batch 1024; full precision:

winogrande_acc truthfulqa_mc2_acc hellaswag_acc arc_challenge_acc mmlu_acc french_bench_acc piqa_acc squadv2 mnli
20 0.7119179163378059 0.5256920932085242 0.34812286689419797 0.5072639225181598 0.3319351873132613 0.5924918389553863 26.296920409865134 0.356698930208864

4 splits, sample count 1024:

winogrande_acc truthfulqa_mc2_acc hellaswag_acc arc_challenge_acc mmlu_acc french_bench_acc piqa_acc squadv2 mnli
20 0.6985003946329913 0.5198167695678152 0.40784982935153585 0.5401652186298248 0.28223396920248217 0.6528835690968444 29.024508296607614 0.4053998981151299
TheRootOf3 commented 1 month ago

Update as of 2024-07-25:

| Metric | llama3.1-8b | llama3-8b |
|---|---|---|
| beaverdam_flagged | - | 0.334286 |
| winogrande_acc | 0.734807 | 0.732439 |
| truthfulqa_mc2_acc | 0.451701 | 0.43907 |
| hellaswag_acc | 0.600378 | 0.602171 |
| gsm8k_exact_match_flexible | 0.492039 | 0.506444 |
| arc_challenge_acc | 0.511092 | 0.498294 |
| mmlu_acc | 0.62954 | 0.619926 |
| french_bench_acc | 0.43875 | 0.434728 |
| piqa_acc | 0.801415 | 0.796518 |
| squadv2_f1 | 33.0341 | 32.3923 |
| mnli_acc | 0.498319 | 0.478349 |
| toxigen_acc | 0.426596 | 0.429787 |
Willmish commented 1 month ago

PASTING HERE BEFORE WE FIND A BETTER PLACE FOR IT: @TheRootOf3 @Adamliu1 @Davidyz

Below are the results for the final checkpoints after 10240 steps (batch size 2) of continuous unlearning WITH SCHEDULER. Unlearn set: PKU-Harmful; retain set: SQuAD.

TL;DR: the only metrics that really took a hit are safety (beaverdam_flagged), which got worse (more flagged) for lr 5e-7 (0.40) and 1e-6 (0.43) but better for lr 5e-6 (0.14), and GSM8K, where the drop is visible for most learning rates and largest at 5e-6 (0.288!). A sketch for computing these per-metric deltas against the base model follows the table below.

| Metric / Learning rate | 1e-8 | 5e-8 | 1e-7 | 5e-7 | 1e-6 | 5e-6 |
|---|---|---|---|---|---|---|
| beaverdam_flagged | 0.3271428571428571 | 0.33285714285714285 | 0.32 | 0.4014285714285714 | 0.43142857142857144 | 0.14285714285714285 |
| winogrande_acc | 0.7316495659037096 | 0.728492501973165 | 0.7308602999210734 | 0.7316495659037096 | 0.7387529597474349 | 0.739542225730071 |
| truthfulqa_mc2_acc | 0.4288239712886645 | 0.4190040177748152 | 0.4145185982379925 | 0.41006442563278356 | 0.396623651814244 | 0.4076632565867259 |
| hellaswag_acc | 0.6002788289185421 | 0.5970922127066322 | 0.5938060147381 | 0.5994821748655647 | 0.5979884485162318 | 0.5846444931288588 |
| gsm8k_exact_match_flexible | 0.5087187263078089 | 0.5011372251705838 | 0.489764973464746 | 0.4715693707354056 | 0.4670204700530705 | 0.2880970432145565 |
| arc_challenge_acc | 0.5017064846416383 | 0.4872013651877133 | 0.48464163822525597 | 0.5042662116040956 | 0.5093856655290102 | 0.48976109215017066 |
| mmlu_acc | 0.6183592080900157 | 0.619427431989745 | 0.6213502350092579 | 0.6187864976499075 | 0.6177182737501781 | 0.5820395954992167 |
| french_bench_acc | 0.4333486554814985 | 0.43191220409101355 | 0.43024592047805105 | 0.4334061135371179 | 0.43185474603539414 | 0.39847161572052403 |
| piqa_acc | 0.7927094668117519 | 0.7899891186071817 | 0.7916213275299239 | 0.7927094668117519 | 0.7976060935799782 | 0.7872687704026116 |
| squadv2_f1 | 32.2717276192614 | 33.18570231581052 | 32.55705207756308 | 32.510930521021656 | 30.05643833456453 | 31.14084319890086 |
| mnli_acc | 0.4794701986754967 | 0.4813041263372389 | 0.4793683138053999 | 0.5069791136016302 | 0.5008660213958227 | 0.5046357615894039 |
| toxigen_acc | 0.4276595744680851 | 0.42446808510638295 | 0.42340425531914894 | 0.42872340425531913 | 0.4297872340425532 | 0.425531914893617 |
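
To make the "what took a hit" comparison explicit, here is a small sketch that subtracts the base Llama3-8B numbers (from the 2024-07-25 table above) from two of the learning-rate columns. The values are copied (rounded) from this thread; beaverdam_flagged is left out because lower is better for that metric:

```python
# Sketch only: per-metric change of each unlearned checkpoint vs. the base model.
import pandas as pd

# Base Llama3-8B scores, rounded from the 2024-07-25 table above.
base = pd.Series({
    "winogrande_acc": 0.732, "truthfulqa_mc2_acc": 0.439, "hellaswag_acc": 0.602,
    "gsm8k_exact_match_flexible": 0.506, "arc_challenge_acc": 0.498,
    "mmlu_acc": 0.620, "french_bench_acc": 0.435, "piqa_acc": 0.797,
    "mnli_acc": 0.478, "toxigen_acc": 0.430,
})

# Two checkpoints from the learning-rate sweep above, rounded to 3 decimals.
sweep = pd.DataFrame({
    "5e-7": {"winogrande_acc": 0.732, "truthfulqa_mc2_acc": 0.410, "hellaswag_acc": 0.599,
             "gsm8k_exact_match_flexible": 0.472, "arc_challenge_acc": 0.504,
             "mmlu_acc": 0.619, "french_bench_acc": 0.433, "piqa_acc": 0.793,
             "mnli_acc": 0.507, "toxigen_acc": 0.429},
    "5e-6": {"winogrande_acc": 0.740, "truthfulqa_mc2_acc": 0.408, "hellaswag_acc": 0.585,
             "gsm8k_exact_match_flexible": 0.288, "arc_challenge_acc": 0.490,
             "mmlu_acc": 0.582, "french_bench_acc": 0.398, "piqa_acc": 0.787,
             "mnli_acc": 0.505, "toxigen_acc": 0.426},
})

# Positive = improvement over base, negative = regression after unlearning.
delta = sweep.sub(base, axis=0)
print(delta.round(3))
```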
Willmish commented 1 month ago

Also in PNG format: output (plot of the results above attached).

Willmish commented 1 week ago

done (but these are not all base models lol)