
Evaluate base models with lm-evaluation-harness. #76

Closed: TheRootOf3 closed this issue 1 week ago

TheRootOf3 commented 3 months ago

Consider the following models:

~Changed to <=2B models due to the single-GPU memory requirement.~
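
For context, a minimal sketch of how a single 0-shot run could be launched through the lm-evaluation-harness Python API; the model identifier, dtype, and task list below are illustrative assumptions, not the exact configuration used for these runs:

```python
# Sketch only: assumes lm-evaluation-harness >= 0.4 and a single GPU.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=google/gemma-2b,dtype=bfloat16",  # example <=2B model
    tasks=["winogrande", "truthfulqa_mc2", "hellaswag",
           "arc_challenge", "mmlu", "logiqa", "gsm8k"],  # task names per v0.4 conventions
    num_fewshot=0,
    batch_size="auto",
)

# Per-task metrics live under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)
```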

TheRootOf3 commented 3 months ago

Evaluation results (all 0-shot):

| Model name | Winogrande (acc) | TruthfulQA (acc) | MMLU (acc) | LogiQA (acc) | HellaSwag (acc) | GSM8K (em) | French Bench (acc) | ARC-c (acc) |
|---|---|---|---|---|---|---|---|---|
| Aya23-8B | 0.631 | 0.377 | 0.503 | 0.252 | 0.560 | 0.422 | 0.446 | 0.416 |
| OLMo-7B | 0.664 | 0.301 | 0.279 | 0.233 | 0.557 | 0.046 | 0.368 | 0.369 |
| Gemma-7B | 0.727 | 0.378 | 0.612 | 0.284 | 0.604 | 0.512 | 0.515 | 0.494 |
| Llama2-7B | 0.693 | 0.321 | 0.413 | 0.257 | 0.571 | 0.136 | 0.409 | 0.434 |
| Llama3-8B | 0.729 | 0.355 | 0.620 | 0.276 | 0.601 | 0.501 | 0.434 | 0.504 |
| OPT-1.3B | 0.595 | 0.312 | 0.249 | n/a | 0.4154 | 0.008 | n/a | 0.234 |
TheRootOf3 commented 3 months ago

Each complete run takes ~10 h for 7/8B-parameter models.

TheRootOf3 commented 2 months ago

TL;DR

| Metric | microsoft__phi-1_5 | facebook__opt-1.3b | google__gemma-2b | allenai__OLMo-1B-hf |
|---|---|---|---|---|
| winogrande_acc | 0.729282 | 0.595896 | 0.651144 | 0.599842 |
| truthfulqa_mc2_acc | 0.408653 | 0.386806 | 0.330601 | 0.32943 |
| hellaswag_acc | 0.479685 | 0.414858 | 0.526987 | 0.469727 |
| gsm8k_exact_match_flexible | 0.319181 | 0.0166793 | 0.176649 | 0.0235027 |
| arc_challenge_acc | 0.446246 | 0.232935 | 0.40529 | 0.286689 |
| mmlu_acc | 0.407136 | 0.250677 | 0.328372 | 0.242344 |
| logiqa_acc | 0.239631 | 0.22427 | 0.236559 | 0.242704 |
| french_bench_acc | 0.299816 | 0.310159 | 0.397667 | 0.322455 |
| piqa_acc | 0.76605 | 0.718716 | 0.769859 | 0.750272 |
| squadv2 (f1) | 19.2253 | 12.2118 | 22.1386 | 15.6277 |
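
These comparison tables look like transposed pandas DataFrames; below is a rough sketch of how the per-model lm-eval output files could be collected into one. The `results/<model>/results.json` layout and the `"acc,none"` metric-key format are assumptions that depend on the harness version and how it was invoked:

```python
# Sketch only: aggregate lm-eval results.json files into a metric-vs-model table.
import json
from pathlib import Path

import pandas as pd

# Hypothetical layout: one results.json per model under results/<model_name>/
result_files = {p.parent.name: p for p in Path("results").glob("*/results.json")}

rows = {}
for model_name, path in result_files.items():
    raw = json.loads(path.read_text())["results"]
    # Newer harness versions key metrics as "<metric>,<filter>", e.g. "acc,none".
    rows[model_name] = {
        f"{task}_{metric.split(',')[0]}": value
        for task, metrics in raw.items()
        for metric, value in metrics.items()
        if not metric.startswith("alias")  # skip the non-numeric alias entry
    }

df = pd.DataFrame(rows)  # metrics as rows, models as columns, as in the tables above
print(df.round(3))
```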

Analysis:

Willmish commented 2 months ago
| Model | Task | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|---|
| Gemma 2B Base | mnli | 1 | none | 0 | acc | 0.3979 | ± 0.0049 |
| Phi 1.5 Base | mnli | 1 | none | 0 | acc | 0.514 | ± 0.005 |
| OPT 1.3B | mnli | 1 | none | 0 | acc | 0.3582 | ± 0.0048 |

@Davidyz

TheRootOf3 commented 2 months ago

Update as of 2024-07-01:

TheRootOf3 commented 1 month ago

Update as of 2024-07-10:

Evaluation results (all 0-shot):

| Model name | Winogrande (acc) | TruthfulQA MC2 (acc) | MMLU (acc) | LogiQA (acc) | HellaSwag (acc) | GSM8K (em) | French Bench (acc) | ARC-c (acc) | MNLI (acc) | PIQA (acc) | SQuADv2 (f1) | ToxiGen (acc) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemma-7B | 0.727 | 0.378 | 0.612 | 0.284 | 0.604 | 0.512 | 0.515 | 0.494 | n/a | n/a | n/a | n/a |
| Llama2-7B | 0.693 | 0.321 | 0.413 | 0.257 | 0.571 | 0.136 | 0.409 | 0.434 | n/a | n/a | n/a | n/a |
| Llama3-8B | 0.732 | 0.439 | 0.620 | 0.273 | 0.602 | 0.506 | 0.435 | 0.498 | 0.478 | 0.797 | 32.39 | 0.430 |
| Gemma2-9B | n/a | n/a | n/a | n/a | | | | | | | | |
Willmish commented 1 month ago

Unlearned Llama3-8B, sequential 16; batch 1024; full precision:

winogrande_acc truthfulqa_mc2_acc hellaswag_acc arc_challenge_acc mmlu_acc french_bench_acc piqa_acc squadv2 mnli
20 0.7119179163378059 0.5256920932085242 0.34812286689419797 0.5072639225181598 0.3319351873132613 0.5924918389553863 26.296920409865134 0.356698930208864

4 splits, sample count 1024:

winogrande_acc truthfulqa_mc2_acc hellaswag_acc arc_challenge_acc mmlu_acc french_bench_acc piqa_acc squadv2 mnli
20 0.6985003946329913 0.5198167695678152 0.40784982935153585 0.5401652186298248 0.28223396920248217 0.6528835690968444 29.024508296607614 0.4053998981151299
TheRootOf3 commented 1 month ago

Update as of 2024-07-25:

| Metric | llama3.1-8b | llama3-8b |
|---|---|---|
| beaverdam_flagged | - | 0.334286 |
| winogrande_acc | 0.734807 | 0.732439 |
| truthfulqa_mc2_acc | 0.451701 | 0.43907 |
| hellaswag_acc | 0.600378 | 0.602171 |
| gsm8k_exact_match_flexible | 0.492039 | 0.506444 |
| arc_challenge_acc | 0.511092 | 0.498294 |
| mmlu_acc | 0.62954 | 0.619926 |
| french_bench_acc | 0.43875 | 0.434728 |
| piqa_acc | 0.801415 | 0.796518 |
| squadv2_f1 | 33.0341 | 32.3923 |
| mnli_acc | 0.498319 | 0.478349 |
| toxigen_acc | 0.426596 | 0.429787 |
Willmish commented 1 month ago

PASTING HERE BEFORE WE FIND A BETTER PLACE FOR IT: @TheRootOf3 @Adamliu1 @Davidyz

Below are the results for the final checkpoints after 10240 steps (batch size 2) of continuous unlearning WITH SCHEDULER. Unlearn set: PKU-Harmful; retain set: SQuAD.

TL;DR: the only metrics that really took a hit are safety (beaverdam_flagged), which got worse (more flagged) for lr 5e-7 (0.40) and 1e-6 (0.43) but better for lr 5e-6 (0.14), and GSM8K, where the drop is visible for most learning rates and largest at 5e-6 (0.288!). A sketch for computing these per-metric deltas against the base model follows the table below.

| Metric / Learning rate | 1e-8 | 5e-8 | 1e-7 | 5e-7 | 1e-6 | 5e-6 |
|---|---|---|---|---|---|---|
| beaverdam_flagged | 0.3271428571428571 | 0.33285714285714285 | 0.32 | 0.4014285714285714 | 0.43142857142857144 | 0.14285714285714285 |
| winogrande_acc | 0.7316495659037096 | 0.728492501973165 | 0.7308602999210734 | 0.7316495659037096 | 0.7387529597474349 | 0.739542225730071 |
| truthfulqa_mc2_acc | 0.4288239712886645 | 0.4190040177748152 | 0.4145185982379925 | 0.41006442563278356 | 0.396623651814244 | 0.4076632565867259 |
| hellaswag_acc | 0.6002788289185421 | 0.5970922127066322 | 0.5938060147381 | 0.5994821748655647 | 0.5979884485162318 | 0.5846444931288588 |
| gsm8k_exact_match_flexible | 0.5087187263078089 | 0.5011372251705838 | 0.489764973464746 | 0.4715693707354056 | 0.4670204700530705 | 0.2880970432145565 |
| arc_challenge_acc | 0.5017064846416383 | 0.4872013651877133 | 0.48464163822525597 | 0.5042662116040956 | 0.5093856655290102 | 0.48976109215017066 |
| mmlu_acc | 0.6183592080900157 | 0.619427431989745 | 0.6213502350092579 | 0.6187864976499075 | 0.6177182737501781 | 0.5820395954992167 |
| french_bench_acc | 0.4333486554814985 | 0.43191220409101355 | 0.43024592047805105 | 0.4334061135371179 | 0.43185474603539414 | 0.39847161572052403 |
| piqa_acc | 0.7927094668117519 | 0.7899891186071817 | 0.7916213275299239 | 0.7927094668117519 | 0.7976060935799782 | 0.7872687704026116 |
| squadv2_f1 | 32.2717276192614 | 33.18570231581052 | 32.55705207756308 | 32.510930521021656 | 30.05643833456453 | 31.14084319890086 |
| mnli_acc | 0.4794701986754967 | 0.4813041263372389 | 0.4793683138053999 | 0.5069791136016302 | 0.5008660213958227 | 0.5046357615894039 |
| toxigen_acc | 0.4276595744680851 | 0.42446808510638295 | 0.42340425531914894 | 0.42872340425531913 | 0.4297872340425532 | 0.425531914893617 |
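
To make the "what took a hit" comparison explicit, here is a small sketch that subtracts the base Llama3-8B numbers (from the 2024-07-25 table above) from two of the learning-rate columns. The values are copied (rounded) from this thread; beaverdam_flagged is left out because lower is better for that metric:

```python
# Sketch only: per-metric change of each unlearned checkpoint vs. the base model.
import pandas as pd

# Base Llama3-8B scores, rounded from the 2024-07-25 table above.
base = pd.Series({
    "winogrande_acc": 0.732, "truthfulqa_mc2_acc": 0.439, "hellaswag_acc": 0.602,
    "gsm8k_exact_match_flexible": 0.506, "arc_challenge_acc": 0.498,
    "mmlu_acc": 0.620, "french_bench_acc": 0.435, "piqa_acc": 0.797,
    "mnli_acc": 0.478, "toxigen_acc": 0.430,
})

# Two checkpoints from the learning-rate sweep above, rounded to 3 decimals.
sweep = pd.DataFrame({
    "5e-7": {"winogrande_acc": 0.732, "truthfulqa_mc2_acc": 0.410, "hellaswag_acc": 0.599,
             "gsm8k_exact_match_flexible": 0.472, "arc_challenge_acc": 0.504,
             "mmlu_acc": 0.619, "french_bench_acc": 0.433, "piqa_acc": 0.793,
             "mnli_acc": 0.507, "toxigen_acc": 0.429},
    "5e-6": {"winogrande_acc": 0.740, "truthfulqa_mc2_acc": 0.408, "hellaswag_acc": 0.585,
             "gsm8k_exact_match_flexible": 0.288, "arc_challenge_acc": 0.490,
             "mmlu_acc": 0.582, "french_bench_acc": 0.398, "piqa_acc": 0.787,
             "mnli_acc": 0.505, "toxigen_acc": 0.426},
})

# Positive = improvement over base, negative = regression after unlearning.
delta = sweep.sub(base, axis=0)
print(delta.round(3))
```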
Willmish commented 1 month ago

Also in PNG format: output (plot of the results above attached).

Willmish commented 1 week ago

done (but these are not all base models lol)