bigcode-project / Megatron-LM

Ongoing research training transformer models at scale

Benchmarking Memory Consumption of Optimizers: Adam vs. Adan #20

Open SivilTaram opened 1 year ago

SivilTaram commented 1 year ago

Benchmarking Results

The memory benchmark was run over the following configurations, recording the peak GPU memory for each optimizer:

| Heads | Layers | Emb. Dim | Model Size (MB) | Adam Peak (MB) | Adan Peak (MB) | $\Delta$ (%) |
|---|---|---|---|---|---|---|
| 6 | 6 | 768 | 81 | 4490 | 4490 | 0 |
| 12 | 6 | 768 | 81 | 5848 | 5848 | 0 |
| 16 | 6 | 768 | 81 | 6776 | 6776 | 0 |
| 6 | 12 | 768 | 124 | 7151 | 7153 | 0.03 |
| 12 | 12 | 768 | 124 | 9869 | 9871 | 0.02 |
| 16 | 12 | 768 | 124 | 11733 | 11735 | 0.02 |
| 16 | 6 | 1024 | 128 | 7302 | 7304 | 0.03 |
| 16 | 12 | 1024 | 203 | 12719 | 12721 | 0.02 |
| 6 | 24 | 768 | 209 | 12471 | 12475 | 0.03 |
| 12 | 24 | 768 | 209 | 17922 | 17922 | 0 |
| 16 | 24 | 768 | 209 | 21596 | 21600 | 0.02 |
| 6 | 6 | 1536 | 248 | 6905 | 8241 | 19.35 |
| 12 | 6 | 1536 | 248 | 8235 | 8539 | 3.69 |
| 16 | 6 | 1536 | 248 | 9141 | 9445 | 3.33 |
| 16 | 24 | 1024 | 354 | 23530 | 23534 | 0.02 |
| 16 | 6 | 2048 | 407 | 11098 | 12159 | 9.56 |
| 6 | 12 | 1536 | 418 | 11137 | 13778 | 23.71 |
| 12 | 12 | 1536 | 418 | 13390 | 14164 | 5.78 |
| 16 | 12 | 1536 | 418 | 15667 | 15976 | 1.97 |
| 16 | 6 | 2560 | 603 | 13967 | 18207 | 30.36 |
| 16 | 12 | 2048 | 709 | 18851 | 20954 | 11.16 |
| 6 | 24 | 1536 | 758 | 19660 | 24819 | 26.24 |
| 12 | 24 | 1536 | 758 | 25096 | 25406 | 1.24 |
| 16 | 24 | 1536 | 758 | 28720 | 29030 | 1.08 |
| 16 | 12 | 2560 | 1075 | 28475 | 32134 | 12.85 |
| 16 | 24 | 2048 | 1313 | 34357 | 38595 | 12.34 |
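
Here $\Delta$ is the relative increase of Adan's peak memory over Adam's. For example, for the 6-head, 6-layer, 1536-dim model:

$$
\Delta = \frac{\text{Adan peak} - \text{Adam peak}}{\text{Adam peak}} \times 100 = \frac{8241 - 6905}{6905} \times 100 \approx 19.35\%
$$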

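The same kind of peak-memory comparison can be reproduced outside Megatron-LM with PyTorch's CUDA allocator statistics. Below is a minimal sketch of the measurement idea only, not the benchmark script behind the table; the small `nn.TransformerEncoder` stack is a stand-in for the GPT configs above, and the `adan_pytorch` import is an assumption to be replaced with whichever Adan implementation you use.

```python
import torch
import torch.nn as nn

def peak_memory_mb(optimizer_cls, steps=3, layers=6, heads=6, emb=768):
    """Peak allocated CUDA memory (MB) over a few training steps."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    # Stand-in for the GPT stacks in the table (Heads / Layers / Emb. Dim knobs).
    layer = nn.TransformerEncoderLayer(d_model=emb, nhead=heads, batch_first=True)
    model = nn.TransformerEncoder(layer, num_layers=layers).cuda()
    opt = optimizer_cls(model.parameters(), lr=1e-4)
    for _ in range(steps):
        x = torch.randn(8, 512, emb, device="cuda")  # (batch, seq, emb)
        loss = model(x).pow(2).mean()                # dummy objective
        opt.zero_grad()
        loss.backward()
        opt.step()  # optimizer state is allocated lazily on the first step
    return torch.cuda.max_memory_allocated() / 2**20

print(f"Adam peak: {peak_memory_mb(torch.optim.Adam):.0f} MB")
# from adan_pytorch import Adan  # hypothetical import; substitute your Adan impl.
# print(f"Adan peak: {peak_memory_mb(Adan):.0f} MB")
```

Since optimizer state is only materialized on the first `step()`, the loop must run at least one full forward/backward/step cycle for the peak to reflect the optimizer's extra buffers.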
Conclusion