Digital-Defiance / nlp-metaformer

An ablation study on the transformer network for Natural Language Processing

experiment: long running small model (v2) #54

Closed RuiFilipeCampos closed 7 months ago

RuiFilipeCampos commented 7 months ago

Coming from:

The objective of this experiment is to determine the convergence value of a long-running run on a small model. In the case of #52, this value was 1.1.

Based on #48, I've determined that an LR scaling factor of 1.0 is likely the better choice for stability. I've also noticed that the loss of smaller models behaves more deterministically, so I'm not creating several runs for this one: I have no reason (yet) to believe the loss graphs would diverge.
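
For context, the `warmup_steps` and `lr_schedule_scaling` entries in the table below suggest the standard Transformer warmup schedule from Vaswani et al. (2017). A minimal sketch, assuming that form and treating `coordinates` as `d_model` (neither mapping is confirmed from the repo):

```python
def learning_rate(step: int, d_model: int = 200, warmup_steps: int = 4000,
                  scaling: float = 1.0) -> float:
    # Noam-style warmup: linear ramp for `warmup_steps` steps, then 1/sqrt decay.
    step = max(step, 1)  # step 0 would blow up the inverse square root
    return scaling * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The LR peaks when step == warmup_steps (~1.1e-3 with these defaults).
```

With `scaling` at 1.0 the schedule is exactly the paper's; values below 1.0 would flatten the whole curve.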

| Configuration | Value |
| --- | --- |
| attention | metric |
| batch_size | 10 |
| beta_1 | 0.9 |
| beta_2 | 0.98 |
| bias | False |
| coordinates | 200 |
| epsilon | 1e-09 |
| l1_regularization | 0.0 |
| l2_regularization | 0.0 |
| lr_schedule_scaling | 1.0 |
| number_of_blocks | 1 |
| number_of_epochs | 1 |
| number_of_heads | 10 |
| number_of_parameters | 10,582,700 |
| number_of_slices | 50 |
| tokens | 50,263 |
| warmup_steps | 4000 |
| words | 624 |
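
For reference, a sketch of how the optimizer rows of the table would map onto PyTorch, reusing `learning_rate` from the sketch above (assumptions: the repo uses Adam in PyTorch; `model` here is a placeholder, not the actual 10.5M-parameter network):

```python
import torch

model = torch.nn.Linear(200, 200)  # placeholder module, not the repo's model
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1.0,              # base LR of 1.0 so LambdaLR's factor *is* the LR
    betas=(0.9, 0.98),   # beta_1, beta_2 from the table
    eps=1e-09,           # epsilon from the table
    weight_decay=0.0,    # l2_regularization from the table
)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: learning_rate(step)
)
# Call scheduler.step() once per optimizer step to advance the schedule.
```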
RuiFilipeCampos commented 7 months ago

Something is killing the runs:

```
Error: Process completed with exit code 137.
```

No idea what it is at the moment.
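
For what it's worth, exit code 137 is 128 + 9, i.e. the process was SIGKILLed; on CI runners that usually points at the kernel OOM killer. A minimal sketch for checking that theory by logging peak RSS during training (the call site is hypothetical):

```python
import resource

def log_peak_memory(step: int) -> None:
    # ru_maxrss is reported in kilobytes on Linux
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"step {step}: peak RSS {peak_kb / 1024:.1f} MiB", flush=True)

# Hypothetical call site inside the training loop:
# for step, batch in enumerate(loader):
#     loss = training_step(batch)
#     if step % 100 == 0:
#         log_peak_memory(step)
```

If the logged peak climbs steadily until the run dies, the kill is memory-related rather than a timeout or a runner eviction.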