Digital-Defiance / nlp-metaformer

An ablation study on the transformer network for Natural Language Processing

experiment: explore effect of model depth #48

Closed. RuiFilipeCampos closed this issue 5 months ago.

RuiFilipeCampos commented 5 months ago

The objective of this experiment is to understand the effect of the number of layers on the overall training performance.

The repeated jobs are there as a sanity check that convergence does not vary significantly from run to run.
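
Roughly, the sweep amounts to the loop below (my own sketch, not the actual CI workflow; `launch_training_job` is a hypothetical stand-in for the GitHub Actions job linked below, and three repeats per depth simply mirrors the run tables further down). Each job is parametrized by `MODEL_NUMBER_OF_BLOCKS`, as in the per-depth comments that follow.

```python
# Sketch of the depth ablation: every depth gets several repeated jobs so that
# run-to-run variation can be told apart from the effect of depth itself.
def launch_training_job(number_of_blocks: int, repeat: int) -> None:
    # Hypothetical helper standing in for the real CI job submission.
    print(f"repeat {repeat}: MODEL_NUMBER_OF_BLOCKS={number_of_blocks}")

for number_of_blocks in range(1, 7):   # depths 1 through 6
    for repeat in range(3):            # repeated jobs as a sanity check
        launch_training_job(number_of_blocks, repeat)
```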

RUN: https://github.com/Digital-Defiance/llm-voice-chat/actions/runs/7862144872

[plot: lr(step)]

Hyperparameters

The number of blocks is the only difference between the models (I've also added the GPU memory pressure during training as a reference):

| Number of Blocks | Number of Parameters | GPU Memory Usage (%) | GPU Utilization (%) |
|---|---|---|---|
| 6 | 12,596,200 | 93.3 | 100 |
| 5 | 12,193,300 | 80.7 | 98 |
| 4 | 11,790,400 | 66.4 | 98 |
| 3 | 11,387,500 | 56.0 | 93 |
| 2 | 10,984,600 | 43.5 | 92 |
| 1 | 10,581,700 | 29.0 | 96 |
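
From the parameter counts above, each additional block adds a constant (12,596,200 - 10,581,700) / 5 = 402,900 parameters, which is consistent with depth being the only thing that changes between the models.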
| Configuration | Value |
|---|---|
| attention | metric |
| batch_size | 10 |
| beta_1 | 0.9 |
| beta_2 | 0.98 |
| bias | False |
| coordinates | 200 |
| epsilon | 1e-09 |
| l1_regularization | 0.0 |
| l2_regularization | 0.0 |
| lr_schedule_scaling | 100.0 |
| number_of_blocks | VARIABLE |
| number_of_epochs | 1 |
| number_of_heads | 20 |
| number_of_parameters | VARIABLE |
| number_of_slices | 5 |
| tokens | 50,263 |
| warmup_steps | 4000 |
| words | 624 |
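
The lr(step) curve above, together with warmup_steps = 4000 and the Adam betas (0.9, 0.98, eps 1e-09), looks consistent with the warmup-then-inverse-square-root schedule from the original Transformer paper. Below is a minimal sketch of that schedule, assuming coordinates is the model width and that lr_schedule_scaling is a plain multiplicative factor; both assumptions are mine, not taken from the repo.

```python
# Sketch of an inverse-square-root warmup schedule (assumed form, not the repo's code).
def lr(step: int, d_model: int = 200, warmup_steps: int = 4000, scaling: float = 100.0) -> float:
    step = max(step, 1)  # avoid step ** -0.5 blowing up at step 0
    return scaling * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate ramps up linearly until warmup_steps, then decays as 1/sqrt(step).
print(lr(1), lr(4000), lr(40000))
```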

Related experiments:

RuiFilipeCampos commented 5 months ago

MODEL_NUMBER_OF_BLOCKS=1

[plot: loss/train(step)]

Attachments: metrics(5).csv

RuiFilipeCampos commented 5 months ago

MODEL_NUMBER_OF_BLOCKS=2

[plot: loss/train(step)]

Observations

RuiFilipeCampos commented 5 months ago

MODEL_NUMBER_OF_BLOCKS=3

[plot: loss/train(step)]

| Run ID | Run Name | Start Time | End Time | Duration |
|---|---|---|---|---|
| 2d1caeb0ecda46fca85e3869673acc56 | c1f6180b8e7c458fd6730285d6f1cfa8c038cac6-10 | 2024-02-11 16:12:14 | 2024-02-11 17:05:46 | 53.5 min |
| 6585b9afae7d4ad4829d8fb99f99ac0f | c1f6180b8e7c458fd6730285d6f1cfa8c038cac6-10 | 2024-02-11 17:06:11 | 2024-02-11 17:59:34 | 53.4 min |
| 26d54667a00f4732b99dacc293e2166c | c1f6180b8e7c458fd6730285d6f1cfa8c038cac6-10 | 2024-02-11 18:00:00 | 2024-02-11 18:53:33 | 53.5 min |

RuiFilipeCampos commented 5 months ago

I need to get this variation under control, otherwise there's no way to predict what each hyperparameter does.

I really don't want to set seeds, as that runs the risk of just masking the issue.
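
One way to at least measure the variation without touching the seeds is to compare the repeated runs directly from the exported metrics CSVs. A minimal sketch, with hypothetical file paths and column name (the attached metrics(5).csv presumably has its own layout):

```python
import pandas as pd

# Hypothetical paths to the metrics exported by the three repeated runs.
runs = ["metrics_run1.csv", "metrics_run2.csv", "metrics_run3.csv"]

final_losses = []
for path in runs:
    df = pd.read_csv(path)
    final_losses.append(df["loss/train"].iloc[-1])  # assumed column name

losses = pd.Series(final_losses)
print(f"final train loss across repeats: mean={losses.mean():.4f}, std={losses.std():.4f}")
```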

RuiFilipeCampos commented 5 months ago

The solution is to first run N short-lived training loops with the same hyperparameters and pick the best one as the initialization for a long-running training loop.
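
A minimal sketch of that idea in PyTorch (my own illustration, not the repo's code; the model factory, data iterator, and loss function are placeholders):

```python
import torch

def short_run(model, steps, data_iter, loss_fn, lr=1e-3):
    """Train `model` for a few steps; return its final loss and weights."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss = torch.tensor(float("inf"))
    for _ in range(steps):
        x, y = next(data_iter)
        loss = loss_fn(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item(), model.state_dict()

def pick_best_init(make_model, n, steps, data_iter, loss_fn):
    """Run n short-lived loops from independent random inits and keep the best one."""
    runs = [short_run(make_model(), steps, data_iter, loss_fn) for _ in range(n)]
    _, best_state = min(runs, key=lambda r: r[0])
    return best_state  # load this into the model that starts the long-running loop
```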

RuiFilipeCampos commented 5 months ago

MODEL_NUMBER_OF_BLOCKS=4

[plot]

| Run ID | Run Name | Start Time | End Time | Duration |
|---|---|---|---|---|
| 73ae972fdc2d47a3a1d675716fcbba52 | c1f6180b8e7c458fd6730285d6f1cfa8c038cac6-10 | 2024-02-11 18:54:00 | 2024-02-11 20:04:44 | 1.2 h |
| bb61dabb09514ad3b125f2e16a607136 | c1f6180b8e7c458fd6730285d6f1cfa8c038cac6-10 | 2024-02-11 20:05:09 | 2024-02-11 21:15:37 | 1.2 h |
| 0f6712a874a94cfcabb89e50e7c16256 | c1f6180b8e7c458fd6730285d6f1cfa8c038cac6-10 | 2024-02-11 21:16:03 | 2024-02-11 22:26:37 | 1.2 h |

MODEL_NUMBER_OF_BLOCKS=5

[plot]

| Run ID | Run Name | Start Time | End Time | Duration |
|---|---|---|---|---|
| 51aadc4ed0ea4b169ea54d5ce290a4e7 | c1f6180b8e7c458fd6730285d6f1cfa8c038cac6-10 | 2024-02-11 22:27:03 | 2024-02-11 23:54:29 | 1.5 h |
| 99a2f0991d9b4e27bd1b6acc1b2e2b9b | c1f6180b8e7c458fd6730285d6f1cfa8c038cac6-10 | 2024-02-11 23:54:54 | 2024-02-12 01:22:40 | 1.5 h |
| 3d70a793f5ae4e6d8a79b4e200c6976e | c1f6180b8e7c458fd6730285d6f1cfa8c038cac6-10 | 2024-02-12 01:23:05 | 2024-02-12 02:50:44 | 1.5 h |

RuiFilipeCampos commented 5 months ago

MODEL_NUMBER_OF_BLOCKS=1

[plot: loss/train(step)]

Attachments: metrics(5).csv

I'm going to start a long-running experiment for this one.

This configuration seems stable, and comparing it to https://github.com/Digital-Defiance/llm-voice-chat/pull/52 could prove to be very useful.