Digital-Defiance / nlp-metaformer

An ablation study on the transformer network for Natural Language Processing

experiment: explore effect of model depth #48

Closed. RuiFilipeCampos closed this issue 5 months ago.

RuiFilipeCampos commented 5 months ago

The objective of this experiment is to understand the effect of the number of layers on the overall training performance.

The repeated jobs are there as a sanity check that convergence does not vary significantly from run to run.
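
Roughly, the sweep amounts to the loop below (my own sketch, not the actual CI workflow; `launch_training_job` is a hypothetical stand-in for the GitHub Actions job linked below, and three repeats per depth simply mirrors the run tables further down). Each job is parametrized by `MODEL_NUMBER_OF_BLOCKS`, as in the per-depth comments that follow.

```python
# Sketch of the depth ablation: every depth gets several repeated jobs so that
# run-to-run variation can be told apart from the effect of depth itself.
def launch_training_job(number_of_blocks: int, repeat: int) -> None:
    # Hypothetical helper standing in for the real CI job submission.
    print(f"repeat {repeat}: MODEL_NUMBER_OF_BLOCKS={number_of_blocks}")

for number_of_blocks in range(1, 7):   # depths 1 through 6
    for repeat in range(3):            # repeated jobs as a sanity check
        launch_training_job(number_of_blocks, repeat)
```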

RUN: https://github.com/Digital-Defiance/llm-voice-chat/actions/runs/7862144872

[plot: lr(step)]

Hyperparameters

The number of blocks is the only difference between the models (I've also added the GPU memory pressure during training as a reference):

| Number of Blocks | Number of Parameters | GPU Memory Usage (%) | GPU Utilization (%) |
|---|---|---|---|
| 6 | 12,596,200 | 93.3 | 100 |
| 5 | 12,193,300 | 80.7 | 98 |
| 4 | 11,790,400 | 66.4 | 98 |
| 3 | 11,387,500 | 56.0 | 93 |
| 2 | 10,984,600 | 43.5 | 92 |
| 1 | 10,581,700 | 29.0 | 96 |
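
From the parameter counts above, each additional block adds a constant (12,596,200 - 10,581,700) / 5 = 402,900 parameters, which is consistent with depth being the only thing that changes between the models.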
| Configuration | Value |
|---|---|
| attention | metric |
| batch_size | 10 |
| beta_1 | 0.9 |
| beta_2 | 0.98 |
| bias | False |
| coordinates | 200 |
| epsilon | 1e-09 |
| l1_regularization | 0.0 |
| l2_regularization | 0.0 |
| lr_schedule_scaling | 100.0 |
| number_of_blocks | VARIABLE |
| number_of_epochs | 1 |
| number_of_heads | 20 |
| number_of_parameters | VARIABLE |
| number_of_slices | 5 |
| tokens | 50,263 |
| warmup_steps | 4000 |
| words | 624 |
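
The lr(step) curve above, together with warmup_steps = 4000 and the Adam betas (0.9, 0.98, eps 1e-09), looks consistent with the warmup-then-inverse-square-root schedule from the original Transformer paper. Below is a minimal sketch of that schedule, assuming coordinates is the model width and that lr_schedule_scaling is a plain multiplicative factor; both assumptions are mine, not taken from the repo.

```python
# Sketch of an inverse-square-root warmup schedule (assumed form, not the repo's code).
def lr(step: int, d_model: int = 200, warmup_steps: int = 4000, scaling: float = 100.0) -> float:
    step = max(step, 1)  # avoid step ** -0.5 blowing up at step 0
    return scaling * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate ramps up linearly until warmup_steps, then decays as 1/sqrt(step).
print(lr(1), lr(4000), lr(40000))
```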

Related experiments:

RuiFilipeCampos commented 5 months ago

MODEL_NUMBER_OF_BLOCKS=1

[plot: loss/train(step)]

Attachments: metrics(5).csv

RuiFilipeCampos commented 5 months ago

MODEL_NUMBER_OF_BLOCKS=2

[plot: loss/train(step)]

Observations

RuiFilipeCampos commented 5 months ago

MODEL_NUMBER_OF_BLOCKS=3

[plot: loss/train(step)]

| Run ID | Run Name | Start Time | End Time | Duration |
|---|---|---|---|---|
| 2d1caeb0ecda46fca85e3869673acc56 | c1f6180b8e7c458fd6730285d6f1cfa8c038cac6-10 | 2024-02-11 16:12:14 | 2024-02-11 17:05:46 | 53.5 min |
| 6585b9afae7d4ad4829d8fb99f99ac0f | c1f6180b8e7c458fd6730285d6f1cfa8c038cac6-10 | 2024-02-11 17:06:11 | 2024-02-11 17:59:34 | 53.4 min |
| 26d54667a00f4732b99dacc293e2166c | c1f6180b8e7c458fd6730285d6f1cfa8c038cac6-10 | 2024-02-11 18:00:00 | 2024-02-11 18:53:33 | 53.5 min |

RuiFilipeCampos commented 5 months ago

I need to get this variation under control, otherwise there's no way to predict what each hyperparameter does.

I really don't want to set seeds, as that runs the risk of just masking the issue.
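
One way to at least measure the variation without touching the seeds is to compare the repeated runs directly from the exported metrics CSVs. A minimal sketch, with hypothetical file paths and column name (the attached metrics(5).csv presumably has its own layout):

```python
import pandas as pd

# Hypothetical paths to the metrics exported by the three repeated runs.
runs = ["metrics_run1.csv", "metrics_run2.csv", "metrics_run3.csv"]

final_losses = []
for path in runs:
    df = pd.read_csv(path)
    final_losses.append(df["loss/train"].iloc[-1])  # assumed column name

losses = pd.Series(final_losses)
print(f"final train loss across repeats: mean={losses.mean():.4f}, std={losses.std():.4f}")
```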

RuiFilipeCampos commented 5 months ago

The solution is to first run N short-lived training loops with the same hyperparameters and pick the best one as the initialization for a long-running training loop.
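
A minimal sketch of that idea in PyTorch (my own illustration, not the repo's code; the model factory, data iterator, and loss function are placeholders):

```python
import torch

def short_run(model, steps, data_iter, loss_fn, lr=1e-3):
    """Train `model` for a few steps; return its final loss and weights."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss = torch.tensor(float("inf"))
    for _ in range(steps):
        x, y = next(data_iter)
        loss = loss_fn(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item(), model.state_dict()

def pick_best_init(make_model, n, steps, data_iter, loss_fn):
    """Run n short-lived loops from independent random inits and keep the best one."""
    runs = [short_run(make_model(), steps, data_iter, loss_fn) for _ in range(n)]
    _, best_state = min(runs, key=lambda r: r[0])
    return best_state  # load this into the model that starts the long-running loop
```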

RuiFilipeCampos commented 5 months ago

MODEL_NUMBER_OF_BLOCKS=4

[plot]

| Run ID | Run Name | Start Time | End Time | Duration |
|---|---|---|---|---|
| 73ae972fdc2d47a3a1d675716fcbba52 | c1f6180b8e7c458fd6730285d6f1cfa8c038cac6-10 | 2024-02-11 18:54:00 | 2024-02-11 20:04:44 | 1.2 h |
| bb61dabb09514ad3b125f2e16a607136 | c1f6180b8e7c458fd6730285d6f1cfa8c038cac6-10 | 2024-02-11 20:05:09 | 2024-02-11 21:15:37 | 1.2 h |
| 0f6712a874a94cfcabb89e50e7c16256 | c1f6180b8e7c458fd6730285d6f1cfa8c038cac6-10 | 2024-02-11 21:16:03 | 2024-02-11 22:26:37 | 1.2 h |

MODEL_NUMBER_OF_BLOCKS=5

[plot]

| Run ID | Run Name | Start Time | End Time | Duration |
|---|---|---|---|---|
| 51aadc4ed0ea4b169ea54d5ce290a4e7 | c1f6180b8e7c458fd6730285d6f1cfa8c038cac6-10 | 2024-02-11 22:27:03 | 2024-02-11 23:54:29 | 1.5 h |
| 99a2f0991d9b4e27bd1b6acc1b2e2b9b | c1f6180b8e7c458fd6730285d6f1cfa8c038cac6-10 | 2024-02-11 23:54:54 | 2024-02-12 01:22:40 | 1.5 h |
| 3d70a793f5ae4e6d8a79b4e200c6976e | c1f6180b8e7c458fd6730285d6f1cfa8c038cac6-10 | 2024-02-12 01:23:05 | 2024-02-12 02:50:44 | 1.5 h |

RuiFilipeCampos commented 5 months ago

MODEL_NUMBER_OF_BLOCKS=1

[plot: loss/train(step)]

Attachments: metrics(5).csv

I'm going to start a long-running experiment for this one.

This configuration seems stable, and comparing it to https://github.com/Digital-Defiance/llm-voice-chat/pull/52 could prove to be very useful.