Closed RuiFilipeCampos closed 5 months ago
Run ID | Run Name | Start Time | End Time | Duration |
---|---|---|---|---|
2d1caeb0ecda46fca85e3869673acc56 | c1f6180b8e7c458fd6730285d6f1cfa8c038cac6-10 | 2024-02-11 16:12:14 | 2024-02-11 17:05:46 | 53.5min |
6585b9afae7d4ad4829d8fb99f99ac0f | c1f6180b8e7c458fd6730285d6f1cfa8c038cac6-10 | 2024-02-11 17:06:11 | 2024-02-11 17:59:34 | 53.4min |
26d54667a00f4732b99dacc293e2166c | c1f6180b8e7c458fd6730285d6f1cfa8c038cac6-10 | 2024-02-11 18:00:00 | 2024-02-11 18:53:33 | 53.5min |
I need to get this variation under control; otherwise, there's no way to predict what each hyperparameter does.
I really don't want to set seeds, as that risks just masking the issue.
The solution is to first run N short-lived training loops with the same hyperparameters and pick the best one as the initialization of a long-running training loop.
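A minimal sketch of what that could look like; `build_model`, `short_train`, and `evaluate` are hypothetical helpers standing in for the real training code in this repo, and the step count is illustrative:

```python
import copy

def best_of_n_warm_start(build_model, short_train, evaluate, n=3, warmup_steps=500):
    """Run n short-lived training loops with identical hyperparameters and
    return the weights of the best one, to seed the long-running loop."""
    best_state, best_loss = None, float("inf")
    for _ in range(n):
        model = build_model()                   # fresh random initialization
        short_train(model, steps=warmup_steps)  # short-lived training loop
        loss = evaluate(model)                  # e.g. loss/train at the last step
        if loss < best_loss:
            best_loss = loss
            best_state = copy.deepcopy(model.state_dict())
    return best_state, best_loss

# The long-running loop would then start from the selected initialization:
# model = build_model()
# model.load_state_dict(best_state)
# long_train(model)
```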
Run ID | Run Name | Start Time | End Time | Duration |
---|---|---|---|---|
73ae972fdc2d47a3a1d675716fcbba52 | c1f6180b8e7c458fd6730285d6f1cfa8c038cac6-10 | 2024-02-11 18:54:00 | 2024-02-11 20:04:44 | 1.2h |
bb61dabb09514ad3b125f2e16a607136 | c1f6180b8e7c458fd6730285d6f1cfa8c038cac6-10 | 2024-02-11 20:05:09 | 2024-02-11 21:15:37 | 1.2h |
0f6712a874a94cfcabb89e50e7c16256 | c1f6180b8e7c458fd6730285d6f1cfa8c038cac6-10 | 2024-02-11 21:16:03 | 2024-02-11 22:26:37 | 1.2h |
Run ID | Run Name | Start Time | End Time | Duration |
---|---|---|---|---|
51aadc4ed0ea4b169ea54d5ce290a4e7 | c1f6180b8e7c458fd6730285d6f1cfa8c038cac6-10 | 2024-02-11 22:27:03 | 2024-02-11 23:54:29 | 1.5h |
99a2f0991d9b4e27bd1b6acc1b2e2b9b | c1f6180b8e7c458fd6730285d6f1cfa8c038cac6-10 | 2024-02-11 23:54:54 | 2024-02-12 01:22:40 | 1.5h |
3d70a793f5ae4e6d8a79b4e200c6976e | c1f6180b8e7c458fd6730285d6f1cfa8c038cac6-10 | 2024-02-12 01:23:05 | 2024-02-12 02:50:44 | 1.5h |
`MODEL_NUMBER_OF_BLOCKS=1`
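As a sketch of how a hyperparameter exposed this way could be consumed by the training code (the `ModelConfig` dataclass below is an assumption, not the repo's actual configuration code):

```python
import os
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Number of transformer blocks; overridable from the environment so the
    # CI jobs can vary it without code changes.
    number_of_blocks: int = int(os.environ.get("MODEL_NUMBER_OF_BLOCKS", "1"))

config = ModelConfig()
print(config.number_of_blocks)  # -> 1 when MODEL_NUMBER_OF_BLOCKS=1
```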
[Chart: loss/train(step)]
I'm going to start a long-running experiment for this one.
This configuration seems stable, and comparing it to https://github.com/Digital-Defiance/llm-voice-chat/pull/52 could prove very useful.
The objective of this experiment is to understand the effect of the number of layers on overall training performance.
The repeated jobs serve as a sanity check that the convergence is not purely stochastic.
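A rough sketch of how the sweep with repeated jobs could look, assuming a hypothetical `run_training` callable that wraps the real training loop and returns the final loss; the depth values and repeat count are illustrative:

```python
def sweep_number_of_blocks(run_training, block_counts=(1, 2, 4), n_repeats=3):
    """Train each depth setting several times so that run-to-run variance can
    be separated from the effect of the number of blocks itself.

    `run_training` is a hypothetical callable wrapping the real training loop;
    it receives the block count and returns the final training loss.
    """
    results = {}
    for number_of_blocks in block_counts:
        results[number_of_blocks] = [
            run_training(model_number_of_blocks=number_of_blocks)
            for _ in range(n_repeats)
        ]
    return results  # compare spread vs. mean for each depth setting
```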
RUN: https://github.com/Digital-Defiance/llm-voice-chat/actions/runs/7862144872
[Chart: lr(step)]
Hyperparameters
This is the only difference between the models (I've also added the GPU memory pressure during training as a reference):
Related experiments: