Closed by RuiFilipeCampos, 5 months ago
training loss vs relative time, 16 dim, variable depth
There seems to be no advantage in increasing the model depth other than increasing training time.
This is strongly supported by the following graph, which is the same training loss but with step as the x-axis:
This can be correlated with the number of parameters:
So there's a huge time penalty when increasing model depth, but not a lot to gain in terms of model expressiveness, at least at this small scale.
Given how bad the time penalty is, I'm gonna stick to a model depth of 1 for now.
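To make the depth/parameter relationship concrete, here's a rough sketch of why parameter count (and with it, step time) grows linearly with depth. This assumes a standard transformer block (attention projections plus a 4x-expansion feed-forward and two layer norms); the exact counts depend on the actual implementation, so treat the formulas as an approximation.

```python
# Approximate parameter count for one standard transformer block.
# Assumption: standard Q/K/V/output projections, a feed-forward with a
# 4x hidden expansion, and two layer norms; biases included.
def block_params(d: int) -> int:
    attn = 4 * d * d + 4 * d            # Q, K, V, output projections + biases
    ffn = 2 * 4 * d * d + 4 * d + d     # two linear layers, 4x hidden size
    norms = 2 * 2 * d                   # two layer norms (scale + shift)
    return attn + ffn + norms

def model_params(d: int, depth: int) -> int:
    # Stacking blocks multiplies the per-block count by depth.
    return depth * block_params(d)

for depth in (1, 2, 4, 8):
    print(depth, model_params(16, depth))
```

Each extra block adds a fixed number of parameters, so both parameter count and time per step scale linearly with depth, which matches the graphs above.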
the effect of the number of heads varies, but the model clearly prefers 1 head
vs step
vs relative time (blue run not finished)
I'm going to vary the embedding dimension, keeping 1 head and 1 block. If any other value outperforms 16 dimensions, I'll try to see the effect of the number of heads there.
At an embedding dimension of 16, the number of heads is an impactful parameter since it determines the dimension of the per-head projections. At 16 dim and 8 heads, the self-attention module is working with vectors of only two dimensions. I can't really say what effect that has, but I'd imagine that fewer dimensions mean less information to work with when calculating the attention scores.
I also think a reasonable guess is that the time delay seen between the curves is a function of the number of attention heads; depth is an indirect multiplier of that hyperparameter.
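A minimal sketch of the head-dimension point above, assuming the standard multi-head attention split where the embedding is divided evenly across heads:

```python
embed_dim = 16

# In standard multi-head attention the embedding is split across heads,
# so each head's Q/K/V vectors have embed_dim // num_heads dimensions.
for num_heads in (1, 2, 4, 8):
    head_dim = embed_dim // num_heads
    print(f"{num_heads} heads -> {head_dim}-dim vectors per head")
```

At 8 heads, each head is computing attention scores from dot products of 2-dimensional vectors, which leaves very little information per head.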
32dim, 1head, 1block looks promising
and it seems to be the result of the increase in parameter count:
So the performance issues are not related to the parameter count; I'll have to look into my self-attention code.
I'm adding more metrics, I need a clearer picture of what's going on
Overfitting seems to be fully resolved
and this graph from the previous failed run suggests what the trend is gonna be when increasing the embedding dimension
The cross-entropy loss at the end of these runs might be acceptable. At 0.8-1.0, they are already far from 1.6 (a random guess over 5 classes). Even if the model should ideally be more certain in its predictions, the outside world will only care about the argmax index, not the particular logit values that led to it. That is, if the stats of the output classifications are good over 60k samples, the model is trained, even though the loss has a lot of room to improve.
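The 1.6 baseline above is the cross-entropy of a uniform guess over 5 classes, i.e. ln(5). A quick check:

```python
import math

num_classes = 5

# A model that assigns uniform probability 1/num_classes to every class
# has cross-entropy -log(1/num_classes) = log(num_classes).
uniform_loss = math.log(num_classes)
print(uniform_loss)  # ≈ 1.609
```

So a loss of 0.8-1.0 is well below chance level, even if it's far from the near-zero loss of a fully confident, fully correct model.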
accuracy actually reaches 50% and starts creeping upwards
I suppose it's a pretty clear trend; the new run is 512 dim
The problem is that 32 dim still outperforms it, since 512 dim takes more time per gradient descent step.
hard to see due to poor labeling, but the smaller batch sizes are outperforming the larger ones
Gradient accumulation works quite well and doesn't seem to be a factor in performance, though the instability that comes from smaller batches persists under gradient accumulation.
Not by a lot, I suppose, and there's always the question of what would happen if the run had continued. At the moment I'm leaning towards a small batch size.
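Why accumulation shouldn't be a performance factor on its own: averaging micro-batch gradients reproduces the full-batch gradient exactly (up to float error). A toy pure-Python sketch, not the actual training code, using a scalar weight and a squared-error loss:

```python
# Toy model: scalar weight w, per-sample loss (w - t)**2, gradient 2*(w - t).
# This is only a sketch of the accumulation identity, not the real model.
def batch_grad(w, batch):
    return sum(2 * (w - t) for t in batch) / len(batch)

w = 0.5
targets = [0.0, 1.0, 2.0, 3.0]

# One full-batch gradient step's gradient:
full_grad = batch_grad(w, targets)

# Gradient accumulation: two micro-batches of 2, gradients averaged
# before the optimizer step (equivalently, losses scaled by 1/num_micro).
micro_batches = [targets[:2], targets[2:]]
accum_grad = sum(batch_grad(w, mb) for mb in micro_batches) / len(micro_batches)

print(full_grad, accum_grad)  # identical up to float error
```

The stochastic-noise profile, however, is set by the micro-batch size at which samples are drawn, which is consistent with the small-batch instability carrying over.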
bad news coming from the long running experiment
issue has been found
the first layer is causing the whole thing to overfit; the embedder module has to be pre-trained
This is why reducing the model to 1 block and 1 head made it converge faster: it just reduced the distance between the first layer and the output feed-forward.
I'm merging this; next week the infrastructure will be refactored to use Prefect instead of GitHub Actions.