Digital-Defiance / nlp-metaformer

An ablation study on the transformer network for Natural Language Processing

experiment: base case #60

Closed. RuiFilipeCampos closed this issue 5 months ago.

RuiFilipeCampos commented 5 months ago

[plot: newplot(65)]

RuiFilipeCampos commented 5 months ago

training loss vs relative time, 16 dim, variable depth

[plot: newplot(66)]

There seems to be no advantage in increasing the model depth other than increasing training time.

This is strongly supported by the following graph, which is the same training loss but with step as the x-axis:

[plot: newplot(67)]

This can be correlated with the number of parameters:

[screenshot: 2024-02-17-072759_1900x135_scrot]

So there's a huge time penalty when increasing model depth, but not much to gain in terms of model expressiveness, at least at this small scale.

Given how bad the time penalty is, I'm gonna stick to a model depth of 1 for now.
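
For context on what depth multiplies, here's a minimal sketch of a variable-depth stack (illustrative names and layers built from `nn.TransformerEncoderLayer`, not the repo's actual code): each extra block is one more full attention + feed-forward pass, so per-step time grows roughly linearly with depth.

```python
import torch.nn as nn

class TinyTransformer(nn.Module):
    """Illustrative stack: per-step compute grows roughly linearly with depth."""

    def __init__(self, vocab_size: int, dim: int = 16, depth: int = 1, heads: int = 1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # each block adds a full attention + feed-forward pass, so depth
        # multiplies the per-step cost without guaranteeing better loss
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
            )
            for _ in range(depth)
        )
        self.classifier = nn.Linear(dim, 5)  # 5 output classes, per the runs below

    def forward(self, tokens):
        x = self.embed(tokens)                 # (batch, seq, dim)
        for block in self.blocks:
            x = block(x)                       # one attention + FFN pass per block
        return self.classifier(x.mean(dim=1)) # mean-pool over sequence, classify
```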

RuiFilipeCampos commented 5 months ago

The effect of the number of heads is mixed, but the model clearly prefers 1 head.

vs step

[plot: newplot(68)]

vs relative time (blue run not finished)

[plot: newplot(69)]

I'm going to vary the embedding dimension, keeping 1 head and 1 block; if any other value outperforms 16 dimensions, I'll check the effect of the number of heads there.

At an embedding dimension of 16, the number of heads is an impactful parameter, since it determines the dimension of the projections: at 16 dim and 8 heads, the self-attention module is working with vectors of only two dimensions. I can't really say what kind of effect that has, but I'd imagine that fewer dimensions mean less information to work with when calculating the attention scores.
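
The arithmetic behind that two-dimension figure, assuming the standard multi-head convention of splitting the embedding evenly across heads:

```python
embed_dim = 16

# standard multi-head attention splits the embedding evenly across heads
for num_heads in (1, 2, 4, 8):
    head_dim = embed_dim // num_heads
    print(f"{num_heads} head(s) -> {head_dim}-dim queries/keys/values per head")

# 1 head(s) -> 16-dim queries/keys/values per head
# ...
# 8 head(s) -> 2-dim queries/keys/values per head
```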

I also think a reasonable guess is that the time delay seen between the curves is gonna be a function of the number of attention heads; depth is an indirect multiplier of that hyperparameter.

RuiFilipeCampos commented 5 months ago

[plot: newplot(70)]

32dim, 1head, 1block looks promising

and it seems to be the result of the increase in parameter count:

[screenshot: 2024-02-17-112251_1906x172_scrot]

So this suggests that the performance issues are not related to parameter count; I'll have to look into my self-attention code.
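
For auditing the attention code, a textbook scaled dot-product reference to diff against (this is the standard formulation, not a copy of the repo's implementation):

```python
import math
import torch

def reference_attention(q, k, v):
    """Textbook scaled dot-product attention; q, k, v: (batch, heads, seq, head_dim)."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (batch, heads, seq, seq)
    weights = scores.softmax(dim=-1)  # each query's distribution over the keys
    return weights @ v                # weighted sum of value vectors
```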

RuiFilipeCampos commented 5 months ago

I'm adding more metrics, I need a clearer picture of what's going on

Overfitting seems to be fully resolved.

And this graph from the previous failed run seems to paint a picture of what the trend is gonna be when increasing the embedding dimension:

[plot: newplot(71)]

The value of the cross entropy loss at the end of these runs might be acceptable. At 0.8-1.0, they are already far from 1.6 (a random guess over 5 classes). Even if the model should be expected to have more certainty in its predictions, the outside world will only care about the argmax index, not the particular logit values that lead to it. That is, if the stats of the output classifications are good over 60k samples, the model is trained, even though the loss has a lot of room to improve.
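
A quick check of the numbers here (the logits below are made up for illustration): uniform guessing over 5 classes costs ln 5 ≈ 1.609 nats, and a prediction can be argmax-correct while the loss is still far from zero.

```python
import math
import torch
import torch.nn.functional as F

print(math.log(5))  # 1.609... -> cross entropy of a uniform guess over 5 classes

# a low-confidence but argmax-correct prediction: the loss sits near 0.9,
# yet downstream consumers only ever see the (correct) argmax index
logits = torch.tensor([[1.0, 0.2, 0.1, 0.0, -0.3]])
target = torch.tensor([0])
print(F.cross_entropy(logits, target).item())  # ~0.91
print(logits.argmax(dim=-1) == target)         # tensor([True])
```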

RuiFilipeCampos commented 5 months ago

accuracy actually reaches 50% and starts creeping upwards

RuiFilipeCampos commented 5 months ago

[plot: newplot(72)]

I suppose it's a pretty clear trend; the new one is 512 dim.

The problem is that 32 dim still outperforms it, because the larger model takes more time per gradient descent step.
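
Back-of-envelope on why the 512-dim run pays so much per step: the attention and feed-forward projections are roughly `dim x dim` matmuls, so their cost grows about quadratically with the embedding dimension (an estimate, ignoring sequence-length and memory effects):

```python
# rough relative cost of the projection matmuls, which scale ~quadratically
# with embedding dimension (back-of-envelope, ignoring sequence-length terms)
for dim in (16, 32, 512):
    rel = (dim / 16) ** 2
    print(f"{dim:>3} dim -> ~{rel:.0f}x the 16-dim projection cost")
```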

RuiFilipeCampos commented 5 months ago

Hard to see due to poor labeling, but the smaller batch sizes are outperforming the larger ones.

Gradient accumulation works quite well and doesn't seem to be a factor in performance, though the instability that comes from smaller batches persists under accumulation.

[plot: newplot(78)]

Not by a lot, I suppose, and there's always the question of what would have happened if the run had continued. At the moment I'm leaning towards a small batch size.

[plots: newplot(73), newplot(74), newplot(75), newplot(80), newplot(81), newplot(77), newplot(82)]
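
For reference, the gradient-accumulation pattern discussed above, as a minimal sketch (a standard PyTorch loop; `model`, `optimizer`, and `loader` are placeholders, and `accum_steps` is illustrative). The accumulated update averages like one large batch, but each micro-batch still computes its own noisy loss, which is what shows up in the per-step curves:

```python
import torch.nn.functional as F

accum_steps = 8  # micro-batches folded into one optimizer step (illustrative value)

optimizer.zero_grad()
for step, (tokens, labels) in enumerate(loader):  # model/optimizer/loader are placeholders
    loss = F.cross_entropy(model(tokens), labels)
    (loss / accum_steps).backward()  # scale so summed grads average like one big batch
    if (step + 1) % accum_steps == 0:
        optimizer.step()       # single update per accum_steps micro-batches
        optimizer.zero_grad()
```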

RuiFilipeCampos commented 5 months ago

Bad news coming from the long-running experiment:

[plot: newplot(83)]

RuiFilipeCampos commented 5 months ago

[screenshot: 2024-02-18-193248_1191x643_scrot]

The issue has been found.

The first layer is causing the whole thing to overfit; the embedder module has to be pre-trained.

This is why reducing the model to 1 block and 1 head made it converge faster: it just reduced the distance between the first layer and the output feed-forward.
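
If pre-training the embedder is the fix, one possible shape of it (a sketch with hypothetical names and paths, not the repo's code): load the first layer from a separate pre-training run and freeze it so the classification gradients can't drive it into overfitting.

```python
import torch
import torch.nn as nn

embedder = nn.Embedding(num_embeddings=50_000, embedding_dim=16)  # illustrative sizes

# load weights from a separate pre-training run (path and file are hypothetical)
embedder.load_state_dict(torch.load("embedder_pretrained.pt"))

# freeze the first layer so classifier gradients can no longer overfit it
embedder.weight.requires_grad_(False)
```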

RuiFilipeCampos commented 5 months ago

I'm merging this; next week the infrastructure will be refactored to use Prefect instead of GitHub Actions.