There's definitely some memory leakage going on.
The cleanup tasks claim to be successful, yet the leak persists.
Increasing the model size is not helping at all: the runs keep converging to the same loss value, regardless of size.
I'm going to increase the warmup period to 15k steps.
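For reference, a minimal sketch of what a 15k-step linear warmup could look like with PyTorch's `LambdaLR`; the model, optimizer, and learning rate below are placeholders, not the ones from this repo.

```python
# Hedged sketch: linear LR warmup over 15k steps via LambdaLR.
# The model and optimizer here are stand-ins for the real pipeline.
import torch

model = torch.nn.Linear(256, 256)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

WARMUP_STEPS = 15_000

def warmup_lambda(step: int) -> float:
    # Scale the LR linearly from ~0 up to its full value over the warmup period.
    return min(1.0, (step + 1) / WARMUP_STEPS)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_lambda)

# In the training loop, after each optimizer.step():
# scheduler.step()
```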
I'm going to need to consider other sources of memory leakage.
If the results backend is being flushed (twice), then how come Flower is able to print the results out?
The fixed seed has been confirmed.
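A minimal sketch of one way to pin all the relevant RNGs for a run like this; the seed value is arbitrary and the exact set of calls is an assumption, not necessarily what this repo does.

```python
# Hedged sketch: seed the standard RNG sources for reproducibility.
import random
import numpy as np
import torch

SEED = 42  # arbitrary; not the repo's actual seed

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)  # no-op if no GPU is present
```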
The memory leak appears to be solved.
Batch size appears to be the critical bit here: blue is the 1024 run, and its slope has converged to a higher value.
Increasing the batch size has definitely improved things.
The new run is using a mini-batch size of 4×1024; it should take quite a while to get results out, though. (See the accumulation sketch below.)
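A minimal sketch of how a 4×1024 effective batch could be assembled with gradient accumulation, assuming a standard PyTorch loop; `model`, `optimizer`, `loss_fn`, and `loader` below are placeholders for the real pipeline.

```python
# Hedged sketch: effective batch of 4 x 1024 via gradient accumulation.
import torch
from torch import nn

# Placeholders standing in for the real pipeline:
model = nn.Linear(256, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()
loader = [(torch.randn(1024, 256), torch.randint(0, 2, (1024,))) for _ in range(8)]

ACCUM_STEPS = 4  # 4 micro-batches of 1024 -> effective batch of 4096

optimizer.zero_grad()
for i, (x, y) in enumerate(loader):  # loader yields micro-batches of 1024
    loss = loss_fn(model(x), y) / ACCUM_STEPS  # scale so gradients average correctly
    loss.backward()
    if (i + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()
```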
GPU usage is peaking at 70%, with no leakage over the 6h run.
CPU memory percentage shows no leakage either.
Something fishy is going on with disk usage, though.
I reckon it must be the .npy files somehow.
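One quick way to confirm that suspicion would be to tally the .npy files on disk; a hedged sketch below, where the data directory path is a placeholder.

```python
# Hedged sketch: measure how much disk the .npy files are actually consuming.
from pathlib import Path

data_dir = Path("data")  # placeholder for wherever the .npy files land
npy_files = sorted(data_dir.rglob("*.npy"))
total_bytes = sum(f.stat().st_size for f in npy_files)
print(f"{len(npy_files)} .npy files, {total_bytes / 1e9:.2f} GB total")
```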
I'm going to choose (256 dim, 16 heads, 6 blocks, 50 ctx) as my base case. It fits in the 16 GB with room for a large batch size.
The convergence is slow, but it is there and it is stable.
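For clarity, a hedged sketch of that base case written out as a config; the field names are illustrative assumptions, not the repo's actual config object.

```python
# Hedged sketch: the (256 dim, 16 heads, 6 blocks, 50 ctx) base case as a config.
from dataclasses import dataclass

@dataclass
class ModelConfig:
    d_model: int = 256   # embedding dimension
    n_heads: int = 16    # attention heads (head dim = 256 / 16 = 16)
    n_blocks: int = 6    # transformer blocks
    ctx_len: int = 50    # context length

base_case = ModelConfig()
```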
There are several paths for optimization:
https://arxiv.org/pdf/2302.13971.pdf (LLaMA)
Will get the pipeline working and merge this right afterwards; the PR is way too big right now.
e4bd5de