There's definitely some memory leakage going on.
The cleanup tasks claim to be successful, yet the leak persists.
Increasing the model size is not helping at all: the runs keep converging to the same loss value, regardless of size.
I'm going to increase the warmup period to 15k steps.
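For reference, a minimal sketch of what a 15k-step linear warmup could look like with PyTorch's `LambdaLR`; the model, optimizer, and learning rate below are placeholders, not the ones from this repo.

```python
# Hedged sketch: linear LR warmup over 15k steps via LambdaLR.
# The model and optimizer here are stand-ins for the real pipeline.
import torch

model = torch.nn.Linear(256, 256)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

WARMUP_STEPS = 15_000

def warmup_lambda(step: int) -> float:
    # Scale the LR linearly from ~0 up to its full value over the warmup period.
    return min(1.0, (step + 1) / WARMUP_STEPS)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_lambda)

# In the training loop, after each optimizer.step():
# scheduler.step()
```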
I'm going to need to consider other sources of memory leakage.
If the results backend is being flushed (twice), then how come Flower is able to print the results out?
The fixed seed has been confirmed.
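A minimal sketch of one way to pin all the relevant RNGs for a run like this; the seed value is arbitrary and the exact set of calls is an assumption, not necessarily what this repo does.

```python
# Hedged sketch: seed the standard RNG sources for reproducibility.
import random
import numpy as np
import torch

SEED = 42  # arbitrary; not the repo's actual seed

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)  # no-op if no GPU is present
```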
The memory leak appears to be solved.
Batch size appears to be the critical bit here: blue is the 1024 run, and its slope has converged to a higher value.
Increasing the batch size has definitely improved things.
The new run is using a mini-batch size of 4×1024; it should take quite a while to get results out, though. (See the accumulation sketch below.)
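A minimal sketch of how a 4×1024 effective batch could be assembled with gradient accumulation, assuming a standard PyTorch loop; `model`, `optimizer`, `loss_fn`, and `loader` below are placeholders for the real pipeline.

```python
# Hedged sketch: effective batch of 4 x 1024 via gradient accumulation.
import torch
from torch import nn

# Placeholders standing in for the real pipeline:
model = nn.Linear(256, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()
loader = [(torch.randn(1024, 256), torch.randint(0, 2, (1024,))) for _ in range(8)]

ACCUM_STEPS = 4  # 4 micro-batches of 1024 -> effective batch of 4096

optimizer.zero_grad()
for i, (x, y) in enumerate(loader):  # loader yields micro-batches of 1024
    loss = loss_fn(model(x), y) / ACCUM_STEPS  # scale so gradients average correctly
    loss.backward()
    if (i + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()
```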
GPU usage is peaking at 70%, with no leakage over the 6h run.
CPU memory percentage shows no leakage either.
Something fishy is going on with disk usage, though.
I reckon it must be the .npy files somehow.
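One quick way to confirm that suspicion would be to tally the .npy files on disk; a hedged sketch below, where the data directory path is a placeholder.

```python
# Hedged sketch: measure how much disk the .npy files are actually consuming.
from pathlib import Path

data_dir = Path("data")  # placeholder for wherever the .npy files land
npy_files = sorted(data_dir.rglob("*.npy"))
total_bytes = sum(f.stat().st_size for f in npy_files)
print(f"{len(npy_files)} .npy files, {total_bytes / 1e9:.2f} GB total")
```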
I'm going to choose (256 dim, 16 heads, 6 blocks, 50 ctx) as my base case. It fits in the 16 GB with room for a large batch size.
The convergence is slow, but it is there and it is stable.
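For clarity, a hedged sketch of that base case written out as a config; the field names are illustrative assumptions, not the repo's actual config object.

```python
# Hedged sketch: the (256 dim, 16 heads, 6 blocks, 50 ctx) base case as a config.
from dataclasses import dataclass

@dataclass
class ModelConfig:
    d_model: int = 256   # embedding dimension
    n_heads: int = 16    # attention heads (head dim = 256 / 16 = 16)
    n_blocks: int = 6    # transformer blocks
    ctx_len: int = 50    # context length

base_case = ModelConfig()
```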
There are several paths for optimization:
https://arxiv.org/pdf/2302.13971.pdf (LLaMA)
Will get the pipeline working and merge this right afterwards; the PR is way too big right now.
e4bd5de