cobraheleah opened this issue 8 months ago
The short answer is: the individual layer alphas give some estimate of how well each layer has converged, but some layers may be converging faster or slower than others (or even backtracking), causing the average alpha to go up. So one has to take the average in a robust way, and the tool right now does something very simple.

If I can see more of your data and understand your model better, I can address this.
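A minimal sketch of what that looks like in practice, assuming `model` is an already-loaded PyTorch (or Keras) model, and assuming the `alpha` and `layer_id` columns that weightwatcher's details DataFrame normally exposes:

```python
import weightwatcher as ww

# one row per analyzed layer, with the fitted power-law exponent in "alpha"
watcher = ww.WeightWatcher(model=model)   # `model` assumed already loaded
details = watcher.analyze()

alphas = details["alpha"].dropna()
print(details[["layer_id", "alpha"]])     # inspect convergence layer by layer
print("mean alpha:  ", alphas.mean())     # the simple average
print("median alpha:", alphas.median())   # more robust to backtracking layers
```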
Thank you very much for your response. Here is the basic situation of the model training: the model is an LLM with roughly 60B parameters, and its architecture is similar to Llama 2. The data comes from public datasets. I have three questions to ask:
1) I suspect Llama is too big for your data set. In fact, we think that Llama itself is not well sized for its data... see this comparison with Falcon: https://weightwatcher.ai/leaderboard.html
Both Falcon and Mistral show much better quality scores than Llama.
It appears that as some of the Llama layers are learning information, others are just becoming more random. There are a few things you can do:
1a) Check for Dragon Kings: https://calculatedcontent.com/2024/01/29/evaluating-llms-with-weightwatcher-part-iii-the-magic-of-mistral-a-story-of-dragon-kings/
This takes a little more compute, but it can sometimes detect alphas that are unusually large.

1b) See also the ShortGPT study on pruning models: https://twitter.com/_akhaliq/status/1765607379264024693

1c) Don't include any layer with alpha > 8 in your average.

2) Use robust statistics... that is, compute the median alpha and/or throw away outliers (see the sketch after this list).

2a) I can add a method to weightwatcher to do this.
2b) Also, if you are training models from scratch, then the spectral density of the data Jacobian / gradients is also useful to observe. But this is very expensive to compute, and usually cost-prohibitive.
3) The alpha range comes from the JMLR theory paper https://jmlr.org/papers/v22/20-410.html
The theory holds in an infinite-size limit, so the values should not depend on the size and shape of W. But in practice they do depend on these a little bit, so I usually say 6 is a good upper limit in general.
4) If you are training Llama from scratch, you might want to try the GaLore optimizer:
Memory-Efficient LLM Training by Gradient Low-Rank Projection: https://lnkd.in/g2P_HTPE
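A sketch of the robust-averaging advice in 1c) and 2) above; this is not a built-in weightwatcher method yet (see 2a), and the alpha > 8 cutoff is the one suggested above:

```python
import weightwatcher as ww

watcher = ww.WeightWatcher(model=model)        # `model` assumed already loaded
details = watcher.analyze()

good = details[details["alpha"] <= 8]          # 1c) drop non-converged layers
suspect = details[details["alpha"] > 8]        # unusually large alphas

print("filtered mean alpha:", good["alpha"].mean())
print("median alpha:       ", details["alpha"].median())  # 2) robust statistic
# layers with very large alpha are worth checking for Dragon Kings (1a)
# and are natural candidates for ShortGPT-style pruning (1b)
print(suspect[["layer_id", "alpha"]])
```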
Thanks for your reply. There are still some remaining issues that I would like to ask about.
For the first question, I don't think it is due to insufficient data. As you suggested, I excluded every layer with alpha > 8 from the average, and the phenomenon still exists: the alpha value rapidly decreases in the early stage and slowly increases in the middle and later stages. On the other hand, other open-source models like Baichuan2-7B, which published intermediate checkpoints, also exhibit this phenomenon of alpha first decreasing and then increasing. Our smaller models (6B, 10B) exhibit it as well. I wonder if anyone has used the alpha value to monitor the training process of a model, rather than just comparing final models, and has seen a similar phenomenon. If so, could you please provide more information?
In the comparison of Llama to Falcon at https://weightwatcher.ai/leaderboard.html, Falcon is labeled well-sized and Llama widely over-parameterized, yet the alpha value of Falcon-40b-instruct is higher than that of Llama-65b. Does this mean we shouldn't just compare the average alpha value, and that a better approach would be to compare the distribution of alpha values? By the way, how is the alpha value of a model calculated there? Is it computed as an average, and are outliers removed?
> I don't think it is due to insufficient data. ... Other open-source models like Baichuan2-7B which published the intermediate ckpt also exhibit this phenomenon that the alpha value first decreases and then increases.
Like Llama, the Baichuan2-7B model is thought to have underfit / redundant layers.
See the recent ShortGPT paper
https://arxiv.org/abs/2403.03853
Because these layers are not converging and/or are redundant, the model is not well sized, and it's possible that other layers are 'soaking up' the correlations, causing their layer alphas to be smaller than expected.
The HTSR theory was developed and presented as a late-stage theory, where it was basically argued that the layers in the NN become PL (power-law) near convergence.
https://jmlr.org/papers/v22/20-410.html
"Depending on [blah blah blah], additional training can lead to a Heavy-Tailed model"
If you are going to apply weightwatcher early in training, you need to check a few things, because it is quite possible that the fits early in training are simply spurious, since the layer is so far from "convergence" (or just never becomes heavy-tailed).
You can fit any data set and get an alpha, so in addition to computing alpha, you have to check that (see the sketch after this list):
- the tail is large enough to get a reliable fit
- the quality of the fit (D) is good
- the PL fit is stable
- the ESD is unimodal, heavy-tailed, and sufficiently different from random
- there are no Correlation Traps, which can cause spuriously small alphas
- there is no rank collapse, which can also cause spuriously small alphas
- the layer alphas for the model correlate well with other metrics, such as the rand_distance, spectral norm, and distance from init (see this blog post: https://calculatedcontent.com/2021/10/17/fantastic-measures-of-generalization-that-actually-work-part-1/)
- the eigenvectors of the tail have lower entropy than the bulk
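A minimal sketch of automating a few of these checks, assuming `watcher.analyze(randomize=True)` populates the usual `alpha`, `D` (quality of the PL fit), and `rand_distance` (distance from a randomized baseline) columns; the cutoffs below are illustrative, not official defaults:

```python
import weightwatcher as ww

watcher = ww.WeightWatcher(model=model)      # `model` assumed already loaded
details = watcher.analyze(randomize=True)    # randomize=True adds rand_distance

ok = details[
    (details["D"] < 0.1)                               # good quality of fit
    & (details["alpha"] > 2) & (details["alpha"] < 8)  # plausible heavy-tailed range
    & (details["rand_distance"] > 0.1)                 # sufficiently non-random ESD
]
print(f"{len(ok)}/{len(details)} layers pass these basic sanity checks")
print("median alpha over trustworthy layers:", ok["alpha"].median())
```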
> the alpha value of 40b-instruct Falcon is higher than 65b Llama

The quality of the alpha fit is frequently a more reliable metric than the value of alpha itself.
> a better approach would be to compare the distribution of alpha values?

weightwatcher is a diagnostic tool for analyzing how models converge, layer by layer. But the theory is only exact on single-layer models (i.e., it works perfectly on the original double descent problem, is well understood on small MLPs, etc.).
I developed the tool to study how individual layers converge, the correlation flow, how layers inter-correlate with each other, etc., but we don't fully understand how all these interactions affect convergence.

I'm happy to collaborate on this.
I use the alpha value to monitor the training of the model, but I found that as training progresses, the alpha value rapidly decreases in the early stage and then slowly increases in the middle and later stages. The specific data is as follows:
| training step | alpha value |
|---|---|
| 1k | 28.94 |
| 2k | 15 |
| 3k | 7.1 |
| 5k | 4.4 |
| 6k | 4.2 |
| 100k | 4.7 |
| 200k | 4.9 |
| 300k | 5.0 |
| 400k | 5.1 |
| 500k | 5.2 |
However, throughout the entire training process, the evaluation results of the model (e.g., MMLU, BBH) kept improving. Is it normal for the alpha value to slowly increase in the middle and later stages of training? It seems that the alpha value is not strictly correlated with the evaluation performance of the model.
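For reference, a sketch of how one could log a robust per-checkpoint alpha to produce a table like the one above; the checkpoint paths are hypothetical, and it assumes checkpoints are saved in Hugging Face format:

```python
import weightwatcher as ww
from transformers import AutoModelForCausalLM

ckpt_dirs = ["ckpt-1000", "ckpt-2000", "ckpt-500000"]   # hypothetical paths

for path in ckpt_dirs:
    model = AutoModelForCausalLM.from_pretrained(path)
    details = ww.WeightWatcher(model=model).analyze()
    alphas = details["alpha"].dropna()
    # the median is less sensitive than the mean to layers that backtrack
    print(f"{path}: median alpha = {alphas.median():.2f}, "
          f"mean alpha = {alphas.mean():.2f}")
```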