cobraheleah opened this issue 8 months ago
The short answer is: the individual layer alphas give some estimate of how well each layer has converged, but some layers may be converging faster or slower than others (or even backtracking), causing the average alpha to go up. So one has to take the average in a robust way, and the tool right now does something very simple.

If I can see more of your data and understand your model better, I can address this.
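A minimal sketch of what that looks like in practice, assuming `model` is an already-loaded PyTorch (or Keras) model, and assuming the `alpha` and `layer_id` columns that weightwatcher's details DataFrame normally exposes:

```python
import weightwatcher as ww

# one row per analyzed layer, with the fitted power-law exponent in "alpha"
watcher = ww.WeightWatcher(model=model)   # `model` assumed already loaded
details = watcher.analyze()

alphas = details["alpha"].dropna()
print(details[["layer_id", "alpha"]])     # inspect convergence layer by layer
print("mean alpha:  ", alphas.mean())     # the simple average
print("median alpha:", alphas.median())   # more robust to backtracking layers
```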
Thank you very much for your response. Here is the basic situation of the model training: the model is an LLM with roughly 60B parameters, and its architecture is similar to Llama 2. The data comes from public datasets. I have three questions to ask:
1) I suspect Llama is too big for your data set. In fact, we think that Llama itself is not well sized for its data... see this comparison with Falcon: https://weightwatcher.ai/leaderboard.html
Both Falcon and Mistral show much better quality scores than Llama.
It appears that as some of the Llama layers are learning information, others are just becoming more random. There are a few things you can do:
1a) Check for Dragon Kings: https://calculatedcontent.com/2024/01/29/evaluating-llms-with-weightwatcher-part-iii-the-magic-of-mistral-a-story-of-dragon-kings/
This takes a little more compute, but it can sometimes detect alphas that are unusually large.

1b) See also the ShortGPT study on pruning models: https://twitter.com/_akhaliq/status/1765607379264024693

1c) Don't include any layer with alpha > 8 in your average.

2) Use robust statistics... that is, compute the median alpha and/or throw away outliers (see the sketch after this list).

2a) I can add a method to weightwatcher to do this.
2b) Also, if you are training models from scratch, then the spectral density of the data Jacobian / gradients is also useful to observe. But this is very expensive to compute, and usually cost-prohibitive.
3) The alpha range comes from the JMLR theory paper https://jmlr.org/papers/v22/20-410.html
The theory holds in an infinite-size limit, so the values should not depend on the size and shape of W. But in practice they do depend on these a little bit, so I usually say 6 is a good upper limit in general.
4) If you are training Llama from scratch, you might want to try the GaLore optimizer:
Memory-Efficient LLM Training by Gradient Low-Rank Projection: https://lnkd.in/g2P_HTPE
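A sketch of the robust-averaging advice in 1c) and 2) above; this is not a built-in weightwatcher method yet (see 2a), and the alpha > 8 cutoff is the one suggested above:

```python
import weightwatcher as ww

watcher = ww.WeightWatcher(model=model)        # `model` assumed already loaded
details = watcher.analyze()

good = details[details["alpha"] <= 8]          # 1c) drop non-converged layers
suspect = details[details["alpha"] > 8]        # unusually large alphas

print("filtered mean alpha:", good["alpha"].mean())
print("median alpha:       ", details["alpha"].median())  # 2) robust statistic
# layers with very large alpha are worth checking for Dragon Kings (1a)
# and are natural candidates for ShortGPT-style pruning (1b)
print(suspect[["layer_id", "alpha"]])
```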
Thanks for your reply. There are still some remaining issues that I would like to ask about.
For the first question, I don't think it is due to insufficient data. As you suggested, I excluded every layer with alpha > 8 from the average, and the phenomenon still exists: the alpha value rapidly decreases in the early stage and slowly increases in the middle and later stages. On the other hand, other open-source models like Baichuan2-7B, which published intermediate checkpoints, also exhibit this phenomenon of alpha first decreasing and then increasing. Our smaller models (6B, 10B) exhibit it as well. I wonder if anyone has used the alpha value to monitor the training process of a model, rather than just comparing final models, and has seen a similar phenomenon. If so, could you please provide more information?
In the comparison of Llama to Falcon at https://weightwatcher.ai/leaderboard.html, Falcon is labeled well-sized and Llama widely over-parameterized, yet the alpha value of Falcon-40b-instruct is higher than that of Llama-65b. Does this mean we shouldn't just compare the average alpha value, and that a better approach would be to compare the distribution of alpha values? By the way, how is the alpha value of a model calculated there? Is it computed as an average, and are outliers removed?
> I don't think it is due to insufficient data. ... Other open-source models like Baichuan2-7B which published the intermediate ckpt also exhibit this phenomenon that the alpha value first decreases and then increases.
Like Llama, the Baichuan2-7B model is thought to have underfit / redundant layers.
See the recent ShortGPT paper
https://arxiv.org/abs/2403.03853
Because these layers are not converging and/or are redundant, the model is not well sized, and it's possible that other layers are 'soaking up' the correlations, causing their layer alphas to be smaller than expected.
The HTSR theory was developed and presented as a late-stage theory, where it was basically argued that the layers in the NN become PL (power-law) near convergence.
https://jmlr.org/papers/v22/20-410.html
"Depending on [blah blah blah], additional training can lead to a Heavy-Tailed model"
If you are going to apply weightwatcher early in training, you need to check a few things, because it is quite possible that the fits early in training are simply spurious, since the layer is so far from "convergence" (or just never becomes heavy-tailed).
You can fit any data set and get an alpha, so in addition to computing alpha, you have to check that (see the sketch after this list):
- the tail is large enough to get a reliable fit
- the quality of the fit (D) is good
- the PL fit is stable
- the ESD is unimodal, heavy-tailed, and sufficiently different from random
- there are no Correlation Traps, which can cause spuriously small alphas
- there is no rank collapse, which can also cause spuriously small alphas
- the layer alphas for the model correlate well with other metrics, such as the rand_distance, spectral norm, and distance from init (see this blog post: https://calculatedcontent.com/2021/10/17/fantastic-measures-of-generalization-that-actually-work-part-1/)
- the eigenvectors of the tail have lower entropy than the bulk
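A minimal sketch of automating a few of these checks, assuming `watcher.analyze(randomize=True)` populates the usual `alpha`, `D` (quality of the PL fit), and `rand_distance` (distance from a randomized baseline) columns; the cutoffs below are illustrative, not official defaults:

```python
import weightwatcher as ww

watcher = ww.WeightWatcher(model=model)      # `model` assumed already loaded
details = watcher.analyze(randomize=True)    # randomize=True adds rand_distance

ok = details[
    (details["D"] < 0.1)                               # good quality of fit
    & (details["alpha"] > 2) & (details["alpha"] < 8)  # plausible heavy-tailed range
    & (details["rand_distance"] > 0.1)                 # sufficiently non-random ESD
]
print(f"{len(ok)}/{len(details)} layers pass these basic sanity checks")
print("median alpha over trustworthy layers:", ok["alpha"].median())
```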
> the alpha value of 40b-instruct Falcon is higher than 65b Llama

The quality of the alpha fit is frequently a more reliable metric than the value of alpha itself.
> a better approach would be to compare the distribution of alpha values?

weightwatcher is a diagnostic tool for analyzing how models converge, layer by layer. But the theory is only exact on single-layer models (i.e., it works perfectly on the original double descent problem, is well understood on small MLPs, etc.).
I developed the tool to study how individual layers converge, the correlation flow, how layers inter-correlate with each other, etc., but we don't fully understand how all these interactions affect convergence.

I'm happy to collaborate on this.
I use the alpha value to monitor the training of the model, but I found that as training progresses, the alpha value rapidly decreases in the early stage and then slowly increases in the middle and later stages. The specific data is as follows:
| training step | alpha value |
|---|---|
| 1k | 28.94 |
| 2k | 15 |
| 3k | 7.1 |
| 5k | 4.4 |
| 6k | 4.2 |
| 100k | 4.7 |
| 200k | 4.9 |
| 300k | 5.0 |
| 400k | 5.1 |
| 500k | 5.2 |
However, throughout the entire training process, the evaluation results of the model (e.g., MMLU, BBH) kept improving. Is it normal for the alpha value to slowly increase in the middle and later stages of training? It seems that the alpha value is not strictly correlated with the evaluation performance of the model.
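For reference, a sketch of how one could log a robust per-checkpoint alpha to produce a table like the one above; the checkpoint paths are hypothetical, and it assumes checkpoints are saved in Hugging Face format:

```python
import weightwatcher as ww
from transformers import AutoModelForCausalLM

ckpt_dirs = ["ckpt-1000", "ckpt-2000", "ckpt-500000"]   # hypothetical paths

for path in ckpt_dirs:
    model = AutoModelForCausalLM.from_pretrained(path)
    details = ww.WeightWatcher(model=model).analyze()
    alphas = details["alpha"].dropna()
    # the median is less sensitive than the mean to layers that backtrack
    print(f"{path}: median alpha = {alphas.median():.2f}, "
          f"mean alpha = {alphas.mean():.2f}")
```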