CalculatedContent / WeightWatcher

The WeightWatcher tool for predicting the accuracy of Deep Neural Networks
Apache License 2.0

Relation of alpha with overfitting and underfitting #235

Open Anshul12256 opened 1 year ago

Anshul12256 commented 1 year ago

Hi everyone, I saw your video on YouTube (https://www.youtube.com/live/Tnafo6JVoJs?feature=share) describing the WeightWatcher tool and its intricacies. In that video, it was mentioned that a good power-law fit yields an alpha in [2, 6], and that layers of a model with a higher alpha are underfitting (the GPT vs. GPT-2 layer 146 reference). However, how alpha relates to overfitting is something I cannot wrap my head around, because when I ran the analyze command on my TensorFlow model and examined the resulting dataframe I noticed the following:

Any explanations are appreciated. Thanks a lot.
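
For reference, producing such a dataframe typically looks like the sketch below. The tiny model here is a stand-in for a real trained network, and the exact column names are assumptions based on recent weightwatcher releases (check `details.columns` on your installed version):

```python
import weightwatcher as ww
from tensorflow import keras

# A tiny stand-in model; in practice this would be your trained network.
model = keras.Sequential([
    keras.Input(shape=(64,)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10),
])

watcher = ww.WeightWatcher(model=model)
details = watcher.analyze()  # pandas DataFrame, one row per analyzed layer

# The fitted power-law exponent for each layer lives in the 'alpha' column.
print(details[['layer_id', 'name', 'alpha']])
```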

charlesmartin14 commented 1 year ago

The WeightWatcher theory states that layers with alpha < 2 may be overfit, so a warning is issued for these layers.
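
Concretely, those flagged layers can be read off the details dataframe; a minimal sketch, assuming `details` comes from `watcher.analyze()` as in the snippet above:

```python
# Layers whose fitted power-law exponent falls below 2 are the ones
# the tool warns about as potentially over-trained / overfit.
suspect = details[details['alpha'] < 2.0]
print(suspect[['layer_id', 'name', 'alpha']])
```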

The WeightWatcher theory is based on the statistical mechanics theory of generalization; this theory predicts that overfitting occurs when a neural network enters the spin-glass phase. This was well established in the theoretical physics community in the 80s and 90s, but is probably unknown to the current ML/AI community.

For example, here is a paper from Physics Today (1988): https://physicstoday.scitation.org/doi/10.1063/1.881142

The WeightWatcher theory is outlined in this blog post: https://calculatedcontent.com/2019/12/03/towards-a-new-theory-of-learning-statistical-mechanics-of-deep-neural-networks/

The state alpha < 2 is thought to be associated with a spin-glass-like phase of the model, because in this case the weight matrix appears to be atypical (i.e., it does not have a true mean value).
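
To make the "no true mean" point concrete: for a power-law density ρ(λ) ~ λ^(-α), the mean integral diverges once α ≤ 2. A standalone numpy sketch (illustrative only, not part of weightwatcher):

```python
import numpy as np

rng = np.random.default_rng(0)

def running_mean_of_power_law(alpha, n=1_000_000):
    # numpy's pareto(a) has tail density ~ x**-(a+1), so the tail
    # exponent in the alpha convention above is alpha = a + 1.
    samples = 1.0 + rng.pareto(alpha - 1.0, size=n)
    return np.cumsum(samples) / np.arange(1, n + 1)

for alpha in (1.5, 3.0):
    rm = running_mean_of_power_law(alpha)
    # For alpha = 3.0 the running mean converges; for alpha = 1.5 it keeps
    # jumping whenever a huge sample arrives -- the mean does not exist.
    print(f"alpha={alpha}: mean after 1e4={rm[9_999]:.2f}, after 1e6={rm[-1]:.2f}")
```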

I say 'may' because it is difficult to get a highly accurate estimate of alpha for most small layers, and there may be some noise or error in the alpha estimator. The alpha error bar may be estimated with the sigma column; we continue to look for better ways to estimate this error bar.
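
One way to fold that uncertainty in (a sketch of my own, not an official weightwatcher recipe, assuming the `details` dataframe from above):

```python
# Treat alpha +/- sigma as a crude error band and only flag layers whose
# entire band sits below the alpha = 2 threshold.
band_below_2 = details['alpha'] + details['sigma'] < 2.0
print(details.loc[band_below_2, ['layer_id', 'name', 'alpha', 'sigma']])
```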

I also say 'may' because there are effects other than overfitting that can cause alpha to drop below 2, and it is sometimes difficult to identify them. For example, using a very large learning rate can also cause alpha to drop below 2; this may be a spurious effect, since large learning rates usually lead to bad generalization.

I encourage you to join our Discord channel to discuss https://discord.com/invite/uVVsEAcfyF