From previous review: A recent and promising approach due to Shwartz-Ziv and Tishby (2017) provides deeper insight into some of the above results by analyzing deep networks using information theory. They calculate how much information each layer preserves about the network's input and output variables using the Information Bottleneck framework. Their method shows that stochastic gradient descent, the standard optimization method for learning the network parameters, undergoes two separate phases during training. Early on (the "drift" phase), the variance of the weights' gradients is much smaller than the means of the gradients, indicating a high signal-to-noise ratio. Later in training (the "diffusion" phase), there is a rapid reversal: the variance of the weights' gradients becomes greater than their means, indicating a low signal-to-noise ratio. During this diffusion phase, fluctuations dominate the stochastic gradient descent updates and the error saturates. These results lead to a new interpretation of how stochastic gradient descent optimizes the network: compression by diffusion creates efficient internal representations. They also suggest that simpler stochastic diffusion algorithms could be used during the diffusion phase of training, reducing training time. Additionally, their results show that many different weight values can produce an optimally performing network, with implications for efforts to interpret single units. These results, along with explanations for the importance of network depth and the information bottleneck optimality of the layers, make Shwartz-Ziv and Tishby's work extremely promising for improving the transparency of deep learning, though their results so far are theoretical and their methods have yet to be extended to real-world scenarios involving large networks and complex data.
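For reference, the paper builds its Information-Plane picture by estimating I(X;T) and I(T;Y) for each layer's activations T after discretizing them into bins. Below is a minimal sketch of that kind of estimation, assuming bounded activations (e.g. tanh units), integer class labels, and a modest bin count; the function names and the 30-bin default are illustrative assumptions, not the authors' code.

```python
import numpy as np

def discretize(activations, n_bins=30):
    """Bin continuous layer activations (samples x units) into discrete levels.
    Assumes bounded activations, e.g. tanh outputs."""
    edges = np.linspace(activations.min(), activations.max(), n_bins + 1)
    return np.digitize(activations, edges)

def mutual_information(labels, binned_activations):
    """Estimate I(labels; T), treating each distinct binned activation pattern as one state of T."""
    # Collapse each row of binned activations to a single discrete symbol.
    t_states = np.unique(binned_activations, axis=0, return_inverse=True)[1]
    joint = np.zeros((labels.max() + 1, t_states.max() + 1))
    np.add.at(joint, (labels, t_states), 1)
    joint /= joint.sum()
    p_y = joint.sum(axis=1, keepdims=True)   # marginal over labels
    p_t = joint.sum(axis=0, keepdims=True)   # marginal over layer states
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (p_y @ p_t)[nz])))

# One Information-Plane point for a layer at a given epoch (hypothetical variable names):
# T = that layer's activations over the dataset, x_ids = discrete input identities, y = class labels.
#   I_XT = mutual_information(x_ids, discretize(T))
#   I_TY = mutual_information(y, discretize(T))
```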
Opening the Black Box of Deep Neural Networks via Information

Despite their great success, there is still no comprehensive theoretical understanding of learning with Deep Neural Networks (DNNs) or their inner organization. Previous work proposed to analyze DNNs in the Information Plane; i.e., the plane of the Mutual Information values that each layer preserves on the input and output variables. They suggested that the goal of the network is to optimize the Information Bottleneck (IB) tradeoff between compression and prediction, successively, for each layer.
In this work we follow up on this idea and demonstrate the effectiveness of the Information-Plane visualization of DNNs. Our main results are: (i) most of the training epochs in standard DL are spent on compression of the input to efficient representation and not on fitting the training labels. (ii) The representation compression phase begins when the training error becomes small and the Stochastic Gradient Descent (SGD) epochs change from a fast drift to smaller training error into a stochastic relaxation, or random diffusion, constrained by the training error value. (iii) The converged layers lie on or very close to the Information Bottleneck (IB) theoretical bound, and the maps from the input to any hidden layer and from this hidden layer to the output satisfy the IB self-consistent equations. This generalization through noise mechanism is unique to Deep Neural Networks and absent in one layer networks. (iv) The training time is dramatically reduced when adding more hidden layers. Thus the main advantage of the hidden layers is computational. This can be explained by the reduced relaxation time, as it scales super-linearly (exponentially for simple diffusion) with the information compression from the previous layer.
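The drift-to-diffusion transition in (ii) is diagnosed by tracking the signal-to-noise ratio of each layer's weight gradients over an epoch. Here is a minimal sketch of that measurement, assuming PyTorch and a list of mini-batch gradients collected after each backward pass; the helper name and the training-loop details are hypothetical, not taken from the paper's code.

```python
import torch

def epoch_gradient_snr(per_batch_grads):
    """SNR of one layer's weight gradients over an epoch: ||mean|| / ||std||,
    computed across the mini-batch gradients in per_batch_grads (a list of tensors)."""
    grads = torch.stack(per_batch_grads)  # shape: (num_batches, *weight_shape)
    mean = grads.mean(dim=0)
    std = grads.std(dim=0)
    return (mean.norm() / (std.norm() + 1e-12)).item()

# Hypothetical usage inside a standard training loop: after loss.backward(),
# record each layer's gradient for the current mini-batch, e.g.
#   collected[name].append(param.grad.detach().clone())
# and at the end of the epoch call epoch_gradient_snr(collected[name]).
# A high SNR indicates the drift phase; a sharp drop to SNR below 1 marks the
# onset of the diffusion (compression) phase described above.
```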
Bibtex:
@misc{1703.00810,
  Author = {Ravid Shwartz-Ziv and Naftali Tishby},
  Title = {Opening the Black Box of Deep Neural Networks via Information},
  Year = {2017},
  Eprint = {arXiv:1703.00810},
}