jamesallenevans opened 2 years ago
Question on "Dropout: A Simple Way to Prevent Neural Networks from Overfitting": I thoroughly enjoyed reading this article from the methods to how it was motivated from a theory of the role of sex in evolution (Livnat et al., 2010). I got a bit confused when the authors explained dropout as adding noise to the hidden units and how adding noise can be useful for unsupervised feature learning and supervised learning problems.
In the article "Understanding the role of individual units in a deep neural network" by Bau et al. (2020), the authors describe that, by activating units specialized for doors, the network can add doors onto images that contain buildings or other objects usually observed with doors, but it is difficult to add doors to images of things irrelevant to doors, such as images of sky. It seems that units in trained neural networks not only learn characteristics of certain objects but also learn correlations between objects? How could this happen, and how could we get rid of it in practice?
The Dropout article was very interesting, particularly in its use of metaphors to describe the approach. This is definitely a lesson I'll take away: including a powerful, simple, and accurate metaphor allows for greater interpretability and understanding of a complicated method. At the end, they discuss how this approach is more accurate across applications, but that training can take 2-3x longer. I'm generally curious about how we can improve not only the accuracy but also the running time of our deep learning methods. I ran into that problem in the homework, too. Is there a way to parallelize these methods effectively? Do we just need to think carefully about our approach first to make sure we avoid unnecessary refitting?
On "Graph Structure of Neural Networks", what is the intuitive interpretation of the finding that "neural network’s performance is approximately a smooth function of the clustering coefficient and average path length of its relational graph"? For example, 'clustering coefficient' measures the degree to which nodes in a graph tend to cluster together - does this mean we prefer to have a few kinds of dense message exchanges vs. many sparse exchanges, that we would prefer the form of "echo-chambers" among neurons?
In "Understanding the role of individual units in a deep neural network” by Bau et al. (2020), they state that "understanding the roles of units within a network allows us to create a human interface for controlling the network via direct manipulation of its units" (pg. 30077). I had more of a philosophical question, that I'd like anybody's insight on:
If we are able to take a peek inside the so-called "black box" of deep neural networks, does this change our understanding of how to define these methods?
My understanding of deep learning has always centered on the idea of simulating or imitating reality, seeing how far from human input (i.e., how deep) the layers can go to truly understand phenomena through a lens simply not possible for the human eye. This still seems to apply here, but does the fact that it allows for human manipulation of the model change the fact that reality is being simulated? Couldn't this possibly increase the ability for models to be biased, because they can be manipulated toward certain results?
In the Lottery Ticket Hypothesis, the authors find that dense, randomly-initialized, feed-forward networks contain subnetworks (described as winning tickets) that can be pulled out of a complex model and trained in isolation (while still matching the accuracy of the more complex model). The authors are essentially arguing that models are 'dramatically overparameterized' and can be pruned significantly while still preserving accuracy. This means that networks naturally learn simpler representations than indicated by the level of parameterization.
Does this mean that the parameters that are pruned (i.e., the ones that make up this 'overparameterization') would converge to a weight of 0 if kept in the model? And what does this mean for their weight initialization? For my own understanding of how parameter weights work: will a less predictive weight always approach zero (just taking longer if initialized to a large value), or does the final value of the weight depend on its initialization?
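On the first question: in the paper's procedure the pruned weights are forced to zero by a mask rather than converging there on their own. Here is a hedged sketch of one pruning round, assuming a toy PyTorch layer and a one-shot 80% magnitude prune (the sizes and names are mine):

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(100, 10)
init_state = copy.deepcopy(model.state_dict())   # remember the random initialization

# ... train `model` here ...

with torch.no_grad():
    w = model.weight
    k = int(0.8 * w.numel())                         # prune the smallest 80% by magnitude
    threshold = w.abs().flatten().kthvalue(k).values
    mask = (w.abs() > threshold).float()             # 1 = keep, 0 = prune

    model.load_state_dict(init_state)                # rewind survivors to their init values
    model.weight.mul_(mask)                          # zero out the pruned weights

# During retraining the mask is re-applied after every update, so pruned weights
# stay at exactly zero; they are held there, not left to drift toward zero.
```

The paper's central finding is that the rewinding step matters: randomly re-initializing the surviving weights hurts the subnetwork, so the eventual role of a weight does depend on its initialization.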
"Understanding the role of individual units in a deep neural network” talks about breaking down and understanding units of neural nets differently. I was interested to know how this extraction of features and meanings varies across media such as text, audio etc. ?
Bau et al. in "Understanding the role of individual units in a deep neural network" indicate that context is necessary for the CNN and the GAN to correctly classify images, and they show that by removing or adding context-specific visual cues, they can change the models' classification of an image. I wonder how much out-of-context data you would need to add to an image for the classification to switch from its original class to that of the out-of-context data. Say, if you were to clip a picture of a plane onto a dress, collage style, how much plane-related data must you add to the image to change the classification from 'dress' to 'plane'?
In "Understanding the role of individual units in a deep neural network”, the article introduced how we can interpret the neural network and how each neuron works in the classification task. My question is, whether the ways that the model classifies a certain image are different from the ways that human classifies a certain image. If they are different, how do we explain such a phenomenon?Can we say that such differences indicate the intellegence of machine?
Question on the Dropout paper: there are other potential techniques for avoiding overfitting, too. As an example, we could divide the data into multiple samples and use cross-validation. I was wondering what the advantages of dropout are, and when it is the optimal solution compared to other potential techniques.
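For contrast, a quick scikit-learn sketch of the cross-validation alternative mentioned above (the dataset and model are arbitrary choices of mine): cross-validation estimates how well a model generalizes by refitting it on different folds, but unlike dropout it does not regularize any single fitted model, and the repeated refits add their own cost.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

# Five refits on different train/validation splits: a generalization estimate,
# not a change to how any individual model is trained.
scores = cross_val_score(LogisticRegression(max_iter=2000), X, y, cv=5)
print(scores.mean(), scores.std())
```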
I am interested in the adversarial attack issue mentioned in "Understanding the role of individual units in a deep neural network". I have two questions here. First, if we can understand every unit of a DNN model, does that mean we can understand the model as a whole? Second, the paper applies its framework to find the units that cause the adversarial-attack problem, so how can we alter the DNN model to precisely change those specific units?
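On the second question, one common mechanical trick (a sketch under my own assumptions, not the paper's code) is to register a forward hook that zeroes or edits the chosen channels, leaving the rest of the network untouched:

```python
import torch
import torch.nn as nn

# A toy CNN standing in for the classifier; the layer and unit indices below
# are hypothetical placeholders.
net = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10),
)
net.eval()

units_to_ablate = [3, 17, 42]          # channels of the second conv layer

def ablate(module, inputs, output):
    output[:, units_to_ablate] = 0.0   # silence exactly those units
    return output

x = torch.randn(1, 3, 32, 32)
with torch.no_grad():
    baseline = net(x)
    handle = net[2].register_forward_hook(ablate)
    ablated = net(x)
    handle.remove()

print((baseline - ablated).abs().max())  # how much those units changed the output
```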
The conclusion of the "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" article mentions the trade-off between overfitting and training time. What would be ways or heuristics for a researcher to decide on an appropriate method? What information is required for the researcher's judgment in this case?
I think all four papers of Week 2 are very interesting. The one that interests me most is "Understanding the role of individual units in a deep neural network". It is very useful to identify important units in deep learning models, though the interaction between important units is also crucial. Building on this work, I am wondering how we can further investigate the communication between units in the network?
Is it possible to use intuition to decide how many dropout layers to use, where to place them (in which layers), and what dropout rates to set, based on the model architecture, the task at hand, or the dataset? Is the improvement we get from using dropout layers related to the depth or dimensions of the network, or the amount of training data we have?
Was wondering about this too! In addition, when should we elect to use dropout layers in place of other regularization techniques?
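In practice the paper's experiments suggest rules of thumb rather than strict derivations. A minimal sketch of the usual placement, assuming a PyTorch MLP on 28x28 inputs (the layer sizes are mine): a small rate near the input, roughly p = 0.5 between wide fully-connected layers, and none on the output layer.

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),
    nn.Dropout(p=0.2),            # light dropout on the input side
    nn.Linear(784, 1024), nn.ReLU(),
    nn.Dropout(p=0.5),            # the commonly cited default between wide hidden layers
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(1024, 10),          # no dropout on the output layer
)
```

On the second question, the paper's experiments on dataset size suggest the gain depends on how much data you have relative to the network's capacity: with very small or very large datasets the benefit shrinks.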
In "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks", the winning tickets (subnetworks extracted from dense, feed-forward, randomly-initialized networks and trained in isolation) reach comparable or higher accuracy than the original network and train faster. The authors then list the limitations of this approach: only smaller, vision-centric datasets are investigated, and finding winning tickets is computationally expensive.
I do wonder, with these pros and so few cons, whether the winning-ticket approach is becoming the go-to method for deep learning? And is the authors' discussion of its limitations comprehensive?
I have a few questions about "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". First, does the cost function change across iterations, since after every iteration some of the nodes are dropped out? Is this a problem we need to care about? Also, which kinds of inputs are more suitable for dropout when we are building the model? Are there cases where we shouldn't bother using this method?
In "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", the authors propose a method that may increase the learning ability of networks by adding randomness to the hidden layers. The idea of using Bernoulli random variables to decide which units are dropped is particularly interesting. But I wonder how this type of model would work for an input that contains not just one category of data but a combination of several, such as an input that includes both network and speech data? Intuitively, these are two types of data that vary greatly in how they are organized at the very beginning.
The paper "Understanding the role of individual units in a deep neural network" truly contributes to the human interpretation of CNNs and GANs on image inputs. The results are also promising, successfully identifying the parts the models consider important. The semantic painting with the GAN shows that the model may have a similar 'understanding' of object features as humans. I am really curious about the adversarial case presented in the last part. I wonder if we could further train the model on the original and noisy pictures of the ski resort and bedroom, and inspect whether the key units change.
The papers "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" and "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks" are both talking about how to apply randomness to model training so that we can prevent overfitting. I also heard that when we are doing fine-tuning of hyper-parameters, a random search would usually have a better performance than doing a sparse grid search. It seems that we rely on randomness to have a better result, does this benefit comes from being closer to the real world? Or does it come from some good luck?
"Graph Structure of Neural Networks" introduces the concept of graph representation for understanding the neural networks. I wonder how this idea can be applied to social science question.
The paper "Graph Structure of Neural Networks" is interesting in that it intuitively views a neural network as a relational graph, especially with measurements from network science such as average path length. My question is: can the graph representation be used to evaluate predictive power on a specific question, and can the graph be used to explain the relationship between features and the prediction target?
In the Bau et al. piece, I noticed that "head" is classified as "part" of a person, while things like "bed" are classified as an "object", instead of, say, part of a room. Is part vs object a standard distinction in image recognition papers? How should we understand the relationship between part and whole in image classification/generation?
In the paper "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", it is very novel to learn the idea of approximating the effect of averaging the predictions of all the "thinned" networks by simply using a single unthinned network with smaller weights. Since the prior methods of model combination are computationally expensive, the easy approximation of using the single neural net at test time with its weights scaled by p, the probability at which a unit is retained during training, looks more plausible. I am curious about the process of choosing p: how much difference is there between tuning it on a validation set and simply setting it to 0.5? And how should we think about the choice of p for different categories of tasks?
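On why scaling by p stands in for averaging the thinned networks, here is a small numerical check for a single linear unit (the sizes and seed are arbitrary choices of mine): the Monte-Carlo average over many dropout masks lands on essentially the same value as one pass with the weights scaled by p.

```python
import numpy as np

rng = np.random.default_rng(0)

p = 0.5                                   # probability that a unit is retained
x = rng.normal(size=1000)                 # activations feeding one linear unit
w = rng.normal(size=1000)                 # that unit's incoming weights

masks = rng.binomial(1, p, size=(10_000, x.size))
monte_carlo = np.mean(masks @ (w * x))    # average output over 10,000 thinned nets
scaled = p * np.dot(w, x)                 # single pass with weights scaled by p

print(monte_carlo, scaled)                # the two values should be close
```

For nonlinear units this equivalence is only approximate, which is part of why the paper treats weight scaling as an approximation to model averaging rather than an exact equivalence.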
In "Graph Structure of Neural Networks", the authors combines the concept from social network analysis and deep learning neural networks, leading to a new perspective to investigate the relationship between networks. My question is that how such structure could be used to predict the dynamic relationship, a time-variant relationship? is that feasible?
There has been some literature studying dropout as an alternative explanation of regularization with ensembles of thinned networks, and the paper alludes to this as well. But I see several differences: 1) the models in a bagging ensemble are independent of each other, whereas the parameters are shared in the dropout technique; 2) each individual model in an ensemble converges by learning on its own dataset, unlike the dropout method, where the parameters are shared. With all this, is it still fair to draw parallels between ensemble learning and dropout regularization?
Pose a question about one of the following possible readings: "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". 2014. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov. Journal of Machine Learning Research 15:1929-1958; "Understanding the role of individual units in a deep neural network". 2020. D. Bau, J.-Y. Zhu, H. Strobelt. PNAS 117(48):30071; "Graph Structure of Neural Networks". 2020. J. You, J. Leskovec, K. He, S. Xie. ICML, PMLR 119:10881-10891; "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks". 2019. J. Frankle & M. Carbin. ICLR.