Open bhargavvader opened 3 years ago
The chapter presents a variety of activation and loss functions and discusses some high-level characteristics that differentiate them from each other. Can you provide more detail on what motivates the decision to use certain activation and/or loss functions over others? Is it theory about the true data-generating function, trial and error across the options, priors about the "best" option (ReLU?), the type of data on hand, all of the above, etc.?
If there is time for a second question: can you clarify what value is on the x-axis in the activation function images (p. 27-30)?
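For concreteness, my current understanding is that the x-axis is the pre-activation value z = w·x + b that gets fed into the activation, and the y-axis is the activation's output. A minimal numpy sketch of what I mean (this is my own illustration, not taken from the book):

```python
import numpy as np

# My assumption: the x-axis is the pre-activation z = w.x + b that is
# fed into the activation; the y-axis is the activation's output.
z = np.linspace(-5.0, 5.0, 11)

relu    = np.maximum(0.0, z)        # max(0, z)
sigmoid = 1.0 / (1.0 + np.exp(-z))  # squashes z into (0, 1)
tanh    = np.tanh(z)                # squashes z into (-1, 1)

for zi, r, s, t in zip(z, relu, sigmoid, tanh):
    print(f"z={zi:+.1f}  relu={r:.2f}  sigmoid={s:.2f}  tanh={t:.2f}")
```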
Thank you!
As you mentioned in Chapter 1, in deep learning, inputs are converted to outputs in an unknown way. We usually call this the "black box".
However, in the first week of reading, Terrence writes in his 2020 paper that a mathematical theory of deep learning would shed light on how these networks work, allow us to evaluate the strengths and weaknesses of different network architectures, and lead to significant improvements.
So I'm really curious how cutting-edge research on the mathematical theory of deep learning is progressing. Have there been any major breakthroughs? Many thanks!
I talked with a few people I know who are working with artificial neural networks, and they told me that, generally, activation functions and loss functions aren't particularly important and you just pick them arbitrarily. Is this true, or do they have more specific use cases than "I'm doing classification" or "I'm doing regression"?
I have two questions to share with everyone:
I would like to ditto the questions about how we can better select the activation functions, number of units per layer, etc., using empirical/mathematical principles and theories. I think this might be the most practical lesson we can get out of the class!
Also, I would like to know whether there are cases where we would prefer multiple sigmoid activations over a softmax activation on the output layer for multi-class classification problems. In addition, is there any case where performing multi-class classification and then aggregating the classes into a binary classification actually makes the model better? (e.g., classify an image as "white bear", "brown bear", "white wolf", or "brown wolf", then combine these into a "bear" vs. "wolf" classification)
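To make the contrast concrete, here is a toy numpy sketch (the logits are made up by me, not from the book): a softmax forces the class probabilities to sum to 1, so it fits mutually exclusive classes, while independent sigmoids do not, which is why sigmoids are usually preferred when classes can co-occur (multi-label). Collapsing the four animal classes to "bear" vs. "wolf" is then just a matter of summing the softmax probabilities:

```python
import numpy as np

def softmax(logits):
    # subtract max for numerical stability
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def sigmoid(logits):
    return 1.0 / (1.0 + np.exp(-logits))

# Hypothetical output-layer logits for the four fine-grained classes
classes = ["white bear", "brown bear", "white wolf", "brown wolf"]
logits = np.array([2.0, 1.0, 0.5, -1.0])

p_soft = softmax(logits)  # sums to 1: mutually exclusive classes
p_sig  = sigmoid(logits)  # independent per-class probabilities

print(dict(zip(classes, p_soft.round(3))))
print("sum of softmax: ", p_soft.sum())  # always 1
print("sum of sigmoids:", p_sig.sum())   # generally != 1

# Collapsing multi-class to binary: P(bear) = P(white bear) + P(brown bear)
p_bear = p_soft[0] + p_soft[1]
p_wolf = p_soft[2] + p_soft[3]
print(f"P(bear)={p_bear:.3f}  P(wolf)={p_wolf:.3f}")
```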
In Chapter 1 of the book, you introduced several loss functions in regression and classification settings. Can you tell us more about how to choose among these loss functions to identify the optimal one depending on the data/context/objective?
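One concrete criterion I've seen for the regression case (my own toy numpy illustration, not from the chapter): squared-error losses penalize large errors quadratically, so a single outlier can dominate the fit, while absolute-error losses degrade more gracefully:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 2.1, 2.9, 4.2])

def mse(t, p): return np.mean((t - p) ** 2)   # squared error
def mae(t, p): return np.mean(np.abs(t - p))  # absolute error

print("clean data:   mse=%.3f  mae=%.3f" % (mse(y_true, y_pred), mae(y_true, y_pred)))

# Corrupt one label with an outlier: MSE blows up, MAE degrades gracefully
y_out = y_true.copy()
y_out[3] = 40.0
print("with outlier: mse=%.3f  mae=%.3f" % (mse(y_out, y_pred), mae(y_out, y_pred)))
```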
I have two questions to share with all, one conceptual and one more practical:
Thanks a lot for this inspirational reading! I have one question about the 'deep' part of deep learning. Why do deep neural networks work better than shallow ones? One typical feature of modern artificial intelligence successes is that they employ very deep models instead of shallow, wide ones. Do you have any intuition for why these deep models beat their shallow, wide counterparts? I am also wondering whether it is because a deep neural net can 'create variables' in its initial layers while a shallow one cannot. How can we better understand the variables created by the neural nets?
Thanks!
Thanks for the insightful reading! I am interested in the relationship between embedding techniques and deep neural networks, since it is mentioned that "text, image, graph and network based embeddings" will be discussed. From my perspective, these two families of approaches could be used to measure how much information people receive and how people link that information to what they already know. The barriers may include:
I also read a paper (Baroni et al., 2014) in which LSA is suggested for contextual prediction. Inspired by this paper, I think other topic-modeling algorithms, such as LDA and LDA with hidden Markov models, may also be worth trying for exploring subtexts, combined with the word relationships given by WordNet.
I wonder what you think of this issue!
It's interesting to think about NN nodes playing an inhibitory role (where they adopt some activation function that contains negative output values, as opposed to the standard ReLU). Are there any classes of functions/real-world applications where these potentially inhibitory activation functions play a critical role in learning? Have any classes of functions been shown to be computationally very inefficient when estimating solely through ReLU?
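For concreteness, here is a small numpy sketch of what I mean by inhibitory outputs (my own illustration; the 0.1 leaky slope is arbitrary): ReLU clips every negative pre-activation to zero and has zero gradient there (the "dying ReLU" issue), whereas leaky ReLU and tanh pass negative, potentially inhibitory, signals through:

```python
import numpy as np

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])  # pre-activations

# ReLU clips negatives to 0, so a unit can never emit an inhibitory
# (negative) signal, and its (sub)gradient is 0 on the negative side.
relu      = np.maximum(0.0, z)
relu_grad = (z > 0).astype(float)

# Leaky ReLU (slope 0.1 chosen arbitrarily) lets scaled negative values
# and gradients through; tanh is symmetric around zero.
leaky_relu = np.where(z > 0, z, 0.1 * z)
leaky_grad = np.where(z > 0, 1.0, 0.1)
tanh_out   = np.tanh(z)

print("relu      ", relu, "grad", relu_grad)
print("leaky_relu", leaky_relu, "grad", leaky_grad)
print("tanh      ", tanh_out)
```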
I have been very interested in the Hopfield network and other non-deep network architectures for machine learning - yet deep learning is indubitably superior in performance. How can we explain the success of deep networks over these types of densely connected architectures? Is the abstraction of 'representation learning' by each successive 'layer' that crucial to the success of deep learning? Are there other paradigms we can think of on which to design neural networks as learning algorithms? Are there tasks non-deep architectures might conceivably perform better on?
It is stated that the link between modern neural networks and biological brains has weakened, but I am wondering to what extent and in which ways this is the case. While neural network models might not perform computations that are biologically realistic or feasible, there seem to be methods for creating neural networks that parse and process stimuli in a way similar to biological brains. Hierarchical convolutional neural networks have been shown to produce spiking histograms similar to the ventral visual stream in primates (https://pubmed.ncbi.nlm.nih.gov/24812127/), in which layers attending to some aspect of an image/object (location, orientation, shape, identity) are pooled to create a system that recognizes specific image qualities at increasing complexity. Recurrent neural networks have also shown activations that might be comparable to those of decision-making centers in biological brains.
I am also very curious about the carbon footprint of making large models, and how it is expected to change with time. I have learned that non-fungible tokens (a unit of data on a digital ledger called a blockchain) have shown very large carbon emissions as they are being more heavily utilized. Is the concern a similar one for these high-data deep learning models; is this concern a particularly significant or reasonable one?
I have some thoughts following @hesongrun's question. As can be perceived, a deep and wide neural network gives its parameters more freedom, and thus the model more flexibility to capture patterns embedded in the data. However, from the view of the canonical bias-variance trade-off, this also introduces the risk of overfitting. I was wondering how neural networks reduce overfitting in practice. I can tell that regularization is a commonly used adaptation of the loss function to mitigate overfitting, but how is it implemented for different loss functions? I am also curious about the differences among the loss functions per se. As I tried in the 2nd notebook of week 1, it seems that models with different numbers of layers lead to the same accuracy score on the test set when using a HingeEmbeddingLoss loss function. Since this seems weird, I tried to carefully review the code to make sure the result is real. Generally, since I don't have sufficient expertise in the mathematical foundations of neural nets, especially the loss function and back-propagation, I am just curious about the differences across loss functions and how they affect the risk of overfitting.
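To make my regularization question concrete: my understanding is that an L2 (weight-decay) penalty is simply added on top of whatever base loss you use, so it is implemented the same way for different loss functions. A toy numpy sketch of what I mean (all numbers are arbitrary, and my hinge-style loss is only an illustration, not the exact HingeEmbeddingLoss formula):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear model: predictions = X @ w (all numbers are made up)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = rng.normal(size=3)

def mse_loss(w):
    return np.mean((X @ w - y) ** 2)

def hinge_like_loss(w):
    # margin-style loss for illustration; NOT the exact
    # torch.nn.HingeEmbeddingLoss formula
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w)))

lam = 0.01  # regularization strength (arbitrary)

# L2 regularization is the same additive penalty regardless of base loss
for base in (mse_loss, hinge_like_loss):
    total = base(w) + lam * np.sum(w ** 2)
    print(base.__name__, "regularized:", round(total, 4))
```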
Post and upvote questions related to the introductory chapter for week 1/2 on Why Data Integration with Deep Learning?