UChicago-Thinking-Deep-Learning-Course / Readings-Responses


Why Neural Methods for Data Integration and Analysis? - Orientation #1

Open bhargavvader opened 3 years ago

jamesallenevans commented 3 years ago

Post and upvote questions related to the introductory chapter for week 1/2 on Why Data Integration with Deep Learning?

pcuppernull commented 3 years ago

The chapter presents a variety of activation and loss functions and discusses some high-level characteristics that differentiate them. Can you provide more detail on what motivates the decision to use certain activation and/or loss functions over others? Is it theory about the true data-generating function, trial and error across different options, priors about the "best" option (ReLU?), the type of data on hand, all of the above, etc.?

If there is time for a second question: can you clarify what value is on the x axis in the activation function images (pp. 27-30)?

Thank you!
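
For concreteness, here is a minimal sketch (assuming PyTorch, which the course notebooks appear to use) that evaluates a few of the activation functions discussed in the chapter over a range of scalar pre-activation inputs; that pre-activation value is what the x axis in such plots conventionally represents:

```python
import torch

# A range of pre-activation values z (the weighted sum of inputs plus bias),
# i.e. the quantity conventionally plotted on the x axis of activation plots.
z = torch.linspace(-5.0, 5.0, steps=11)

activations = {
    "ReLU": torch.relu,
    "Sigmoid": torch.sigmoid,
    "Tanh": torch.tanh,
    "Softplus": torch.nn.functional.softplus,
}

for name, fn in activations.items():
    values = [round(v, 2) for v in fn(z).tolist()]
    print(f"{name:>8}: {values}")
```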

Raychanan commented 3 years ago

As you mentioned in Chapter 1, in deep learning, inputs are converted to outputs in an unknown way. We usually call this the "black box".

However, in the first week of reading, Terrence Sejnowski writes in his 2020 paper that a mathematical theory of deep learning would shed light on how these networks work, allow us to evaluate the strengths and weaknesses of different network architectures, and lead to significant improvements.

So I'm really curious how cutting-edge research on the mathematical theory of deep learning is progressing. Have there been any major breakthroughs? Many thanks!

jsoll1 commented 3 years ago

I talked with a few people I know who work with artificial neural networks, and they told me that, in general, activation functions and loss functions aren't particularly important and you just select them arbitrarily. Is this true, or do they have more specific use cases than "I'm doing classification" or "I'm doing regression"?

cytwill commented 3 years ago

I have two questions to share with everyone:

  1. As both the book and the lecture suggest, monotonicity is a common characteristic of activation functions. However, while doing the first assignment and googling different types of activation functions, I found some newly invented ones that do not follow this rule, such as GELU and SiLU (see Wikipedia). What are the motivations or benefits of using such activation functions? (See the sketch after this list.)
  2. When designing a DNN of our own, it is always somewhat subjective to decide the number of layers and how many neurons to use in each layer. Besides doing a brute-force search to find an optimal structure, do we have any general rules of thumb for constructing a DNN? (I know transfer learning can make use of some existing powerful network structures and parameters, but I am not sure what to refer to if we need to build a brand-new one....)
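
As a small illustration of the first question, the sketch below (assuming PyTorch) evaluates GELU and SiLU at a few negative inputs; unlike ReLU, both dip below zero and then rise back toward zero, which is exactly the non-monotonic behavior asked about:

```python
import torch
import torch.nn.functional as F

# Negative pre-activation values where GELU and SiLU are non-monotonic:
# they dip below zero (minimum near z ≈ -0.75 for GELU and z ≈ -1.28 for SiLU)
# and then return toward zero as z becomes more negative.
z = torch.tensor([-4.0, -2.0, -1.0, -0.5, 0.0, 0.5, 1.0])

print("z:   ", z.tolist())
print("ReLU:", [round(v, 3) for v in torch.relu(z).tolist()])
print("GELU:", [round(v, 3) for v in F.gelu(z).tolist()])
print("SiLU:", [round(v, 3) for v in F.silu(z).tolist()])
```
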
nwrim commented 3 years ago

I would like to ditto the questions about how we can better select activation functions, the number of units in each layer, etc., using empirical/mathematical principles and theories. I think this might be the most practical lesson we can get out of the class!

Also, I would like to know if there are any cases where we would prefer multiple sigmoid activations over a softmax activation on the output layer when dealing with multi-class classification problems. In addition, would there be any case where performing a multi-class classification and summing it up into a binary classification actually makes the model better? (e.g., classify images as "white bear", "brown bear", "white wolf", or "brown wolf" and then use that as a classification of "bear" vs. "wolf")
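
To make the second idea concrete, here is a minimal sketch (assuming PyTorch; the four-way labels and logit values are purely illustrative) that turns softmax probabilities over fine-grained classes into a coarse bear-vs-wolf prediction by summing, and contrasts that with independent per-class sigmoids:

```python
import torch

# Hypothetical logits from an output layer with four fine-grained classes:
# [white bear, brown bear, white wolf, brown wolf]
logits = torch.tensor([2.0, 1.0, 0.5, -1.0])

# Softmax: one mutually exclusive distribution over the four classes.
p_fine = torch.softmax(logits, dim=0)

# Coarse binary prediction obtained by summing subclass probabilities.
p_bear = p_fine[0] + p_fine[1]
p_wolf = p_fine[2] + p_fine[3]
print("P(bear) =", round(p_bear.item(), 3), " P(wolf) =", round(p_wolf.item(), 3))

# Per-class sigmoids instead treat each label as an independent yes/no
# question, so the four probabilities need not sum to one (useful when
# the labels are not mutually exclusive).
print("independent sigmoids:", [round(v, 3) for v in torch.sigmoid(logits).tolist()])
```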

william-wei-zhu commented 3 years ago

In Chapter 1 of the book, you introduced several loss functions in regression and classification settings. Can you tell us more about how to choose among these loss functions to identify the optimal one depending on the data/context/objective?
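
As a rough map of that territory, here is a small sketch grouping standard PyTorch loss classes by the regression/classification/ranking split the chapter uses (the specific classes named here are common PyTorch choices and an assumption on my part, not necessarily the ones the book discusses):

```python
import torch.nn as nn

# Common loss functions grouped by the kind of task they fit,
# mirroring the regression / classification / ranking categorization.
losses_by_task = {
    "regression": [nn.MSELoss(), nn.L1Loss(), nn.HuberLoss()],
    "classification": [nn.CrossEntropyLoss(), nn.BCEWithLogitsLoss()],
    "ranking": [nn.MarginRankingLoss(), nn.TripletMarginLoss()],
}

for task, losses in losses_by_task.items():
    names = ", ".join(type(loss).__name__ for loss in losses)
    print(f"{task:>14}: {names}")
```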

Dxu1 commented 3 years ago

I have two questions to share with all, one conceptual and one more practical:

  1. Could you elaborate a bit more on what an embedding is and does? My immature understanding (which is likely wrong) is that it is a lower-dimensional representation of data/objects of higher dimensions. (See the sketch after this list.)
  2. Is there a general guideline for choosing an activation function for certain tasks? In Chapter 1, you categorize the loss functions into three main categories: for regression, classification, or ranking. Is there a similar guideline for activation functions? Furthermore, in practice, how much do we care about differentiability around zero (e.g., ReLU vs. Softplus)? Could you elaborate a bit more on the practical implications of mathematical properties like these? Thank you!
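
On the first question, here is a minimal sketch (assuming PyTorch; the vocabulary size and dimension are arbitrary) of an embedding as a learned lookup table that maps discrete, effectively high-dimensional one-hot inputs to dense low-dimensional vectors:

```python
import torch
import torch.nn as nn

# An embedding layer is a trainable lookup table: each of the 10,000
# vocabulary items (implicitly a one-hot vector of dimension 10,000) is
# mapped to a dense 64-dimensional vector that is learned along with the
# rest of the network.
embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=64)

token_ids = torch.tensor([3, 42, 99])   # three arbitrary item indices
vectors = embedding(token_ids)
print(vectors.shape)  # torch.Size([3, 64])
```
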
hesongrun commented 3 years ago

Thanks a lot for this inspirational reading! I have one question about the 'deep' in deep learning. Why do deep neural networks work better than shallow ones? One typical feature of modern artificial-intelligence successes is that they employ very deep models instead of shallow, wide ones. Do you have any intuition for why these deep models beat their shallow and wide counterparts? I am also wondering whether it is because a deep neural net can 'create variables' in its initial layers while a shallow one cannot. How can we better understand the variables created by the neural nets?

Thanks!
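
As one concrete way to frame the comparison, the sketch below (assuming PyTorch; the layer sizes are arbitrary) builds a deep, narrow MLP and a shallow, wide MLP with roughly similar parameter budgets, which is the kind of matched pair the deep-vs-wide question is usually asked about:

```python
import torch.nn as nn

def param_count(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

# Deep and narrow: several hidden layers of modest width.
deep_net = nn.Sequential(
    nn.Linear(100, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

# Shallow and wide: one hidden layer sized to give a similar budget.
wide_net = nn.Sequential(
    nn.Linear(100, 430), nn.ReLU(),
    nn.Linear(430, 10),
)

print("deep, narrow: ", param_count(deep_net), "parameters")
print("shallow, wide:", param_count(wide_net), "parameters")
```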

Yilun0221 commented 3 years ago

Thanks for the insightful reading! I am interested in the relationship between embedding techniques and deep neural networks, since it is mentioned that "text, image, graph and network based embeddings" will be discussed. From my perspective, these two families of approaches could be used to measure how much information people receive and how people link that information to what they already know. The barriers may include:

  1. understanding ability varies from person to person, which limits the generalization of our conclusions;
  2. the same words may mean different things in different regions. To be specific, the same words in different contexts may carry different meanings, and the meanings of subtexts may be ignored, which is very common in literature.

I also read a paper (Baroni et al., 2014) comparing context-counting models (such as LSA) with context-predicting models. Inspired by this paper, I think other topic-modeling algorithms, such as LDA and LDA with a hidden Markov model, may also be worth trying for exploring subtexts, combined with the word relationships given by WordNet.

I wonder how you think of this issue!

Reference:

  1. Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 238–247).

k-partha commented 3 years ago

It's interesting to think about NN nodes playing an inhibitory role (where they adopt some activation function that contains negative output values, as opposed to the standard ReLU). Are there any classes of functions/real-world applications where these potentially inhibitory activation functions play a critical role in learning? Have any classes of functions been shown to be computationally very inefficient when estimating solely through ReLU?

bakerwho commented 3 years ago

I have been very interested in the Hopfield network and other non-deep network architectures for machine learning - yet deep learning is indubitably superior in performance. How can we explain the success of deep networks over these types of densely connected architectures? Is the abstraction of 'representation learning' by each successive 'layer' that crucial to the success of deep learning? Are there other paradigms we can think of on which to design neural networks as learning algorithms? Are there tasks non-deep architectures might conceivably perform better on?

RahmanMustapha commented 3 years ago

It is stated that the link between modern neural networks and biological brains has weakened, but I am wondering to what extent and in which ways this is the case. While neural network models might not perform computations that are biologically realistic or feasible, there seem to be methods for creating neural networks that parse and process stimuli in a way similar to biological brains. Hierarchical convolutional neural networks have been shown to produce spiking histograms similar to those of the ventral visual stream in primates (https://pubmed.ncbi.nlm.nih.gov/24812127/), in which layers sensitive to some aspect of an image/object (location, orientation, shape, identity) are pooled to create a system that recognizes specific image qualities at increasing levels of complexity. Recurrent neural networks have also shown activations that might be comparable to those of decision-making centers in biological brains.

I am also very curious about the carbon footprint of training large models and how it is expected to change over time. I have learned that non-fungible tokens (units of data on a digital ledger called a blockchain) have been associated with very large carbon emissions as they become more heavily used. Is the concern similar for these data-hungry deep learning models, and is it a particularly significant or reasonable one?

luxin-tian commented 3 years ago

I have some thoughts following @hesongrun 's question. As far as I can tell, a deep and wide neural network gives the parameters more freedom and thus higher flexibility for the model to capture patterns embedded in the data. However, from the view of the canonical bias-variance trade-off, this also introduces the risk of overfitting. I was wondering how neural network algorithms reduce overfitting in practice. I can tell that regularization is a commonly used adaptation of the loss function to address overfitting, but how is it implemented for different loss functions?

I am also curious about the differences between the loss functions per se. As I tried in the second notebook of week 1, it seems that models with different numbers of layers lead to the same accuracy score on the test set when using a HingeEmbeddingLoss loss function. Since this seems strange, I tried to carefully review the code to make sure the result is real. Generally, since I don't have sufficient expertise in the mathematical foundations of neural nets, especially the loss function and back-propagation, I am just curious about the differences across loss functions and how they affect the risk of overfitting.
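
On the regularization point, here is a minimal sketch (assuming PyTorch; the architecture and hyperparameters are arbitrary) of two common recipes, L2 weight decay and dropout. Weight decay is applied through the optimizer, so it works the same way regardless of which loss function is plugged in:

```python
import torch
import torch.nn as nn

# A small classifier with dropout layers, which randomly zero a fraction
# of activations during training to discourage co-adaptation.
model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(64, 2),
)

# L2 regularization ("weight decay") is added via the optimizer, so it
# penalizes large weights no matter which loss function is chosen.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

criterion = nn.CrossEntropyLoss()  # any other loss could be swapped in here

x = torch.randn(32, 20)            # a dummy batch of inputs
y = torch.randint(0, 2, (32,))     # dummy class labels

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
print("one regularized training step, loss =", round(loss.item(), 3))
```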