UChicago-Thinking-Deep-Learning-Course / Readings-Responses


Chapter 2 on Optimization, Initialization, Regularization and Model Architectures. #16

Open jamesallenevans opened 3 years ago

jamesallenevans commented 3 years ago

Post and upvote questions related to the next available chapter on optimization, initialization, regularization, and model architectures for weeks 4/5 here. Note that I also welcome comments on the text, especially regarding options (e.g., regularizers) that I did not include or should not have included! If you make a suggestion that leads to a substantial new addition to the chapter (or the deletion of something that doesn't belong)--e.g., a change in the discussion of an algorithm--you will receive extra credit!

Raychanan commented 3 years ago

While reading this chapter, the question I was most curious about was how to determine the best optimization method, loss function, architecture, etc. In my previous practice, I just tried different setups. I have read papers where the authors mention tweaking various parameters until they found a setting that performed optimally. As I understand it, this process amounts to trying permutations. But wouldn't this approach to model training be infeasible when the number of parameters is large? More importantly, it seems too tedious. Is it possible to avoid this "permutation method" and choose parameters in a more formal, standardized way?
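One widely used alternative to the exhaustive "permutation method" is random search: sample each hyperparameter independently instead of enumerating every combination. A minimal pure-Python sketch — the objective function and the search ranges here are made up for illustration, standing in for a real train-and-validate run:

```python
import random

random.seed(0)

def validation_loss(lr, weight_decay):
    # Stand-in for training a model with these hyperparameters
    # and returning its validation loss. Any toy function works here.
    return (lr - 0.01) ** 2 + (weight_decay - 1e-4) ** 2

def random_search(n_trials=50):
    best = None
    for _ in range(n_trials):
        # Sample learning rate and weight decay on a log scale,
        # since their useful values span several orders of magnitude.
        lr = 10 ** random.uniform(-4, -1)
        wd = 10 ** random.uniform(-6, -2)
        loss = validation_loss(lr, wd)
        if best is None or loss < best[0]:
            best = (loss, lr, wd)
    return best

loss, lr, wd = random_search()
print(loss, lr, wd)
```

Random search tends to beat a grid of the same size when only a few hyperparameters matter, because it does not waste trials repeating the same value of an important parameter.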

pcuppernull commented 3 years ago

When engaging in regularization, optimization, etc to improve the performance of the model, it seems like there may be some sort of performance “ceiling” that regularization and optimization can get us to (conditional on the amount of data we have on hand, for example). How can we guess where that ceiling might be? In other words, if I want my model to perform at a certain level, how can I make an educated guess on whether I can get there with regularization/optimization? Is there a way to infer that regularization/optimization won’t get me far enough, and therefore I need to explore different model architectures and/or gather more or different data?
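One practical diagnostic for that "ceiling" is a learning curve: train on increasing subsets of the data, fit a power law (error ≈ c · n^(−α)) to the validation error, and extrapolate to see whether more data could plausibly reach the target. A minimal sketch — the subset sizes and error values below are hypothetical:

```python
import math

# Hypothetical validation errors measured on growing training subsets.
train_sizes = [1000, 2000, 4000, 8000]
val_errors = [0.30, 0.24, 0.19, 0.15]

# Fit error ~= c * n^(-alpha) by ordinary least squares in log-log space.
xs = [math.log(n) for n in train_sizes]
ys = [math.log(e) for e in val_errors]
x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)
slope_num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
slope_den = sum((x - x_mean) ** 2 for x in xs)
slope = slope_num / slope_den
alpha = -slope
c = math.exp(y_mean - slope * x_mean)

def predicted_error(n_samples):
    # Extrapolated validation error if we gathered n_samples examples.
    return c * n_samples ** (-alpha)

print(alpha, predicted_error(64000))
```

If the extrapolated error at a realistically attainable data size still misses the target, that is a signal the architecture (not just regularization or more data) needs rethinking.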

cytwill commented 3 years ago

Thanks for sharing these important perspectives on building a neural network. There is a lot to learn from the different types of neural networks in this chapter. I have several questions and suggestions so far:

  1. Among the different optimizers, SGD with momentum and Adam seem to be the most popular in tutorials. But methods like AdamW, AdaMax, and Nadam appear to improve on Adam, judging by the motivations behind their invention. So why are they used less often (if I am wrong, please correct me!)? Are they superior only on certain data types or data with certain characteristics? Or perhaps the improvement from these methods cannot offset the increase in computational complexity?

  2. For regularization methods, a general question: when do we know we need (more) regularization? I guess if we find the model is overfitting the data, we might try regularization methods; but if we hope to improve model performance, should we first consider making the model more complex, or simpler (regularization)? Also, since regularization methods vary in complexity, should we start with the simple ones (like L1, L2)? And would combining different regularization methods be a good idea, or could it cause over-regularization?

  3. For the different types of neural networks, I am curious whether they all mimic real cognitive processes in the human brain, that is to say, whether there are corresponding structures among human neurons. Also, though it is mentioned for some NN types, I think it would be helpful to provide a table mapping the different NN types to the tasks and data types/structures they are suited to, so readers can get a quick understanding of their functionality.

  4. Some other applications of NNs: NeuralODE models can be used to learn ODE functions / predict a continuous-time process; the authors reported superior performance to RNNs in their paper, Neural Ordinary Differential Equations.
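On point 2 above: L1 and L2 regularization are just extra terms added to the loss, so it may help to see how each shapes the gradient. A minimal pure-Python illustration — the weight values and penalty strength `lam` are arbitrary, chosen only for demonstration:

```python
# Toy illustration of L1 vs. L2 penalties on a small weight vector.
weights = [0.5, -0.2, 0.0, 1.5]
lam = 0.01  # regularization strength (hypothetical)

# L2 (ridge) adds lam * sum(w^2); its gradient shrinks each weight
# in proportion to the weight's own size.
l2_penalty = lam * sum(w ** 2 for w in weights)
l2_grad = [2 * lam * w for w in weights]

def sign(w):
    return (w > 0) - (w < 0)

# L1 (lasso) adds lam * sum(|w|); its gradient has constant magnitude,
# which pushes small weights all the way to zero (hence sparsity).
l1_penalty = lam * sum(abs(w) for w in weights)
l1_grad = [lam * sign(w) for w in weights]

print(l2_penalty, l1_penalty)
```

This is also why combining penalties is not automatically "over-regularization" — elastic net, for instance, deliberately mixes L1 and L2 — but each added penalty does bias the model further, so the combined strength still needs tuning on validation data.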

jsoll1 commented 3 years ago

I'm feeling a lot of choice overload with the different neural network architectures (as well as everything else). Their individual specialties seem pitched at a higher level than the way I'm accustomed to working. Are there more general rules of thumb?

nwrim commented 3 years ago

I think my classmates have already posted very similar questions, but it would be great to have some wisdom on how to choose the "animal" in the gigantic menagerie we walked through. Specifically, as I scramble through conference proceedings looking for the possibility readings each week, I see some innovation/upgrade/ensemble of methods every year. How do we balance trying a cool adaptation of the basic models we have seen against sticking with a simple model that has been around for a while? I ask this as more of a social-science person, since I feel a social scientist's job is often applying the cool things the AI/DL folks built to actual societal data, rather than trying to build a cutting-edge architecture.

william-wei-zhu commented 3 years ago

Are newer, more complicated architectures always better than older, simpler architectures in terms of performance?

Yilun0221 commented 3 years ago

I think sometimes it is hard to tell which DNN model will perform better on a certain data set. Shall we try them one by one when considering employing a DNN model? Are some data sets inherently more suitable for certain DNN models?

k-partha commented 3 years ago

Finding the optimal hyperparameter combination is really important for deep learning models but can also be really time-consuming, since each configuration may take hours to train, even just to fine-tune. Are there any search strategies/algorithms/best practices/model-specific advice resources for finding the best hyperparameter combinations when training DL models? Grid searches (carpet bombing) are often just too inefficient. In practice, how often do we encounter nonlinearities in the search space/response for any single hyperparameter?
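Beyond grid search, the usual candidates are random search, Bayesian optimization (e.g., via libraries such as Optuna), and bandit-style schemes like successive halving / Hyperband, which kill poorly performing configurations before spending a full training budget on them. A minimal successive-halving sketch in pure Python — the `loss_after` function is a toy stand-in for "train this configuration for `budget` epochs and report validation loss":

```python
import math
import random

random.seed(1)

def loss_after(config, budget):
    # Stand-in for a real partial training run: loss decays toward a
    # config-specific floor as the training budget grows.
    floor, rate = config
    return floor + math.exp(-rate * budget)

def successive_halving(n_configs=16, min_budget=1, eta=2):
    # Start many random configs on a tiny budget, keep the best 1/eta
    # each round, and multiply the budget for the survivors by eta.
    configs = [(random.uniform(0.0, 0.5), random.uniform(0.1, 1.0))
               for _ in range(n_configs)]
    budget = min_budget
    while len(configs) > 1:
        scored = sorted(configs, key=lambda c: loss_after(c, budget))
        configs = scored[: max(1, len(configs) // eta)]
        budget *= eta
    return configs[0]

best = successive_halving()
print(best)
```

The total compute is far below training all sixteen configurations to completion, which is the whole appeal when a single full run takes hours. The caveat is the assumption that early performance predicts final performance, which does not hold for every model/schedule.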

hesongrun commented 3 years ago

Thanks for the inspirational reading! What do you think is the most important factor we should pay attention to when constructing a neural network? Network structure? Regularization techniques? or Optimization algorithms? Thanks!

bakerwho commented 3 years ago

This was a great reading! I'm curious what mechanisms analogous to 'attention' we can imagine while exploring the space of new architectures. Even the LSTM implements a really interesting and intuitive logic for time-series data. What about 'creative' architectures? What about 'explainable' ones (perhaps neurons that estimate a confidence on the outputs of other neurons)?