Thinking-with-Deep-Learning-Spring-2022 / Readings-Responses

You can post your reading responses in this repository.

Deep Architectures, Training & Taming - Orientation #2

lkcao opened 2 years ago

lkcao commented 2 years ago

Post your questions here about: “Training and Taming Deep Networks” OR “The Expanding Universe of Deep Learning Models”--Thinking with Deep Learning, chapters 3 & 4.

thaophuongtran commented 2 years ago

Question for Chapter 3 - Training and Taming Deep Models: In this chapter we learn about different optimizers, including Resilient Backprop and Stochastic Gradient Descent. As I read through the different variations, I'm curious: are some optimizers computationally lighter, or do some perform better than others?
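A minimal PyTorch sketch (toy model, arbitrary settings) of how the per-parameter state, and so the memory and compute cost, differs across these optimizers:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # toy model, just to have some parameters

# SGD keeps no extra per-parameter state (the lightest option).
sgd = torch.optim.SGD(model.parameters(), lr=0.01)

# Rprop tracks a separate step size for every parameter (roughly one extra tensor per weight tensor).
rprop = torch.optim.Rprop(model.parameters(), lr=0.01)

# Adam tracks two running moment estimates per parameter (heavier in memory, but often converges in fewer steps).
adam = torch.optim.Adam(model.parameters(), lr=0.001)
```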

JadeBenson commented 2 years ago

Thank you! This was helpful for understanding the parameters we used in our last homework and how we might expect them to work and affect performance. I had a question about Figure 4-1 in Chapter 4, though. I'm still a little unclear on why those groupings were chosen and what aspects group the architectures together in that structure. I think this "child of" description is really interesting, but it was a little hard to trace through the chapter why certain models are described as "children of" others, or which characteristics mattered most for this classification diagram. These sorts of schemas are so helpful since they give context and show the relationships between architectures, so I'd love to see the groupings and the choices behind them spelled out a little more clearly. Thank you!

isaduan commented 2 years ago

Could you please explain the regularization strategy of 'pooling nodes/layers' in a bit more detail? The text reads like "coarse graining information signals from input data by retaining some function of the features from inputs (e.g., max, average, min), but collapsing multiple nodes from a prior layer into a single node in the next one."
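A rough PyTorch sketch (arbitrary tensor sizes) of that collapsing: each 2x2 patch of nodes from the prior layer is reduced to a single node carrying the patch's maximum.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)       # (batch, channels, height, width)
pool = nn.MaxPool2d(kernel_size=2)  # every 2x2 patch collapses into one value: its max
print(pool(x).shape)                # torch.Size([1, 8, 16, 16])
```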

pranathiiyer commented 2 years ago

> Could you please explain the regularization strategy of 'pooling nodes/layers' in a bit more detail? The text reads like "coarse graining information signals from input data by retaining some function of the features from inputs (e.g., max, average, min), but collapsing multiple nodes from a prior layer into a single node in the next one."

Adding onto Isabella's question: regularization in Keras seems to have three components, namely kernel, bias, and activity regularization. How does this compare against regularization as used in PyTorch? I understand that kernel regularization applies to the weights, but I still don't understand what role each of these plays and how one should choose among them.
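A hedged sketch comparing the two APIs (layer sizes and penalty strengths are arbitrary):

```python
import tensorflow as tf
import torch
import torch.nn as nn

# Keras: regularizers attach to a layer and each targets a different tensor.
dense = tf.keras.layers.Dense(
    64,
    kernel_regularizer=tf.keras.regularizers.l2(1e-4),    # penalizes the weights
    bias_regularizer=tf.keras.regularizers.l2(1e-4),      # penalizes the biases
    activity_regularizer=tf.keras.regularizers.l1(1e-5),  # penalizes the layer's outputs
)

# PyTorch: the closest built-in is weight decay, an L2 penalty on all parameters...
model = nn.Linear(128, 64)
opt = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# ...while bias- or activity-specific penalties are usually added to the loss by hand.
x = torch.randn(32, 128)
out = model(x)
task_loss = out.pow(2).mean()                # placeholder for the real task loss
loss = task_loss + 1e-5 * out.abs().mean()   # hand-rolled L1 "activity" penalty
```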

borlasekn commented 2 years ago

At the end of Chapter Three, after running through ways to build Neural Networks that will learn patterns, the text states that doing this work will allow a given Neural Network to "become a general purpose tool". I was hoping to get a bit more clarity on this. Does this mean that these Networks become generalizable to a certain degree? Or that they can just be used to provide more general insight on a given topic because they learn in ways that transcend the data?

sabinahartnett commented 2 years ago

Thank you for sharing these chapters! They have definitely been very grounding for some of the applications we're reading in the possibility readings. I am hoping you could expand a bit on transfer learning (Chapter 3 describes the value of borrowing/inheriting weights from a NN trained on 'similar' data). How are use cases determined for transfer learning, and how is 'similarity' defined here? I've heard a lot of hype around transfer learning in a number of deep NN NLP applications and am curious how to determine the appropriate moments to apply it across data, models, and applications.
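As a concrete (hypothetical) example of borrowing weights, here is a minimal sketch using a torchvision ResNet pretrained on ImageNet as the 'similar' source:

```python
import torch.nn as nn
import torchvision.models as models

# Load weights learned on ImageNet (the source data assumed to be "similar enough").
backbone = models.resnet18(pretrained=True)

# Freeze the borrowed layers so only the new head is trained.
for param in backbone.parameters():
    param.requires_grad = False

# Replace the final layer for the new task (say, 5 target classes).
backbone.fc = nn.Linear(backbone.fc.in_features, 5)
```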

egemenpamukcu commented 2 years ago

Thanks for sharing the chapters, they were really informative. I would like to hear more about the connection between data sparsity and network architecture/hyperparameters. How and why does sparse data affect these choices?
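One concrete place where sparsity shows up in these choices, sketched with made-up sizes: sparse, high-cardinality categorical inputs pair naturally with embedding layers and sparse-aware optimizers.

```python
import torch
import torch.nn as nn

# Sparse categorical input (e.g., token or category IDs over a 50k vocabulary)...
embedding = nn.Embedding(num_embeddings=50_000, embedding_dim=64, sparse=True)

# ...pairs with an optimizer that only updates the embedding rows actually seen in a batch.
opt = torch.optim.SparseAdam(embedding.parameters(), lr=1e-3)
```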

mdvadillo commented 2 years ago

In Chapter 3, when we talk about the various ways of implementing gradient descent, the textbook specifies that certain algorithms lead to finding a global minimum, while others carry no such guarantee. I was wondering whether it is possible to get stuck in a local minimum while training the model, whether there is any guarantee that the minimizer is unique, and whether the result would be affected if it were not unique.
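A toy illustration (hypothetical 1-D loss, not from the book) of why plain gradient descent carries no such guarantee: different starting points settle into different minima.

```python
import torch

def f(w):
    # A toy non-convex "loss" with several local minima.
    return torch.sin(3 * w) + 0.1 * w ** 2

for start in (-2.0, 2.0):
    w = torch.tensor(start, requires_grad=True)
    opt = torch.optim.SGD([w], lr=0.05)
    for _ in range(200):
        opt.zero_grad()
        loss = f(w)
        loss.backward()
        opt.step()
    # Each starting point converges to a different (local) minimum.
    print(f"start={start:+.1f} -> w={w.item():.2f}, loss={f(w).item():.3f}")
```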

ValAlvernUChic commented 2 years ago

Thank you for sharing the chapters! They really gave quite a great view of the different ways we can play with the models for the best performance. I was hoping to get a bit more clarity on the passage where the book says "Some RNNs use the architecture of recurrence not to model sequences, but for its other properties." Specifically, I'd like some intuition behind these "randomly connected computational reservoirs" and what they're mainly used for.
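A minimal echo-state-style sketch (arbitrary sizes) of what such a reservoir can look like: the recurrent weights are random and frozen, and only a linear readout is trained.

```python
import torch
import torch.nn as nn

class Reservoir(nn.Module):
    def __init__(self, in_dim=10, res_dim=200, out_dim=1):
        super().__init__()
        # Random, untrained input and recurrent weights form the "reservoir".
        self.w_in = nn.Parameter(torch.randn(res_dim, in_dim) * 0.1, requires_grad=False)
        self.w_res = nn.Parameter(torch.randn(res_dim, res_dim) * 0.05, requires_grad=False)
        self.readout = nn.Linear(res_dim, out_dim)  # the only trained part

    def forward(self, x):               # x: (seq_len, in_dim)
        h = torch.zeros(self.w_res.shape[0])
        for x_t in x:                   # recurrence gives a rich, fixed memory of the sequence
            h = torch.tanh(self.w_in @ x_t + self.w_res @ h)
        return self.readout(h)          # read off the final reservoir state
```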

BaotongZh commented 2 years ago

In chapter 4, the book mentions the self-attention network, which is used to deal with sequence-to-sequence problems. However, a paper (https://arxiv.org/abs/2005.12872) shows that self-attention networks can also be used to solve some image problems. So, what are the differences between a self-attention network and a CNN?
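A small sketch of the mechanical difference (toy feature map): a convolution mixes only a local neighborhood, while self-attention, as in the DETR-style setup of that paper, lets every position attend to every other position regardless of distance.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 16, 16)                   # (batch, channels, H, W) feature map

# CNN: each output position mixes only a local 3x3 neighborhood.
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)
local = conv(x)

# Self-attention: flatten the grid into a sequence so every position can attend to all others.
seq = x.flatten(2).permute(0, 2, 1)              # (batch, 256 positions, 64 channels)
attn = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
global_ctx, _ = attn(seq, seq, seq)
```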

javad-e commented 2 years ago

In the regularization section of chapter 3, there is a brief discussion on complementary outputs: “Requiring a model to fulfill multiple tasks or produce multiple model outputs can regularize hidden layers that are shared between those tasks if they are complementary and contribute to the discovery of shared signals within the internal representation”. Could you explain how this results in regularization?
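A hedged sketch (arbitrary sizes and placeholder targets) of what complementary outputs look like in code: two task heads share one hidden trunk, so gradients from both losses shape the same internal representation, which is where the regularizing pressure comes from.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

shared = nn.Sequential(nn.Linear(100, 64), nn.ReLU())  # trunk shared by both tasks
head_a = nn.Linear(64, 10)   # e.g., a classification task
head_b = nn.Linear(64, 1)    # e.g., a related regression task

x = torch.randn(32, 100)
y_a = torch.randint(0, 10, (32,))
y_b = torch.randn(32, 1)

h = shared(x)
loss = F.cross_entropy(head_a(h), y_a) + F.mse_loss(head_b(h), y_b)
loss.backward()              # gradients from both tasks flow into the shared layer
```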

sudhamshow commented 2 years ago

1. When we use GPUs for training deep learning networks, what exactly is computed in parallel? Is the calculation of the weights associated with each hidden node local, and if so, can it be computed in parallel?
2. We've seen a couple of techniques for network regularisation, including dropping nodes (Dropout). Can this be extended to randomly removing links, or randomly shuffling links between the nodes of different layers? Does this improve performance?
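A rough sketch of both points (toy sizes); the link-removal idea in (2) is usually known as DropConnect, hand-rolled here rather than a built-in layer:

```python
import torch

# (1) The parallel work on a GPU is mostly big matrix multiplications: every node's
# pre-activation in a layer comes out of the same matmul, computed at once.
x = torch.randn(256, 512)   # a batch of 256 inputs
w = torch.randn(512, 1024)  # weights of a layer with 1024 hidden nodes
if torch.cuda.is_available():
    x, w = x.cuda(), w.cuda()
out = x @ w                                  # all 256 x 1024 node values in parallel

# (2) Dropping links instead of nodes (DropConnect-style), as a hand-rolled mask:
keep = (torch.rand_like(w) > 0.5).float()    # random mask over individual weights
out_dropconnect = x @ (w * keep)             # individual links, not whole nodes, are removed
```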

Hongkai040 commented 2 years ago

I am interested in the hyperparameter search strategies and their pros and cons. For example, does Population-Based Training require a huge amount of memory (which may not be suitable for large language models) and coordination across the training of the different models? Is Bayesian Optimization's mechanism similar to the Naive Bayes machine learning model? I'd appreciate a comparison of the pros and cons of these methods!
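As a baseline to compare those fancier strategies against, a minimal random-search sketch; train_and_score is a hypothetical placeholder you would replace with real training and validation:

```python
import random

def train_and_score(lr, hidden, dropout):
    # Placeholder: build a model with these settings, train briefly,
    # and return its validation score.
    return random.random()

search_space = {
    "lr": [1e-4, 1e-3, 1e-2],
    "hidden": [64, 128, 256],
    "dropout": [0.0, 0.2, 0.5],
}

best = None
for _ in range(20):                                   # 20 random trials
    trial = {k: random.choice(v) for k, v in search_space.items()}
    score = train_and_score(**trial)
    if best is None or score > best[0]:
        best = (score, trial)

print(best)
```

Population-Based Training and Bayesian Optimization both try to spend these trials more cleverly: PBT by evolving a whole population of concurrently training models (hence the memory and coordination cost), and Bayesian Optimization by fitting a surrogate model to past (hyperparameters, score) pairs, which is unrelated to Naive Bayes beyond the shared use of Bayes' rule.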

Yaweili19 commented 2 years ago

In Chapter 3 we went over the training and taming process of neural networks. I understand optimization and regularization, but I am not sure what the initialization step is doing. In our homework for the first week, we built a few models without explicitly choosing initialization options. When should we use which approach? What are their pros, cons, and typical examples? I'd like to know more about those.
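A short PyTorch sketch (toy layer) of what choosing an initialization explicitly looks like; when you don't choose, as in the first homework, PyTorch just applies its own default scheme to each layer:

```python
import torch.nn as nn
import torch.nn.init as init

layer = nn.Linear(256, 128)   # gets PyTorch's default (Kaiming-uniform-style) init on creation

# Overriding it explicitly:
init.xavier_uniform_(layer.weight)                        # often paired with tanh/sigmoid
init.kaiming_normal_(layer.weight, nonlinearity='relu')   # scaled for ReLU layers
init.zeros_(layer.bias)
```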

min-tae1 commented 2 years ago

The section on regularization in chapter 3 suggests that smaller batch sizes reduce test error. Then what would be the appropriate batch size for each case, since batches that are too small also lead to problems, such as requiring a lot more computation?
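For concreteness, a small sketch (toy data) of where that choice enters the code: the batch size is just a DataLoader argument trading gradient noise against how efficiently the hardware is used.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

data = TensorDataset(torch.randn(1000, 20), torch.randint(0, 2, (1000,)))

small_loader = DataLoader(data, batch_size=16, shuffle=True)   # noisier gradients, more updates per epoch
large_loader = DataLoader(data, batch_size=256, shuffle=True)  # smoother gradients, better hardware utilization
```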

yujing-syj commented 2 years ago

I have two questions about chapter 3. First, when we use different activation functions, does the way we choose the initialization change? Second, among the hyperparameter search options, is Bayesian Optimization the most common method, since it has low cost and builds on the history of previous results?
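On the first question, a small sketch (toy layer) of one way the activation enters the initialization in PyTorch: a gain factor rescales the initial weights for the chosen nonlinearity.

```python
import torch.nn as nn
import torch.nn.init as init

layer = nn.Linear(256, 128)

# The gain rescales Xavier initialization for the activation the layer feeds into.
init.xavier_uniform_(layer.weight, gain=init.calculate_gain('tanh'))
init.xavier_uniform_(layer.weight, gain=init.calculate_gain('relu'))
```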

zihe-yan commented 2 years ago

Concerning Chapter 3, I am still a bit confused about the definition of sparse. Clearly, a corpus can be counted as sparse, but how should we decide for other, more nuanced datasets? For example, in this week's tutorial 1, the Covertype dataset has a variable that is described by 40 different columns with a binary coding for each. Would such a dataset be counted as sparse? This type of coding is very typical in social science studies, so how can we place this kind of dataset on the powerful tool map laid out for us in Chapter 3?
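A toy sketch of that situation: a block of one-hot columns is indeed mostly zeros (sparse), and one common alternative is to keep the single category index and learn a dense embedding for it (sizes made up here).

```python
import torch
import torch.nn as nn

one_hot = torch.zeros(5, 40)                                  # 5 rows, 40 binary columns
one_hot[torch.arange(5), torch.tensor([3, 17, 17, 0, 39])] = 1.0
print((one_hot == 0).float().mean())                          # ~0.975 of the entries are zero

indices = one_hot.argmax(dim=1)                               # back to one categorical column
dense = nn.Embedding(40, 8)(indices)                          # learned 8-dimensional representation
```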

chentian418 commented 2 years ago

The ideas of gradient descent and backpropagation in neural networks are very elegant and clever. I am curious about the learning rate by which backprop descends the gradient. Can this learning rate be controlled by hyperparameters, and what are some of its determinants? Since the improvements brought by gradient descent do not always help, how can we systematically explore the scenarios where optimization really improves our tasks?
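A minimal sketch (toy model) of where the learning rate lives: it is itself a hyperparameter passed to the optimizer, and schedulers can change it over training.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)

opt = torch.optim.SGD(model.parameters(), lr=0.1)  # the learning rate is set here

# Schedulers adjust it during training; here it shrinks by 10x every 30 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(opt, step_size=30, gamma=0.1)
```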

yhchou0904 commented 2 years ago

In Chapter 3, lots of optimization and regularization methods are introduced. When talking about augmenting data with randomness, the author discusses adding random noise, variation, and mixup. I am a bit confused about the difference between bootstrapping and adding random noise. I know bootstrapping samples from the data's empirical distribution rather than being totally random, but what is the intuitive explanation of the difference?
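A tiny sketch (toy data) of the mechanical difference: bootstrapping resamples existing rows with replacement and never invents new values, while noise augmentation keeps every row but perturbs the values themselves.

```python
import torch

x = torch.randn(100, 20)                      # a toy feature matrix

# Bootstrapping: resample whole rows with replacement; only observed values appear.
idx = torch.randint(0, 100, (100,))
bootstrap_sample = x[idx]

# Noise augmentation: keep every row but add random perturbations to the values.
noisy = x + 0.1 * torch.randn_like(x)
```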

Emily-fyeh commented 2 years ago

Combining the content of Chapter 4 with the ideas from the possibility readings this week: are there preferences or common practices for constructing deep learning model architectures in the computational social science field? (For example, as I understand it, a shallower and broader model might be more intuitive to interpret.)

hsinkengling commented 2 years ago

Out of pure curiosity, what does the mapping of tasks to neural network models mentioned in Ch. 4 look like? How common is it for a task to be achievable by different models with similar levels of performance? Or does each kind of task usually map onto one specific model, with that model alone suitable for the task? How specifically should we think about the mapping between tasks and models?

chuqingzhao commented 2 years ago

Chapter 3 introduces the concept of batch size, and in last week's code we also used different batch sizes and numbers of epochs to train the model. I am a little bit confused about the difference between the two concepts. How should we find the optimal batch size and number of epochs?
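A minimal sketch (toy data) of how the two concepts relate inside a training loop:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

loader = DataLoader(TensorDataset(torch.randn(1000, 20)), batch_size=50, shuffle=True)

for epoch in range(5):            # epoch = one full pass over all 1000 examples
    for (batch,) in loader:       # batch = the 50 examples used for one weight update
        pass                      # so there are 1000 / 50 = 20 updates per epoch here
```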

linhui1020 commented 2 years ago

I am confused about the seq2seq model. The text says this model can be used for language translation and that it employs an attention mechanism to find where attention is most concentrated at each step. I wonder whether we can use a seq2seq model to understand the relationship between an individual's programming language and their native language.
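A minimal seq2seq skeleton (toy vocabulary sizes, GRU-based rather than the attention variant) showing the main constraint on that kind of application: the model needs paired source and target sequences to learn from.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=1000, tgt_vocab=1000, dim=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src, tgt):                       # src, tgt: (batch, seq_len) token IDs
        _, h = self.encoder(self.src_emb(src))         # h summarizes the source sequence
        dec_out, _ = self.decoder(self.tgt_emb(tgt), h)
        return self.out(dec_out)                       # per-step scores over the target vocab
```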