lkcao opened 10 months ago
In Section 12.2.1.1, "Contrast Normalization," Goodfellow et al. discuss two preprocessing techniques, Global Contrast Normalization and Local Contrast Normalization, for standardizing pixel intensity values across images. Given that GCN normalizes pixel values across the entire image while LCN does so within localized regions, why is GCN less effective at enhancing prominent image features such as edges and corners, and why is LCN able to address this effectively, especially considering that features like edges and corners are typically perceived as globally significant attributes rather than localized phenomena?
Echoing @yuzhouw313's question: can we say LCN can highlight features because it normalizes within each small region, so that salient local features are not wiped out?
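For concreteness, here is a minimal NumPy/SciPy sketch (function and variable names are mine, not the book's) contrasting the two: GCN rescales the whole image by a single mean and standard deviation, so a locally strong edge barely moves the global statistics, while LCN rescales every pixel by its own neighborhood, which is what lets it amplify local structure like edges and corners.

```python
import numpy as np
from scipy.ndimage import uniform_filter  # assumed available for the local version

def global_contrast_normalize(X, s=1.0, lam=10.0, eps=1e-8):
    """Global contrast normalization, roughly following Section 12.2.1.1.

    X: grayscale image as a 2-D float array. The whole image is shifted to
    zero mean and divided by one global scale, so a strong edge in one
    corner barely changes the statistics of the rest of the image."""
    X = X - X.mean()
    contrast = np.sqrt(lam + np.mean(X ** 2))
    return s * X / max(contrast, eps)

def local_contrast_normalize(X, size=9, s=1.0, eps=1e-8):
    """Simplified local contrast normalization sketch (the book's version
    uses Gaussian weighting; a box filter is used here for brevity).

    Each pixel is normalized by the mean and standard deviation of a small
    window around it, so low-contrast regions are amplified and local
    edges/corners stand out relative to their neighborhood."""
    local_mean = uniform_filter(X, size=size)
    centered = X - local_mean
    local_std = np.sqrt(uniform_filter(centered ** 2, size=size))
    return s * centered / np.maximum(local_std, eps)
```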
I'm trying to build a connection between computer vision and the text embeddings we worked with:
It looks like preprocessing an image (or signal) can lead to information loss (e.g., from lowered resolution) -- I never thought about this when we create embeddings for text. Is there information in text that word embeddings overlook?
Can we do data augmentation for a text corpus, just like rotating/zooming images during preprocessing? Or, more generally, what can we do when we have a limited data size?
With the increasing computational demands of deep learning models, what innovative approaches could be developed to optimize their energy consumption and minimize their environmental footprint, especially in large-scale applications? How can applications in speech recognition and computer vision be specifically tailored to improve emergency response systems, such as enhancing the accuracy of real-time speech-to-text translation for emergency calls in diverse languages or analyzing live video feeds for immediate threat assessment?
The textbook focuses on CNNs, which are useful but also somewhat dated; transformer architectures are quite popular in computer vision nowadays. Can you talk about how transformers work for image recognition and image generation tasks?
I get the transformer architecture for text, but I am very confused when it comes to images.
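As a rough answer to the image confusion: Vision-Transformer-style models first turn the image into a sequence of patch "tokens" and then reuse the standard text-transformer encoder. A minimal sketch, assuming NumPy and illustrative shapes (not any particular library's API):

```python
import numpy as np

def image_to_patch_tokens(image, patch_size=16, d_model=64, rng=np.random.default_rng(0)):
    """Split an (H, W, C) image into non-overlapping patches and project each
    flattened patch to a d_model-dimensional embedding: one 'token' per patch,
    analogous to one embedding per word in text."""
    H, W, C = image.shape
    patches = []
    for i in range(0, H - H % patch_size, patch_size):
        for j in range(0, W - W % patch_size, patch_size):
            patches.append(image[i:i + patch_size, j:j + patch_size, :].ravel())
    patches = np.stack(patches)                       # (num_patches, patch_size*patch_size*C)
    W_embed = rng.normal(size=(patches.shape[1], d_model)) * 0.02  # learned in practice
    tokens = patches @ W_embed                        # (num_patches, d_model)
    tokens += rng.normal(size=tokens.shape) * 0.02    # placeholder position embeddings
    return tokens  # fed to a standard transformer encoder, like word embeddings
```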
As a part of the contrast normalization procedure, is it possible to use the contrast information to assess features of the image that are often represented with different levels of contrast (e.g., relative distance from the lens)? I’d imagine so, but I’m not sure how complicated it is to perform such an analysis.
The textbook is clear in explaining the basic concepts of computer vision, and the most intriguing one to me is augmentation. It says that we can have different instantiations of the model vote to determine the output, which is quite similar to an ensemble method. I am curious whether there is a tool for ensuring that each version or representation of the same data is well optimized, and thus for improving the efficiency of augmentation.
In relation to efficient GPU code, they write: "Typically, one can do this by building a software library of high-performance operations like convolution and matrix multiplication, then specifying models in terms of calls to this library of operations. For example, the machine learning library Pylearn2 (Goodfellow et al., 2013c) specifies all its machine learning algorithms in terms of calls to Theano (Bergstra et al., 2010; Bastien et al., 2012) and cuda-convnet (Krizhevsky, 2010), which provide these high-performance operations." Where would you recommend we start, and what is the most applicable library for learning? It seems like there is a lot of documentation, but some tutorials and documentation are substantially better than others. Is there a good reference library for those doing social science research?
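On where to start today: Pylearn2 and Theano are no longer actively developed, and PyTorch (or TensorFlow/JAX) is now a common entry point; the same "call into a library of optimized operations" idea still applies. A minimal PyTorch sketch (tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

# One image: batch of 1, 3 channels, 32x32 pixels.
x = torch.randn(1, 3, 32, 32)
# 16 filters, each with a 3x3 receptive field over 3 input channels.
w = torch.randn(16, 3, 3, 3)

# The convolution itself is a single call into the library's optimized
# (GPU-capable) kernel, in the same spirit as the Theano/cuda-convnet
# calls the textbook describes.
y = F.conv2d(x, w, padding=1)
print(y.shape)  # torch.Size([1, 16, 32, 32])
```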
The reading is very interesting. I wonder how machine learning models, especially deep learning techniques, tackle the challenge of link prediction in knowledge graphs when most knowledge bases are manually curated and are likely to miss a large share of true relations.
Given the processing power required for image processing, and for LMMMs in general, is there a fear that we will soon hit the limit of what is fiscally or practically possible in this research space? Over the last couple of years, there has been talk that we are approaching the end of Moore's Law's applicability. LMMMs and LLMs are fascinating in their capabilities, but I do worry that we are getting close to the practical limits of what we can do, or of what is ethically worth pursuing, with these models.
As the "12.5.1 Recommender Systems" section notes, a major challenge for recommender systems is handling new users or new items (the cold start problem), since they lack interaction history with existing users or items. To mitigate the cold start problem, could we explore using users' typing input methods to gather latent contextual information and enhance recommendation quality?
I'm interested in understanding the tasks that GCN and LCN are suitable for. Are these techniques mainly used in the detection and classification tasks? What are the preprocessing steps for prediction and generation?
In the context of the first chapters, how might future advancements in these architectures improve their efficiency and accuracy on complex datasets? What implications could such enhancements have for critical applications like autonomous driving and medical image analysis?
My question is similar to Yunfei's. I was also interested in the data augmentation the authors mention at the beginning of the section on preprocessing. It seems that training on multiple instantiations of the same image is very costly, both in terms of collecting images and computationally in terms of training the model. What is a typical budget for projects that incorporate LMMMs? Understanding that projects vary vastly, I am curious how researchers weigh computational and funding costs against data quality.
How do Graph Convolutional Networks (GCN) adapt the principles of convolution from traditional image processing to graph data, and what challenges are associated with applying these principles to the irregular structure of graphs?
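Here GCN means graph convolutional networks (not the contrast normalization discussed above). A minimal NumPy sketch of one propagation step in the spirit of Kipf and Welling (2017), with illustrative names, showing how "convolution" becomes degree-normalized neighbor averaging plus a shared weight matrix rather than a kernel sliding over a regular grid:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer (sketch).

    A: (N, N) adjacency matrix of the graph (irregular structure, no grid).
    H: (N, F_in) node features; W: (F_in, F_out) learned weights.
    Each node averages its neighbors' features with degree normalization,
    then applies a shared linear map -- the graph analogue of weight
    sharing in image CNNs."""
    A_hat = A + np.eye(A.shape[0])               # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt     # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)       # ReLU nonlinearity
```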
In the Recommender Systems section, it is mentioned that we can use collaborative filtering to enhance the model's understanding of the relationship between users and items. This is a good idea. I am curious whether we can also utilize the relationships between users and users, as well as between items and items, to further the model's understanding and achieve better results.
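Using item-item (or, symmetrically, user-user) relationships is a standard neighborhood-based variant of collaborative filtering. A toy NumPy sketch with a made-up rating matrix:

```python
import numpy as np

# R[u, i] is user u's rating of item i (0 = unrated). Two items are "related"
# if the same users rated them similarly; transposing R gives user-user similarity.
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

def cosine_sim(M):
    norms = np.linalg.norm(M, axis=0, keepdims=True) + 1e-8
    return (M / norms).T @ (M / norms)

item_sim = cosine_sim(R)            # (items x items) similarity
user = 0
scores = R[user] @ item_sim         # weight items by similarity to items the user liked
scores[R[user] > 0] = -np.inf       # don't re-recommend items already rated
print("recommend item", int(np.argmax(scores)))
```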
In the section on data augmentation, it's mentioned that enhancing a classifier's generalization capability can be achieved by enlarging the training dataset. This involves adding modified versions of existing training examples through transformations that do not alter the class of the data. This leads me to question how we can apply similar techniques to text data. Is there a method for augmenting text data to improve model performance?
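Simple label-preserving text transformations are one common answer. A minimal sketch in the spirit of "easy data augmentation" (synonym replacement and random deletion); the synonym table and sentence are toy examples, not from any library:

```python
import random

SYNONYMS = {"good": ["great", "decent"], "movie": ["film"], "really": ["truly"]}

def synonym_replacement(tokens, p=0.3, rng=random.Random(0)):
    """Swap some words for synonyms; the sentence's class label is unchanged."""
    return [rng.choice(SYNONYMS[t]) if t in SYNONYMS and rng.random() < p else t
            for t in tokens]

def random_deletion(tokens, p=0.1, rng=random.Random(0)):
    """Drop a few words at random, but never delete everything."""
    kept = [t for t in tokens if rng.random() > p]
    return kept or tokens

sentence = "this movie is really good".split()
print(synonym_replacement(sentence))
print(random_deletion(sentence))
```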
I want to know how, in the natural language processing setting, a computer recognizes ambiguous content, such as rhetorical tone or nouns with multiple meanings (song names, place names, people's names).
Computer vision is an area that is completely new to me. I'm quite confused about the claim that "applications of computer vision range from reproducing human visual abilities, such as recognizing faces, to creating entirely new categories of visual abilities." What does "entirely new categories of visual abilities" mean? Is it something totally beyond human visual capabilities?
How might the advent of quantum computing impact the development and application of deep learning models in the areas outlined in the document, such as computer vision and speech recognition?
The method of creating corrupted versions of true facts to serve as negative examples is intriguing. How does the choice of entity for corruption impact the model's learning dynamics, and are there alternative or more sophisticated methods for generating such negative examples to improve the model's predictive accuracy?
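For reference, the basic recipe the book describes is to corrupt the head or tail entity of a true triple and filter out accidental true facts. A toy sketch with made-up entities:

```python
import random

entities = ["Paris", "France", "Tokyo", "Japan"]
true_triples = {("Paris", "capital_of", "France"),
                ("Tokyo", "capital_of", "Japan")}

def corrupt(triple, rng=random.Random(0)):
    """Replace the head or tail of a true (head, relation, tail) fact with a
    random entity to create a negative example, skipping accidental true facts."""
    h, r, t = triple
    while True:
        if rng.random() < 0.5:                  # corrupt the head...
            cand = (rng.choice(entities), r, t)
        else:                                   # ...or the tail
            cand = (h, r, rng.choice(entities))
        if cand not in true_triples:
            return cand

for triple in true_triples:
    print(triple, "->", corrupt(triple))
```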
I would like to know how differences in cultural context would influence LMMMs' recognition. For example, in different cultures, a similar pattern may carry completely different meanings.
With the ongoing development of GPU architectures, how might the strategies for optimizing deep learning computations on GPUs evolve, and what would this mean for the acceleration of model training and inference?
How might we explore the effects of different initialization techniques on convolutional networks' training speed and final performance, particularly when applied to datasets with highly irregular and complex patterns?
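One concrete piece of this question is how the initialization scale is chosen. A NumPy sketch comparing Xavier/Glorot and He/Kaiming scaling for a convolutional layer (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
in_channels, out_channels, k = 3, 64, 3
fan_in = in_channels * k * k
fan_out = out_channels * k * k

# Xavier/Glorot: variance scaled by the average of fan_in and fan_out,
# designed around roughly linear / tanh units.
w_xavier = rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)),
                      size=(out_channels, in_channels, k, k))

# He/Kaiming: variance scaled by fan_in only, with a factor of 2 to
# compensate for ReLU zeroing half of the activations.
w_he = rng.normal(0.0, np.sqrt(2.0 / fan_in),
                  size=(out_channels, in_channels, k, k))

print(w_xavier.std(), w_he.std())  # He init has the larger spread
```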
Is it possible to apply this strategy in the realm of video game development?
I think CNNs have been a huge breakthrough in the field of artificial intelligence. What could be some applications of CNNs in psychology?
One question I have always had is how to design a neural net architecture. I understand the function of individual layers, but I have trouble designing a complete network myself. For example, how many convolutional layers should come before pooling? I was wondering whether there is a specific cookbook or theory to follow, or is it more of a trial-and-error thing?
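It is largely empirical, but there are conventional recipes. A minimal PyTorch sketch of a common VGG-style "two convolutions, then pool" pattern, assuming 32x32 RGB inputs; the layer counts and widths are illustrative, not a rule:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                               # 32x32 -> 16x16
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                               # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 10),                     # 10-way classifier head
)
x = torch.randn(1, 3, 32, 32)
print(model(x).shape)  # torch.Size([1, 10])
```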
Honestly, from the readings alone, I cannot say I understood exactly how CNNs differ so much from GCN or LCN.
During data augmentation in CV, are the transformed images visually cohesive (taking a single frame and producing versions of the same objects from new but realistic angles), or are they realistic in keeping the central objects intact while changing the context or location of the object?
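In standard pipelines the transforms are simpler than either option: small geometric and photometric changes (flips, crops, slight rotations) that keep the object intact rather than synthesizing new realistic viewpoints. A minimal NumPy sketch with illustrative names (real pipelines, e.g. torchvision transforms, offer many more options):

```python
import numpy as np

def augment(image, rng=np.random.default_rng(0)):
    """image: (H, W, C) array. Returns one randomly transformed copy.

    Only viewpoint-like changes (mirror, small shift) are applied, so the
    object and its class label are unchanged."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                     # horizontal flip
    # pad a few pixels, then take a randomly shifted crop of the original size
    dy, dx = rng.integers(0, 5, size=2)
    out = np.pad(out, ((4, 4), (4, 4), (0, 0)), mode="edge")
    H, W, _ = image.shape
    return out[dy:dy + H, dx:dx + W, :]

img = np.zeros((32, 32, 3))
print(augment(img).shape)  # (32, 32, 3)
```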
How does the scale of neural networks impact their performance and the complexity of tasks they can solve? Additionally, what are some challenges associated with scaling up neural networks, and how are researchers addressing these challenges to push the boundaries of deep learning further?
Given that deep learning models now possess the complexity and size comparable to an insect's nervous system, yet exhibit significant capabilities in pattern recognition and decision-making, what are the potential ethical concerns and implications for social sciences in terms of privacy, autonomy, and decision-making in societal contexts where such models might be applied?
Given the depth and intricacy of the reading, I'm curious about the sophisticated strategies machine learning models, particularly those employing cutting-edge deep learning techniques, deploy to navigate the intricate challenge of link prediction within knowledge graphs. How do these advanced computational models bridge this gap, ensuring a comprehensive and accurate representation of connections in such an expansive and ever-evolving informational landscape? In other words, can they fill in the blank?
Post questions here for this week's fundamental readings:
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. Chapter 9, "Convolutional Networks." MIT Press: 326-366.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. Chapter 12.2, "Applications: Machine Vision." MIT Press: 447-453.