HyunkuKwon opened this issue 3 years ago
Goodfellow and colleagues discuss global and local contrast normalization in their chapter. I'm wondering whether the paper "Deep Neural Networks Are More Accurate Than Humans at Detecting Sexual Orientation From Facial Images" by Michal Kosinski and Yilun Wang is an application of this concept of normalization.
In Kosinski and Wang's paper, the Euclidean distances between facial landmarks were normalized to account for the differing sizes of the faces in the images.
Is the normalization Kosinski and Wang use the same thing Goodfellow et al. write about in their chapter? I'm also not sure whether the purpose of normalization, in Goodfellow's sense, is to make different metrics/units comparable. Can you please explain more about this?
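As a point of comparison, here is a minimal NumPy sketch of what size normalization of landmark distances might look like, next to the global contrast normalization of pixel intensities that Goodfellow et al. describe. The function names and the choice of face-size normalizer are illustrative assumptions, not Kosinski and Wang's exact procedure.

```python
import numpy as np

def normalized_landmark_distances(landmarks):
    """landmarks: (n_points, 2) array of (x, y) facial landmark coordinates.

    Sketch only: normalize pairwise distances by an overall face-size measure
    so that faces photographed at different scales become comparable.
    """
    diffs = landmarks[:, None, :] - landmarks[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))   # pairwise Euclidean distances
    face_size = dists.max()                 # e.g., the largest landmark distance
    return dists / face_size                # scale-free distances in [0, 1]

def global_contrast_normalization(image, eps=1e-8):
    """Goodfellow et al.'s sense of normalization: standardize pixel intensities
    so that images (or features) are on a comparable scale."""
    image = image - image.mean()
    return image / (image.std() + eps)
```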
When we train CNNs to identify images, should we seriously consider striding and padding?
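For intuition, here is a small PyTorch check (a sketch with arbitrary sizes) of how stride and padding change a convolution's output shape, which is usually the main practical reason to care about them:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)  # one RGB image, 32x32 pixels

# No padding: the 3x3 kernel cannot be centered on border pixels, so the output shrinks.
print(nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=0)(x).shape)  # [1, 16, 30, 30]

# "Same" padding keeps the spatial size.
print(nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)(x).shape)  # [1, 16, 32, 32]

# Stride 2 downsamples by roughly a factor of two in each dimension.
print(nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)(x).shape)  # [1, 16, 16, 16]
```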
Why aren’t convolutions used much in text classification like BERT, GPT-2, or GPT-3 (yet)? e.g. SqueezeBERT
The first couple of sections in the deep learning book chapter are mainly about different hardware implementations for training and deploying deep neural networks. Are there any resources you could share with us to learn more about parallel computation and efficient algorithm design for optimizing our models?
How similar are the methods employed in computer vision to the methods employed in audio analysis? It seems to me that the two would be quite different, as soundwaves do not seem analogous to pixels. I'd be interested to hear how these two applications of deep neural networks have co-evolved over time.
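One common bridge between the two is the spectrogram: the waveform is converted into a 2-D time-frequency representation that a CNN can then process much like an image. A minimal sketch with synthetic audio (using SciPy):

```python
import numpy as np
from scipy import signal

# A 1-D waveform becomes a 2-D time-frequency "image" that a CNN can treat like pixels.
fs = 16_000                              # sampling rate in Hz
t = np.arange(0, 2.0, 1 / fs)            # two seconds of audio
waveform = np.sin(2 * np.pi * 440 * t)   # a 440 Hz tone as a stand-in for speech

freqs, times, spec = signal.spectrogram(waveform, fs=fs, nperseg=512)
print(spec.shape)   # (frequency bins, time frames) -- effectively a grayscale image
```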
Could you please give us some examples of applying these methods to inequality topics?
When reading the chapter, I kept thinking that dealing with images is similar to dealing with matrices, the main difference being that the rows and columns of the matrix correspond to pixels, and each entry becomes a set of numbers such as RGB values, contrast, or other computer-vision-specific quantities. Is this true? I feel like all models in content analysis ultimately use numbers to represent everything, and images and videos are among the more complicated cases. Does this mean that dealing with images and videos needs more computing power and is more computationally expensive?
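That intuition is roughly right: an RGB image is a height x width x 3 array of numbers. A toy example (with a randomly generated image) that also gives a back-of-envelope sense of the raw size involved:

```python
import numpy as np

# An RGB image really is just a 3-D array of numbers:
# height x width x 3 channels (red, green, blue), each entry 0-255.
image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
print(image.shape)   # (224, 224, 3)
print(image[0, 0])   # one pixel, e.g. [123  42 250]

# Raw size hints at why images are heavier than text:
print(image.size)    # 150,528 numbers for a single small image,
                     # versus a bag-of-words vector with a few thousand nonzero counts.
```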
Are there methods that produce contextual embeddings for images, similar to those produced by language models like BERT? E.g., embedding an image in the context of a series of images, or even embedding a particular object in an image with respect to the other objects in the image? This seems like it could be highly useful for content analysis, where relationships between entities are often very important.
@k-partha There's this model called PiCANet, which learns to map the pixels of salient objects in an image (they call it pixel-wise contextual attention). I think this might be related to your second idea. Hope it sounds interesting to you!
My question: I'm very interested in the applications of transfer learning in images. What are some fun image projects that make use of pre-trained models (like VGG16)? I know that there are artist identification tasks (for paintings). Are there social science related ones?
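As one concrete illustration, here is a minimal transfer-learning sketch: freeze a pretrained VGG16's convolutional features and retrain only a small classification head for a new task. The example task and the label count are made up, and the weights argument assumes a recent torchvision.

```python
import torch.nn as nn
from torchvision import models

# Reuse VGG16's ImageNet features; retrain only a small classification head,
# e.g. for a hypothetical task like labeling protest vs. non-protest photos.
model = models.vgg16(weights="IMAGENET1K_V1")

for param in model.features.parameters():
    param.requires_grad = False   # freeze the pretrained convolutional layers

num_classes = 2                   # made-up number of labels for the new task
model.classifier[6] = nn.Linear(4096, num_classes)   # replace the final ImageNet layer

# From here, train only the classifier on the new (much smaller) labeled dataset.
```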
I'm also interested in how voice and images may be combined for a more comprehensive analysis of video resources. Would this be computationally demanding? Are there any mature methods for dealing with videos?
I am also curious whether visual analysis can deal with context, and with relationships among the smaller elements of an image. In text analysis we can break documents into sentences and sentences into words; is there a similar decomposition in visual analysis?
To what degree are the NLP techniques we've discussed in class implemented in audio analysis (where audio is transcribed and analyzed as "written text")?
To what degree can transformers help with computer vision? I think they have already revolutionized NLP. I am wondering whether introducing the attention mechanism can better capture the underlying structure or ideas behind an image.
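Transformers have since been applied directly to images: the Vision Transformer (ViT) approach cuts an image into patches, treats each patch as a token, and lets self-attention relate the patches to one another. A rough sketch of that idea, with arbitrary sizes and assuming a reasonably recent PyTorch:

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)
patch_size, embed_dim = 16, 256

# A strided convolution is a convenient way to split and embed patches at once.
patchify = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
patches = patchify(image).flatten(2).transpose(1, 2)   # (1, 196, 256): 14x14 patch tokens

encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
contextual_patches = encoder(patches)    # each patch now attends to all the others
print(contextual_patches.shape)          # (1, 196, 256)
```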
To me, the most exciting part of ConvNN is its connection with vision and neuroscience. I wonder if there will be more NN models that can reverse-engineer various sensory/cognitive systems that we know so much about.
Just as we identify the context of words in texts, I wonder whether, in image recognition, we can also detect the context of an object from its surrounding environment.
Echoing Partha's question, I would also like to hear more about efforts to embed images, audio, and even video. Would it be possible, for example, to create an embedding for a movie and then find similar movies? Or songs?
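Once each item is reduced to an embedding vector (for instance by averaging frame- or clip-level features from a pretrained network), "find similar movies" becomes a nearest-neighbor search. A sketch with random stand-in embeddings:

```python
import numpy as np

# The embeddings below are random placeholders for real movie/song/image vectors.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1_000, 512))           # 1,000 items, 512-dim vectors
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

query = embeddings[42]                               # "movies similar to movie #42"
similarities = embeddings @ query                    # cosine similarity after normalization
print(np.argsort(-similarities)[:6])                 # the item itself plus its 5 nearest neighbors
```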
To what extent is computer vision through deep learning a black box? When using these techniques, in what ways are we able to understand the reasons for image classifications?
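One common way to peek inside the box is a gradient-based saliency map, which shows which input pixels most influence the predicted class score. A sketch (using a pretrained ResNet-18, with random noise standing in for a real photo):

```python
import torch
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1").eval()

image = torch.randn(1, 3, 224, 224, requires_grad=True)
scores = model(image)
scores[0, scores.argmax()].backward()            # gradient of the top class w.r.t. the input

saliency = image.grad.abs().max(dim=1).values    # (1, 224, 224) pixel-importance map
print(saliency.shape)
```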
Some images, e.g., facial images, are more "structured" than others. How useful, then, are computational methods for analyzing less structured images?
How computationally expensive are audio and image processing techniques using neural nets as opposed to NLP using NNs?
For user-generated images on social media, are they open to scraping by researchers? Will people be less willing to post images of themselves, or silly ones, if they realize they are being monitored and analyzed?
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. Chapter 9, "Convolutional Networks." MIT Press: 326-366.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. Chapter 12.2, "Applications: Machine Vision." MIT Press: 447-453.