UChicago-Computational-Content-Analysis / Readings-Responses-2023


9. Images, Art & Video - fundamental #5

Open · JunsolKim opened this issue 2 years ago

JunsolKim commented 2 years ago

Post questions here for this week's fundamental readings: Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. Chapter 12.2 "Convolutional Networks." MIT Press: 326-366.

konratp commented 2 years ago

One thing I've found myself struggling with throughout this class (granted, in some weeks much more than others) is the question of whether employing certain computational analysis methods actually benefits researchers over relying on their own brains and judgement. For example, for some of the previous papers we have encountered, I thought that a qualitative analysis of a smaller subset of the data, conducted by a human rather than a computer, could have yielded similar if not better results.

The authors contend that "vision is a task that is effortless for humans and many animals but challenging for computers." Given the current state of these methods, and the amount of preprocessing and preparation it takes to get results, are we at a stage where computer vision gives researchers capabilities that humans can't achieve on their own? When is it better not to opt for a computer vision method and to use a human-centered method instead? Or is my framing of this as a dichotomy -- you can do either one or the other -- flawed, and should we simply combine the best of both worlds?

facundosuenzo commented 2 years ago

> One thing I've found myself struggling with throughout this class (granted, in some weeks much more than others) is the question of whether employing certain computational analysis methods actually benefits researchers over relying on their own brains and judgement. For example, for some of the previous papers we have encountered, I thought that a qualitative analysis of a smaller subset of the data, conducted by a human rather than a computer, could have yielded similar if not better results.
>
> The authors contend that "vision is a task that is effortless for humans and many animals but challenging for computers." Given the current state of these methods, and the amount of preprocessing and preparation it takes to get results, are we at a stage where computer vision gives researchers capabilities that humans can't achieve on their own? When is it better not to opt for a computer vision method and to use a human-centered method instead? Or is my framing of this as a dichotomy -- you can do either one or the other -- flawed, and should we simply combine the best of both worlds?

Yep, excellent question. My background is primarily in ethnography and qualitative methods, so I have struggled with the same question throughout this class. I also wondered whether these computer vision methods can be conceived in terms of abductive analysis, or whether we can think of a "computational grounded theory" and split the process into a first, human-centered phase and a second, machine-centered phase (to gain scale, among other benefits in explanatory power). A solution to this dichotomy could be found in the approach taken by Nelson (2021), who combines different epistemologies to frame her research on feminist social movements.

GabeNicholson commented 2 years ago

Given the large computing component of these models, can someone explain how TPUs work and what their benefits are over GPUs? I think James mentioned in class that one of his postdocs was looking into getting one for his own project. I don't think I'd heard of them until then, and I also noticed Google Colab offers them as a computing option.

ValAlvernUChic commented 2 years ago

I asked this in the first week, but it seems most relevant here: "While this assigned paper was largely focused on text-based data, I couldn't help but think about memes as a communicative platform (in some of my circles it's the main communication method, unfortunately). Intuitively, negotiating textual cues with the often contextually saturated images they caption would seem extremely difficult. At the same time, in a digital age when these media increasingly dominate political and social discourse, they seem incredibly important to theorize. A paper by Dimitrov et al., "Detecting Propaganda Techniques in Memes," did something interesting, but they still relied on manual annotations of memes before using them. I was wondering about 1) the advances in methodological approaches in this multi-modal space -- specifically whether anyone has had success with unsupervised approaches -- and 2) how we might even approach this if the context is so latent. I know why Will Smith is crying in that photo and the type of text most appropriate for it, but how can we get the computer to know too?"

Basically the same question: are there methodologies that can integrate the text in memes with the latent context found in the images?
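One hedged illustration, not from the reading: joint image-text models such as CLIP embed a meme's image and its caption in the same vector space, which gives one handle on relating the text to the latent visual context without manual annotation. The sketch below assumes the Hugging Face transformers library and a hypothetical file meme.jpg.

```python
# A minimal sketch (an illustration, not the Dimitrov et al. pipeline): score how
# well candidate captions match a meme image in CLIP's shared embedding space.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("meme.jpg")  # hypothetical file
captions = ["a sad reaction to losing something", "celebrating a victory"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores: higher means the caption sits closer to the
# image in the shared embedding space.
print(outputs.logits_per_image.softmax(dim=-1))
```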

ValAlvernUChic commented 2 years ago

> Given the large computing component of these models, can someone explain how TPUs work and what their benefits are over GPUs? I think James mentioned in class that one of his postdocs was looking into getting one for his own project. I don't think I'd heard of them until then, and I also noticed Google Colab offers them as a computing option.

I can't tell you exactly how they work, but as for whether they're better or not... I tried using them once and they took just as long as the GPU to process my data. I was likely using them wrong, but given that the task was processing tensors, I can't imagine why!
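One hedged guess at why that happened, sketched under the assumption of TensorFlow 2.x on Colab: the TPU is only exercised if the model is built and compiled inside a TPUStrategy scope; otherwise the code quietly runs on the host, and the TPU runtime feels no faster than the GPU one.

```python
# A minimal sketch of actually routing work to a Colab TPU with TensorFlow 2.x.
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")  # Colab auto-detects
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    # Anything built here is replicated across the TPU cores; models built
    # outside this scope run on the CPU/GPU host instead.
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```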

pranathiiyer commented 2 years ago

I'm not sure this question resonates completely with the readings, but I think it makes sense for the week. A lot of social science studies, especially in psychology, still depend on manually transcribing video and audio data for research. Given how computationally intensive automated alternatives can be, how do we see them being adopted by independent researchers and organisations?
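One hedged data point on how low the barrier has become (my own illustration, not something from the reading): open-source speech-to-text models such as Whisper transcribe audio in a few lines, and the smaller checkpoints run on a laptop CPU. interview.mp3 is a hypothetical file.

```python
# A minimal transcription sketch with the openai-whisper package
# (pip install openai-whisper); model size and file name are assumptions.
import whisper

model = whisper.load_model("base")           # small model, runs on CPU
result = model.transcribe("interview.mp3")   # returns text plus timestamped segments
print(result["text"])
```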

sudhamshow commented 2 years ago

I tried searching for answers online, but I couldn't find one that explains this intuitively: how is 'attention' handled for computer vision tasks? It makes sense when thinking about a sentence and the context around a word, but how do we translate that analogy to computer vision? Vec2Seq attention models use this to generate captions or annotate an image, but how is it done? Is this still a context-aware model? Is the attention around a pixel? A max pool? How does the model learn what to focus on?
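A bare-bones sketch of one common answer (the Vision Transformer idea, which goes beyond this chapter): the image is cut into patches, each patch gets an embedding, and attention is computed between patch embeddings rather than around individual pixels, so "context" means the other regions of the image. The weights below are random; a trained model learns what to focus on by learning the projection matrices.

```python
# Self-attention over image patches in plain numpy: each patch attends to
# every other patch, producing context-aware patch representations.
import numpy as np

rng = np.random.default_rng(0)
num_patches, d = 16, 32                     # e.g. a 4x4 grid of patch embeddings
patches = rng.normal(size=(num_patches, d))

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))   # learned in practice
Q, K, V = patches @ Wq, patches @ Wk, patches @ Wv

scores = Q @ K.T / np.sqrt(d)               # how much each patch "looks at" the others
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
attended = weights @ V                      # (num_patches, d) context-aware outputs
```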

NaiyuJ commented 2 years ago

What kinds of social science questions are especially well suited to computer vision? I was trying to think of examples where CV could help social scientists improve their research, but it also seems that as computational methods evolve, the research questions we can come up with evolve as well.

isaduan commented 2 years ago

How much does computer vision add to text analysis, balanced against the much higher compute it requires? What is unique about vision - and important for social life - that is missing from text?

Jiayu-Kang commented 2 years ago

Given our knowledge of the techniques used to detect features, what are some techniques that take advantage of that knowledge to protect privacy? How do those techniques then affect the development of these models?

YileC928 commented 2 years ago

I would appreciate being directed to some hands-on parallel computing tutorials for deep learning.

Qiuyu-Li commented 2 years ago

> Given the large computing component of these models, can someone explain how TPUs work and what their benefits are over GPUs? I think James mentioned in class that one of his postdocs was looking into getting one for his own project. I don't think I'd heard of them until then, and I also noticed Google Colab offers them as a computing option.

I'm also curious about Gabe's question. There's a TPU option on Google Colab, and I've always wondered what I can do with it.

LuZhang0128 commented 2 years ago

I believe my question here is: how much more can we get out of videos and speech? There are always trade-offs in the computational world, just like with the GPT-2 model we were training last week. Video, images, and recordings require a lot more computational resources than text, which we as students normally can't access. The same amount of resources might allow scholars to conduct several studies using text. I thus wonder what the costs and benefits of doing so are.

Jasmine97Huang commented 2 years ago

I am interested in learning more about how deep networks perform latent feature representation in computer vision and language modeling.

hshi420 commented 2 years ago

Could we automatically annotate the visual contents and use the annotations along with the visual features together for analysis? Would this method make any sense?
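One hedged way to sketch this, assuming torchvision 0.13+ and a hypothetical photo.jpg: a pretrained classifier supplies both an automatic annotation (its top predicted label) and a dense feature vector from its penultimate layer, and the two can then be analyzed side by side.

```python
# Auto-annotation plus visual features from a single pretrained ResNet-50.
import torch
from PIL import Image
from torchvision import models

weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()

img = preprocess(Image.open("photo.jpg").convert("RGB")).unsqueeze(0)  # hypothetical file

with torch.no_grad():
    logits = model(img)                                        # for the annotation
    backbone = torch.nn.Sequential(*list(model.children())[:-1])
    features = backbone(img).flatten(1)                        # (1, 2048) visual features

label = weights.meta["categories"][logits.argmax(dim=1).item()]
print(label, features.shape)
```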

chentian418 commented 2 years ago

I am interested in the process of speech recognition: first, an HMM generates a sequence of phonemes and discrete sub-phonemic states (such as the beginning, middle, and end of each phoneme); then a GMM transforms each discrete symbol into a brief segment of audio waveform. Can you clarify how training is involved in both the HMM and the GMM, and whether there is a pre-training and fine-tuning process?
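A toy generative sketch of the pipeline described above (not a trained recognizer): the HMM samples a left-to-right sequence of sub-phonemic states and each state's Gaussian emits a short acoustic frame. In classical systems both pieces are fit with EM-style algorithms (Baum-Welch for the HMM parameters, EM for the Gaussian mixtures); as far as I know there is no pre-training/fine-tuning split in the BERT sense, only this joint training plus decoding.

```python
# Sampling from a tiny HMM with Gaussian emissions, mirroring the generative
# story: states = begin/middle/end of a phoneme, frames = MFCC-like vectors.
import numpy as np

rng = np.random.default_rng(1)
n_states, frame_dim, T = 3, 13, 20
trans = np.array([[0.7, 0.3, 0.0],      # left-to-right transition matrix
                  [0.0, 0.7, 0.3],
                  [0.0, 0.0, 1.0]])
means = rng.normal(size=(n_states, frame_dim))   # one Gaussian mean per state

state, states, frames = 0, [], []
for _ in range(T):
    states.append(state)
    frames.append(rng.normal(loc=means[state]))      # Gaussian emission
    state = rng.choice(n_states, p=trans[state])     # HMM transition
```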

Hongkai040 commented 2 years ago

Can someone explain how GCN (global contrast normalization) maps examples onto a sphere? The equation is not very intuitive...
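One way to see it, sketched below: GCN subtracts each image's mean and rescales the result to a fixed scale s, so with lambda = 0 every image ends up with the same root-mean-square pixel value, i.e., at the same distance from the origin in pixel space. Points at a fixed distance from the origin lie on a sphere.

```python
# Global contrast normalization on flattened images; after normalization every
# image has (approximately) the same norm, i.e., lies on a sphere.
import numpy as np

def global_contrast_normalize(X, s=1.0, lam=0.0, eps=1e-8):
    # X: (n_images, n_pixels) flattened images
    X = X - X.mean(axis=1, keepdims=True)                      # zero mean per image
    scale = np.sqrt(lam + (X ** 2).mean(axis=1, keepdims=True))
    return s * X / np.maximum(scale, eps)

X = np.random.default_rng(2).normal(size=(5, 32 * 32 * 3))
Xn = global_contrast_normalize(X)
print(np.sqrt((Xn ** 2).mean(axis=1)))   # all equal to s when lam = 0
```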

Emily-fyeh commented 2 years ago

I would like to know more about how to tell whether deep learning algorithms are really learning/solving tasks in a reasonable, or 'human-like,' way. There is a lot of noise, and there are even adversarial attacks, that may create an illusion of model efficacy.

hsinkengling commented 2 years ago

When imagining research using visual/audio data, I feel like there are just so many different types of data we could find. Similar to how different classes of texts (e.g., news, comments, lyrics, utterances) might need different kinds of models, I wonder how we could incorporate what we know about the characteristics of the data into the models themselves. Another way to put this: generally, how mature and specialized are these technologies with regard to their leverage for specialized social research?

ttsujikawa commented 2 years ago

I wonder how deep learning algorithms can overcome large differences in English accents across communities. Moreover, in a diverse setting, would auto-transcription technologies become useless?

sizhenf commented 2 years ago

I'm curious about why k-NN models are used rather than other models.

melody1126 commented 2 years ago

For image analysis, there seem to be two levels of contrast normalization -- global and local. For most image analysis, it seems helpful to have both global and local contrast normalization, and we can use one method to validate the results of the other. Are there any circumstances where the two methods would produce contrasting results?
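One such circumstance, sketched below under the assumption that scipy is available: global normalization uses a single mean and standard deviation for the whole image, while local normalization recomputes them per neighborhood, so an image containing one bright region and one dark region with identical texture looks very different under the two schemes (globally the bright half stays brighter; locally the two halves become nearly indistinguishable).

```python
# Global vs. local contrast normalization on a synthetic image whose two
# halves share the same texture but differ in brightness.
import numpy as np
from scipy.ndimage import uniform_filter

def local_contrast_normalize(img, size=9, eps=1e-8):
    local_mean = uniform_filter(img, size=size)
    local_sq = uniform_filter(img ** 2, size=size)
    local_std = np.sqrt(np.maximum(local_sq - local_mean ** 2, 0.0))
    return (img - local_mean) / (local_std + eps)

img = np.random.default_rng(3).normal(size=(64, 64))
img[:, :32] += 5.0                        # make the left half "bright"

gcn = (img - img.mean()) / img.std()      # global: left half still stands out
lcn = local_contrast_normalize(img)       # local: both halves look alike
```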

chuqingzhao commented 2 years ago

What kinds of social science questions are excellent fits for deep learning rather than traditional machine learning? For NLP, when should people use BERT or contextual embeddings instead of a word2vec approach?
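Not a full answer, but a sketch of the mechanical difference, assuming the Hugging Face transformers library is available: word2vec assigns one fixed vector per word type, while BERT produces a different vector for each occurrence, so polysemous words like "bank" separate by context. That matters most when meaning-in-context (framing, irony, sense disambiguation) is the quantity of interest.

```python
# Contextual vectors for the same word in two sentences; a word2vec model
# would return an identical vector both times.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

sents = ["She sat by the river bank.", "He deposited cash at the bank."]
vecs = []
for s in sents:
    enc = tok(s, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state[0]        # (tokens, 768)
    idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids("bank"))
    vecs.append(hidden[idx])

print(torch.cosine_similarity(vecs[0], vecs[1], dim=0))  # < 1: context-dependent
```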