Computational-Content-Analysis-2020 / Readings-Responses-Spring

Repository for organizing orienting, exemplary, and fundamental readings, and posting responses.

Images, Art & Video - Fundamentals #42

Open HyunkuKwon opened 4 years ago

HyunkuKwon commented 4 years ago

Post questions here for one or more of our fundamentals readings:

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. *Deep Learning*. Chapter 9, "[Convolutional Networks](http://www.deeplearningbook.org/contents/convnets.html)." MIT Press: 326-366.

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. *Deep Learning*. Chapter 12.2, "[Applications: Machine Vision](http://www.deeplearningbook.org/contents/applications.html)." MIT Press: 447-453.

nwrim commented 4 years ago

> Most deep learning research on computer vision has focused not on such exotic applications that expand the realm of what is possible with imagery but rather on a small core of AI goals aimed at replicating human abilities. (pp. 447-448)

While I think this is very much true, I am having trouble imagining what a computer vision algorithm could do that does not dwell on the boundary of human abilities. (The authors mention that "vision is a task that is effortless for humans and many animals but challenging for computers" (p. 447)!) Can you give us some examples of applications of computer vision that could be "exotic"? In some sense, recognizing 30 images in under 0.5 seconds is well beyond human capacity, but I am talking about things humans cannot do even given an ample amount of time. Furthermore, social science is all about humans: what could such "alien" computer vision algorithms bring us in this epistemological space?

wanitchayap commented 4 years ago

In general, how do we choose which type of model to use for a given task? I know roughly that CNNs are better for computer vision, but when we do NLP tasks with different goals, how do we know when to choose a CNN over an RNN? Or when we need an LSTM, bi-LSTM, Transformers, etc.? Many computationally heavy papers I have read train many types of models, because the goal of those papers is to find which model performs best. In social science research, however, could we do something more grounded in motivating theory? Do you think it is still best practice to explore many models and use the one with the best performance, regardless of the type of research?
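One intuition that can guide the choice: a CNN treats a sentence as a 1-D sequence of embedding vectors and convolves filters over short n-gram windows, whereas an RNN consumes tokens in order and can carry long-range state. A minimal NumPy sketch of the convolutional view (the function name, toy "embeddings," and filter are all illustrative, not from the reading):

```python
import numpy as np

def conv1d_ngram_feature(embeddings, filt):
    """Slide a filter over every window of consecutive token
    embeddings (an n-gram), producing one activation per position;
    max-over-time pooling then keeps the strongest n-gram match,
    so the feature is largely position-invariant.
    embeddings: (num_tokens, dim); filt: (n, dim)."""
    n = filt.shape[0]
    activations = [float(np.sum(embeddings[i:i + n] * filt))
                   for i in range(len(embeddings) - n + 1)]
    return max(activations)

# Toy sentence of four 2-d "embeddings" and one bigram filter;
# the filter fires most strongly on the second window.
sent = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
bigram_filter = np.array([[0.0, 1.0], [1.0, 1.0]])
print(conv1d_ngram_feature(sent, bigram_filter))  # → 3.0
```

By contrast, an RNN or LSTM carries hidden state across the whole sequence, so it suits tasks where long-range order matters. Benchmarking several architectures is common, but the structure of the data can motivate the choice theoretically.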

Yilun0221 commented 4 years ago

I love image identification, and I think this chapter is related to that field. It introduces fantastic techniques like global/local contrast normalization, whitening, and sphering. My question is: are there specific cases where one of these methods is particularly helpful? That is, under which circumstances should we pick a certain method from this list to tackle a problem?
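For concreteness, global contrast normalization is simple enough to sketch. This minimal NumPy version follows the chapter's formula, with `s`, `lam`, and `eps` defaults chosen for illustration:

```python
import numpy as np

def global_contrast_normalization(x, s=1.0, lam=10.0, eps=1e-8):
    """Subtract the image's mean, then rescale so its overall
    contrast (root-mean-square deviation, regularized by lam)
    equals s; eps guards against division by zero on
    near-constant images."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    contrast = np.sqrt(lam + (centered ** 2).mean())
    return s * centered / max(eps, contrast)

img = np.array([[10.0, 20.0], [30.0, 40.0]])
out = global_contrast_normalization(img)
print(out.mean())  # → 0.0 (up to floating-point error)
```

Roughly, global contrast normalization helps when overall brightness and contrast vary across images but carry no label information (e.g., photos taken under different lighting), while local contrast normalization equalizes contrast within small neighborhoods, which highlights edges.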

linghui-wu commented 4 years ago

While amazed by the power of object recognition, I still want to raise concerns about privacy issues related to applications of computer vision. Deep-learning-based computer vision algorithms have made contributions such as monitoring social distancing in the workplace during the pandemic; however, the widespread adoption of computer vision technologies also poses threats to our daily lives. How, then, should we treat the intrusion into privacy brought about by technological progress? Apart from legislative approaches and the privacy-protecting algorithms under study, what else can we do?

timqzhang commented 4 years ago

For “Applications: Machine Vision”

This chapter provides a comprehensive introduction to image identification. In particular, in the section on dynamic structure, a method called a gater is introduced to select certain expert neural networks to compute the output. I wonder whether this method is similar to the ensemble methods we learned about before: the selection performed by an ensemble is to some degree quite similar to the gater, which keeps inappropriate experts "out of the gate." Is there any difference between the two, or is the "gater" a neural-network sub-category of the general ensemble method?
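For comparison, here is a minimal soft mixture-of-experts sketch (the function names and toy experts are illustrative): the gater scores each expert on the input, and the prediction is the score-weighted combination. A hard gater would evaluate only the top-scoring experts, which is where the computational savings come from.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def gated_output(x, experts, gate_weights):
    """Mixture-of-experts combination: the gater assigns each
    expert an input-dependent weight, then the experts'
    predictions are averaged with those weights."""
    weights = softmax(gate_weights @ x)       # one score per expert
    preds = np.array([f(x) for f in experts])
    return float(weights @ preds)

# Two toy "experts"; the gater strongly prefers the first one here.
experts = [lambda x: x.sum(), lambda x: x.mean()]
gate_weights = np.array([[5.0, 5.0], [0.0, 0.0]])
x = np.array([1.0, 2.0])
print(gated_output(x, experts, gate_weights))  # ≈ 3.0 (first expert)
```

The contrast with a plain ensemble: ensemble weights are fixed after training and every member is always evaluated, whereas the gater's weights depend on the input, and a hard gater skips unselected experts entirely.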

tianyueniu commented 4 years ago

For Applications: Machine Vision

It was mentioned in the text that "Evaluating the performance of a model on a link prediction task is difficult because we have only a dataset of positive examples (facts that are known to be true)." Other than those explained in the text, what are some other common methods we can use to bypass this limitation?
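One common workaround in the link-prediction literature is ranking-based evaluation with corrupted triples: replace the true tail entity of a known fact with every other entity, score all candidates, and report the rank of the true one, so no explicit negative labels are needed. A sketch with hypothetical names and a toy dot-product scorer:

```python
import numpy as np

def rank_true_tail(score, head, rel, true_tail, entities):
    """Score every candidate tail for (head, rel, ?) and return
    the 1-based rank of the known-true tail (lower is better).
    Metrics like mean rank or hits@10 aggregate these ranks."""
    scores = np.array([score(head, rel, t) for t in entities])
    order = np.argsort(scores)[::-1]          # best score first
    ranked = [entities[i] for i in order]
    return ranked.index(true_tail) + 1

# Toy entity embeddings and a translation-style scorer.
emb = {0: np.array([1.0, 0.0]),
       1: np.array([0.0, 1.0]),
       2: np.array([0.2, 0.2])}
rel = np.array([0.0, 0.5])
score = lambda h, r, t: float((emb[h] + r) @ emb[t])
print(rank_true_tail(score, 0, rel, 1, [0, 1, 2]))  # → 2
```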

WMhYang commented 4 years ago

Since I do not know much about computer vision, I am confused by the idea of reducing the amount of variation. From my point of view, variation can be important when we humans identify real objects; e.g., it helps us distinguish a candle from a light bulb. As a result, I can't really understand why removing contrast is safe. How do we define the degree of variation that can be safely removed?

DSharm commented 4 years ago

My question echoes the sentiment of @linghui-wu's question. In early 2020, a series of articles uncovered an investigation into Clearview AI, a computer vision company that sells its product and database to law enforcement agencies to help them identify individuals. When the news came out, there was an ongoing discussion of the idea that the technology itself wasn't that novel (in fact, an organization like Facebook could have built something similar long ago), but that creating a product for law enforcement use was a line even Facebook wasn't willing to cross.

How, then, should we think about making advances in this field, and improving our models to have the highest accuracy / quality, while knowing that doing so further endangers our privacy? Beyond legislative solutions, do researchers and computer scientists have a role to play in preventing such technology from being misused?

Article here: https://www.nytimes.com/2020/01/18/technology/clearview-privacy-facial-recognition.html

ihsiehchi commented 4 years ago

Application to film studies

Back in the glorious pre-quarantine days, I attended a lecture held by the film studies department at the Logan Center. The speaker discussed a "freeze-frame (forgot the technical name)" technique for analyzing films, which essentially uses selected frames, as opposed to clips, as the unit of analysis. I was wondering whether machine learning methods could help us find the frame with the most information within a defined neighborhood. Without audio analysis, or speech-to-text followed by text analysis, we would most likely have to stick to films in which the most important scenes occur through actions rather than words. It is hard to believe that an algorithm could accomplish the task by going through facial expressions alone, although in films such as Marriage Story, brilliant actors indeed have memorable facial expressions that might help us identify the moments when the climax of the film takes place.

jsgenan commented 4 years ago

It is genius to represent sound waves as figures in a deep learning model! However, the previous readings discussed training machines to recognize square patches in images. Can we talk a little more about how the two dimensions, time and frequency, are combined and processed in the model? Furthermore, is it possible to use the same philosophy to add one more dimension to other deep learning models?
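Concretely, the standard move is to turn the 1-D waveform into a 2-D spectrogram: slice it into overlapping windowed frames (the time axis) and take the FFT magnitude of each frame (the frequency axis), giving a time-by-frequency array that convolutional layers treat exactly like an image. A minimal sketch (frame sizes are illustrative):

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: overlapping Hann-windowed frames
    along time, real FFT along frequency. Output shape is
    (num_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

# One second of a 440 Hz tone sampled at 8 kHz: the energy
# concentrates in the frequency bin nearest 440 Hz in every frame.
sr = 8000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # → (61, 129)
```

As for adding dimensions: nothing in convolution is special to 2-D; 3-D convolutions over (time, height, width) are routinely used for video, at a corresponding computational cost.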

Lesopil commented 4 years ago

Re: the section about preprocessing and training machine vision models. What is done to prevent someone from tricking the model by adding noise to a picture of a cat to make it look like a dog to the computer, while it remains clearly recognizable as a cat to humans?
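This is exactly the adversarial-examples problem. The classic attack is the fast gradient sign method (FGSM): nudge every pixel a tiny step in the direction that increases the model's loss. The sketch below demonstrates it against a toy logistic "cat vs. dog" classifier (the model and all values are illustrative, not from the reading):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, w, b, y, eps=0.2):
    """FGSM for a logistic model p(dog|x) = sigmoid(w @ x + b):
    the gradient of the cross-entropy loss w.r.t. the input is
    (p - y) * w, and stepping eps in its sign direction moves
    every pixel slightly toward the wrong label."""
    p = sigmoid(w @ x + b)
    grad = (p - y) * w
    return x + eps * np.sign(grad)

rng = np.random.default_rng(0)
w = rng.normal(size=100)   # toy per-"pixel" weights
b = 0.0
x = 0.05 * w               # an input the model confidently calls "dog"
x_adv = fgsm_perturb(x, w, b, y=1.0)
print(sigmoid(w @ x + b), sigmoid(w @ x_adv + b))
```

Each pixel moves by at most `eps`, so the perturbed image still looks like the original to a human. The main defense, adversarial training, injects such perturbed examples into the training set, though robustness remains an open research problem.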

liu431 commented 4 years ago

For user-generated images on social media, are they open to being scraped by researchers? And will people be less willing to post images of themselves, or silly ones, if they realize they are being monitored and analyzed?

minminfly68 commented 4 years ago

This computer vision section might have excellent applications in the real world, like detecting human behavior. However, I am wondering how to address the problem that it might infringe on people's privacy. What can we do to protect that privacy? Thanks!

timhannifan commented 4 years ago

A unique feature of computer vision discussed in the reading is the ability to identify salient features by altering image contrast. My understanding is that this process helps distinguish important information from background noise. In a social context, could these techniques be used to identify changes in the zeitgeist through time? For example, could one not only classify Romanticist vs. Impressionist paintings, but also identify the salient features that distinguish the two periods?

bazirou commented 4 years ago

My question is also about the stability of embeddings: is there any common validation method used in empirical studies that can help us state that an embedding actually makes sense?