Computational-Content-Analysis-2020 / Readings-Responses

Repository for organising "exemplary" readings and posting responses.

Images, Art & Video - Goodfellow, Bengio & Courville 2016 #49

Open jamesallenevans opened 4 years ago

jamesallenevans commented 4 years ago

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. Chapter 12.2, "Applications: Machine Vision."

lkcao commented 4 years ago

This section introduces contrast normalization, in both its global and local forms, as a preprocessing step for images. I am wondering what negative effects there may be if we skip this step (i.e., do not normalize the contrast in the image)? And do black-and-white pictures need this normalization?
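For reference, the global version is simple enough to sketch in a few lines of numpy; this follows the chapter's s/λ/ε parameterization, with illustrative parameter values:

```python
import numpy as np

# Minimal global contrast normalization sketch (illustrative parameters).
# Subtract the mean intensity, then rescale so every image has roughly
# the same overall contrast; lam regularizes near-blank images and
# eps prevents division by zero.
def global_contrast_normalize(img, s=1.0, lam=10.0, eps=1e-8):
    img = img - img.mean()
    contrast = np.sqrt(lam + (img ** 2).mean())
    return s * img / max(contrast, eps)
```

Since it operates on raw intensity values, nothing in the formula assumes color input, so it applies to grayscale images as well.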

ckoerner648 commented 4 years ago

The authors describe a computer vision application that recovers sound waves from the vibrations they induce in objects visible in a video, but note that "most deep learning research on computer vision has focused not on such exotic applications that expand the realm of what is possible with imagery but rather on a small core of AI goals aimed at replicating human abilities" (pp. 447-8). While I admit that AI-generated captions for images, like those reported in a different reading (LeCun et al. 2015, p. 440), are impressive, I think that the potential for academic progress is often greater when we choose exotic applications. We could, for example, use AI on video recorded by surveillance cameras in bars to identify the patterns that precede violence. This would be a strong empirical test of Collins' theory of violence.

katykoenig commented 4 years ago

The computer vision section describes the mathematical differences between global contrast normalization, sphering/whitening, and local contrast normalization, but when in practice would we use each of these? What are some good examples of these processes in a machine learning application?

rkcatipon commented 4 years ago

@katykoenig I think these techniques are used to clarify images or to detect features, such as a blurry figure in the background of a video. Given the growing adoption of surveillance technology, I wonder what can be done to counteract computer vision, and what technology could be produced to protect privacy?

luxin-tian commented 4 years ago

When Ch. 9.1 characterizes the convolution operation, the authors introduce a weighted sum (or integral, in the continuous case) of the values along all the axes. It seems that the summation (integration) is performed over all the dimensions; that is, if the input has two dimensions m and n, the domain of summation (integration) would be R^2. However, I wonder whether in practice we always "convolve" over all the dimensions. I can imagine doing so would capture more information for feature extraction, but is it possible, and in what cases might we use only a subset of the dimensions in the space?

arun-131293 commented 4 years ago

> When Ch. 9.1 characterizes the convolution operation, the authors introduce a weighted sum (or integral, in the continuous case) of the values along all the axes. It seems that the summation (integration) is performed over all the dimensions; that is, if the input has two dimensions m and n, the domain of summation (integration) would be R^2. However, I wonder whether in practice we always "convolve" over all the dimensions. I can imagine doing so would capture more information for feature extraction, but is it possible, and in what cases might we use only a subset of the dimensions in the space?

As you may have noticed, convolution decreases the feature size relative to the input image. The reduced-size output of one convolution is the input to the next convolutional layer, which reduces the dimensions further. In other words, the purpose of the earlier convolutions is not only to capture more abstract features useful for the downstream task at hand (object detection, violence detection, etc.) but also to reduce the dimensionality of the space to begin with. Since convolutional layers successively shrink the representation, dimensionality reduction is part of the idea behind convolution itself. This is one of the main reasons convolutions work well for images, which, as you imply, usually have high dimensionality.
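A quick way to see the size reduction concretely (scipy's "valid" convolution mode, illustrative sizes):

```python
import numpy as np
from scipy.signal import convolve2d

# A "valid" 2-D convolution shrinks an m x n input to
# (m - k + 1) x (n - k + 1) for a k x k kernel, so stacked
# layers progressively reduce the spatial extent.
img = np.random.rand(32, 32)
kernel = np.random.rand(5, 5)
out1 = convolve2d(img, kernel, mode="valid")   # shape (28, 28)
out2 = convolve2d(out1, kernel, mode="valid")  # shape (24, 24)
print(out1.shape, out2.shape)
```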

Having said that, there could be heuristics for ignoring parts of images depending on the task at hand, but it is not as important as it would be for regular neural nets.

arun-131293 commented 4 years ago

The limits of image processing using the purely statistical approaches explored here are now well known to arise mainly because pure statistical approaches do not reflect human visual processing, even though, as @ckoerner648 points out, the authors correctly say that AI goals are aimed "at replicating human abilities."

For instance, NNs that perform object recognition are bad at recognizing toasters with fruit-company logos on them and instead predict them to be bananas, as outlined in the article above. Is it possible to incorporate into neural nets the "human knowledge" that an object has a structure, of which some parts are non-essential for recognition (like fruit-company logo stickers) and some are essential (like the existence of a stem or certain colors for a banana), along with other aspects of our knowledge?

laurenjli commented 4 years ago

The authors mention a technique where you can use "a neural network called the gater to select which one out of several expert networks will be used to compute the output, given the current input." This sounds similar to the ensemble models that we discussed earlier in the quarter. If this analogy is correct, are gaters as commonly used as ensemble models? Why or why not?
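To make the analogy concrete, here is a toy sketch of the gating idea (hypothetical linear experts and shapes, not code from the book); unlike a fixed-weight ensemble, the combination weights depend on the current input:

```python
import numpy as np

# Toy mixture-of-experts: a gating network computes softmax weights
# over experts for each input, then mixes the experts' outputs.
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, k = 8, 3                                       # input dim, number of experts
W_gate = rng.normal(size=(k, d))                  # gating network parameters
experts = [rng.normal(size=d) for _ in range(k)]  # toy linear experts

x = rng.normal(size=d)
g = softmax(W_gate @ x)                           # input-dependent weights
y = sum(g[i] * (experts[i] @ x) for i in range(k))
print(g, y)
```

A hard gater, as the quoted passage describes, would instead pick the argmax so that only one expert network is evaluated, which is what saves computation.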

ccsuehara commented 4 years ago

Hi, I wanted to know how augmenting the dataset by adding extra copies with transformations works. Does precision/accuracy improve? Since the images are (after a series of steps) translated into matrices, can linear dependence be bypassed? How?
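To make the question concrete, here is a minimal sketch of the kind of label-preserving transformations the chapter has in mind (my illustrative choices):

```python
import numpy as np

# Each transformed copy is a genuinely different matrix, but its label
# is unchanged, so the model sees more variation per class; the chapter
# frames this as reducing generalization error.
def augment(img):
    rng = np.random.default_rng()
    yield img
    yield np.fliplr(img)                                # horizontal flip
    yield np.rot90(img)                                 # 90-degree rotation
    yield img + rng.normal(scale=0.01, size=img.shape)  # small pixel noise
```

On linear dependence: a flipped or rotated copy, flattened to a vector, is a permutation of the original's entries rather than a scalar multiple of it, so the augmented samples are generally not linearly dependent on the originals.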

jsmono commented 4 years ago

I'm still a little confused about how they accomplish efficiency of edge detection. What is the algorithm actually computing in this case, and what determines the edge of a picture? Based on the figure, it seems like the value of the color is key to the algorithm, but does that mean the algorithm will fail if a picture has several colors with the same value? Or can this only be applied to black-and-white pictures?
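For what it's worth, the book's example can be sketched directly; the [-1, 1] kernel takes differences between horizontally adjacent intensities, so what the algorithm "counts" is the local change in brightness (the toy image here is illustrative):

```python
import numpy as np
from scipy.signal import convolve2d

# The output is large wherever intensity changes sharply between
# neighboring pixels. The kernel operates on intensity values, so it
# applies to grayscale directly and to color images channel by channel.
img = np.zeros((8, 8))
img[:, 4:] = 1.0                              # a vertical dark/light edge
edges = convolve2d(img, np.array([[-1, 1]]), mode="valid")
print(np.abs(edges).max(axis=0))              # nonzero only at the edge
```

So two different colors with the same intensity would indeed produce no response in a single channel; edge detectors on color images typically combine responses across channels.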

sunying2018 commented 4 years ago

I am interested in the recommender systems part. As mentioned in the chapter, collaborative filtering systems have a basic limitation: "cold-start recommendations." One possible solution is to introduce extra information about the individual users and items. But since there is a huge number of features describing any one individual, how can we evaluate the usefulness of these features across different recommendation tasks? For movie recommendation versus shopping recommendation, for instance, we would have a different focus in feature selection.
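One concrete version of the extra-information idea (a hypothetical content-based fallback, not code from the chapter):

```python
import numpy as np

# For a brand-new item with no ratings, score users by the cosine
# similarity between their taste profiles and the item's metadata
# features, instead of relying on collaborative filtering.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

rng = np.random.default_rng(0)
user_profiles = rng.random((100, 20))   # e.g. averaged genre preferences
new_item = rng.random(20)               # item metadata features
scores = np.array([cosine(u, new_item) for u in user_profiles])
print(scores.argsort()[-5:])            # the 5 best-matched users
```

Which features actually matter would then be an empirical, per-task question, e.g. answered by ablating feature groups and comparing held-out recommendation accuracy.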

alakira commented 4 years ago

I wonder how, or whether, we could do dataset augmentation on text. In image recognition, we usually rotate or flip the pictures to augment the sample, but could we do something similar with text, or is it worthless?
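People do try it; a sketch of two common heuristics (hypothetical choices, not from the chapter) also shows why text is harder than images:

```python
import random

# Random word dropout and word-order swaps. Unlike an image flip,
# these are not guaranteed to preserve meaning, which is the core
# difficulty of augmenting text.
def augment_text(tokens, p_drop=0.1, n_swaps=1):
    out = [t for t in tokens if random.random() > p_drop]
    for _ in range(n_swaps):
        if len(out) > 1:
            i, j = random.sample(range(len(out)), 2)
            out[i], out[j] = out[j], out[i]
    return out

print(augment_text("the cat sat on the mat".split()))
```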

heathercchen commented 4 years ago

I am a little confused about what "dynamic structure" means as mentioned on p. 443. What are "dynamic structures" like, and what is their inherent "structure"? Does it take the form of certain inputs and corresponding deep learning methods?

deblnia commented 4 years ago

I would echo @rkcatipon's privacy concern but reframe the question. I don't think the onus should be on us (as individuals, or as consumers) to protect our own privacy in the ad-hoc way technology encourages. I know there have been movements (like notechforICE, techwon'tbuildit, neveragain.tech) of researchers and industry professionals refusing to contribute to a field like CV because of ethical concerns. As a historical question, to what extent are these movements worth studying? Is technological "progress" inevitable?

skanthan95 commented 4 years ago

From this reading, I understand that CNNs work by breaking up images into overlapping, square-shaped patches of pixels and then summarizing the information contained in each patch. Sorry if this was already clarified, but how do CNNs relate to the LoG, DoG, and DoH blob-detection techniques we've been working with in the week 9 notebook? Are these completely distinct approaches to the same goal (edge detection)? What are the overlaps and contrasts?
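One way to frame the comparison: LoG and DoG are fixed, hand-designed convolution kernels, while a CNN learns its kernel values from data; both are convolutions over the image. A sketch with illustrative sigmas:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, gaussian_laplace

# LoG responds to blob-like intensity changes at a chosen scale; DoG
# approximates it as the difference of two Gaussian blurs. A CNN's
# first layer often *learns* kernels that resemble such detectors.
img = np.random.rand(64, 64)
log_resp = gaussian_laplace(img, sigma=2.0)
dog_resp = gaussian_filter(img, 2.0) - gaussian_filter(img, 3.2)
print(log_resp.shape, dog_resp.shape)
```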

sanittawan commented 4 years ago

In the contrast normalization section, the chapter talks about various methods. I am wondering if there is a rule of thumb on how to choose, for example, global contrast normalization over local contrast normalization. Is there a good way to detect the kinds of normalization a data set would need?
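I don't know a universal rule either, but seeing the local version next to the global one clarifies the trade-off; a minimal sketch (hypothetical sigma and floor values):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Local contrast normalization: subtract a Gaussian-weighted local
# mean, then divide by the local standard deviation, flooring the
# divisor so near-uniform regions are not blown up into noise.
def local_contrast_normalize(img, sigma=4.0, floor=1e-3):
    local_mean = gaussian_filter(img, sigma)
    centered = img - local_mean
    local_std = np.sqrt(gaussian_filter(centered ** 2, sigma))
    return centered / np.maximum(local_std, floor)
```

Roughly, global normalization fixes overall exposure differences across images, while the local version emphasizes edges within an image, so the choice depends on whether within-image lighting variation carries signal for the task.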

HaoxuanXu commented 4 years ago

I'm very interested in the asynchronous gradient descent approach to parallelized training. I'm not sure whether such processes exist in Python or if they're only available in languages such as Java.
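It is expressible in Python; here is a Hogwild!-style sketch with lock-free shared weights (a toy linear-regression task with illustrative hyperparameters, not the book's code):

```python
import numpy as np
from multiprocessing import Process, RawArray

# Workers update a shared weight vector without any locking; the
# occasional overwritten update is tolerated, which is the idea
# behind asynchronous (Hogwild!-style) SGD.
def worker(shared_w, X, y, lr, steps):
    w = np.frombuffer(shared_w)            # writable view of shared memory
    rng = np.random.default_rng()
    for _ in range(steps):
        i = rng.integers(len(y))
        grad = (X[i] @ w - y[i]) * X[i]    # one-sample squared-loss gradient
        w -= lr * grad                     # unsynchronized update

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    true_w = np.arange(5.0)
    y = X @ true_w
    shared_w = RawArray("d", 5)            # shared, lock-free weights
    procs = [Process(target=worker, args=(shared_w, X, y, 0.01, 5000))
             for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(np.frombuffer(shared_w))         # should be close to true_w
```

Frameworks like PyTorch and TensorFlow also expose asynchronous and distributed training from Python, so it is not a Java-only capability.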

YanjieZhou commented 4 years ago

As the authors state, it is a pity to ignore the great value of research on AI applications to image processing, which could come in very handy when AI is needed to replace people in tasks like surveillance. So I am wondering how this can be put into practice.

wunicoleshuhui commented 4 years ago

I'm confused about one application to convolutional networks. Why would we want to use k-means to learn features for a convolutional network rather than training the whole layer?

ziwnchen commented 4 years ago

I realize that deep learning is a powerful tool in computer vision, especially at "replicating human visual ability." But I'm also curious: are there any deep learning studies in computer vision that build models which exceed human ability in some respects? As the concept of reinforcement learning suggests, might machines do better by no longer imitating humans?

kdaej commented 4 years ago

The core element of machine learning is training on previous data to generate or predict future datasets. However, language does not always follow the old path; it often arises spontaneously, like an accident. New phrases and words are created and used by people without any clear origin. How can machine learning be used to find the meaning or the origin of these kinds of expressions?

VivianQian19 commented 4 years ago

The chapter touches on deep learning techniques in computer vision, speech recognition, and NLP. Computer vision requires little preprocessing but does require the images to be standardized; other techniques, such as dataset augmentation, are used to reduce generalization error, but they might not be necessary (448). It also involves removing variation in contrast from the images. I'm a bit confused about importance sampling under 12.4.3.3. Is this technique used only when there is a sparse vector, and only for deep learning?

meowtiann commented 4 years ago

> The core element of machine learning is training on previous data to generate or predict future datasets. However, language does not always follow the old path; it often arises spontaneously, like an accident. New phrases and words are created and used by people without any clear origin. How can machine learning be used to find the meaning or the origin of these kinds of expressions?

I think that's also one of Prof. Evans' topics on the knowledge of science. If we can predict new phrases, we can also predict groundbreaking research. Tracing back the history of development might be easier, but predicting a less-predictable future is hard.

cindychu commented 4 years ago

Preprocessing seems to be very important in computer vision as well. However, this kind of preprocessing seems more complicated; for example, the chapter describes it as "remov[ing] some kind of variability in the input data that is easy for a human designer to describe and that the human designer is confident has no relevance to the task," which seems even harder than a simple machine learning task. So I am wondering: is there any typical or standard way of preprocessing images? What kinds of variation is it critical to remove in computer vision?

cytwill commented 4 years ago

This is also a concluding chapter detailing the applications of deep learning. I noticed that at the beginning the authors mention the importance of writing CPU- and GPU-efficient code for running deep learning models, which made a strong impression on me in our homework notebook. My question is: are there any ideas or approaches for decreasing the computational demands of deep learning models so that individual users can more easily set them up on their laptops?