UChicago-Thinking-Deep-Learning-Course / Readings-Responses


Week 6 - Possibility Readings #12

bhargavvader opened 3 years ago

bhargavvader commented 3 years ago

Post a reading of your own that uses deep learning for social science analysis and understanding, with a focus on Images, Sound, & Video.

Yilun0221 commented 3 years ago

Title: Behavioral Economics Approach to Interpretable Deep Image Classification. Rationally Inattentive Utility Maximization Explains Deep Image Classification

Summary: This paper by Kunal Pattanayak and Vikram Krishnamurthy asks whether deep image classification results comply with the utility-maximization assumptions of behavioral economics. Specifically, the researchers study whether “a deep CNN is equivalent to a decision maker with behavioral economics constraints”, namely a rationally inattentive Bayesian utility maximizer, and find that the answer is yes. Two behavioral economics models, with accompanying theorems, are used in the research: “utility maximization rational inattention (UMRI)” and “UMRI for a collection of agents (CUMRI)”. In the experiments, the researchers apply various deep convolutional neural network architectures, with different numbers of hidden layers and different layer designs, to the CIFAR-10 image dataset; all layers use an exponential linear unit (ELU) activation. After establishing the robustness and sparsity of fitting these behavioral economics models to the image data, the researchers tune the hyperparameters for the best classification performance, and the results are consistent with the hypotheses drawn from behavioral economics theory.
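(The paper's exact architectures are not reproduced here, but a minimal sketch of the kind of setup described above, a small CNN with ELU activations trained on CIFAR-10, might look like the following; the layer sizes and training settings are my own illustrative choices, not the authors'.)

```python
# Minimal sketch: a small CNN with ELU activations on CIFAR-10 (illustrative
# layer sizes; not the architectures used in the paper).
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

class SmallELUNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ELU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ELU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ELU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ELU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(64 * 8 * 8, 256), nn.ELU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

train_set = torchvision.datasets.CIFAR10("data", train=True, download=True,
                                          transform=T.ToTensor())
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model = SmallELUNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for images, labels in loader:   # one pass shown; train for more epochs in practice
    opt.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    opt.step()
```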

Expansions to social science analysis: As the researchers discuss in the paper, the methodology can be used to extract image features not only in behavioral economics but also in game-theoretic studies. From my perspective, it could also be extended to microeconomic studies, such as simulating people’s behavior in auctions, or used to predict people’s reactions to a policy, which could serve as a reference for policy makers. I would like to discuss further how to apply this methodology to empirical research on economic data.

New dataset exploration: I want to simulate people’s bidding strategies in an auction. Bid prices and bidders’ true valuations could be collected, so that the auction outcomes can be compared against the underlying valuations.

Raychanan commented 3 years ago

Title Artificial intelligence for sex determination of skeletal remains: application of a deep learning artificial neural network to human skulls

Summary This study demonstrates that artificial intelligence methods based on neural networks are ideally suited to the task of sex determination from skeletal structures. The only input into the artificial neural network in this study was the ectocranial image of skulls. Without any instruction or pre-existing knowledge of sex-dimorphic anatomy, the neural network was able to learn skull features that were useful in predicting sex. When tested on a validation set of skulls that it had never previously been exposed to, and derived from a relatively ethnically diverse population, it demonstrated high accuracy in sex determination. So, in a word, neural networks can be trained to estimate the sex of an individual from skeletal remains with high accuracy.
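(A hedged sketch of the basic setup described, a CNN fine-tuned for binary male/female classification of skull images, could look like the following; the folder layout, backbone choice, and hyperparameters are illustrative assumptions, not the study's actual pipeline.)

```python
# Hedged sketch: fine-tune a pretrained CNN for binary sex classification from
# skull images. Paths, model choice, and settings are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
# Assumes a folder layout like skulls/female/*.png and skulls/male/*.png
data = datasets.ImageFolder("skulls", transform=transform)
loader = torch.utils.data.DataLoader(data, batch_size=32, shuffle=True)

model = models.resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 2)   # two classes: female / male

opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:
    opt.zero_grad()
    loss_fn(model(images), labels).backward()
    opt.step()
```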

Expansions to social science analysis I think the same image recognition technology could be applied to other fields of archaeology as well. For example, in the past, people located ancient tombs by inference from experience and expert knowledge, and over recent decades we have accumulated a great deal of geographic information about known archaeological sites. By combining deep learning with satellite imagery, models could therefore suggest likely locations of remains that have not yet been discovered.

Possible datasets The datasets involved in this attempt should be readily available: 1. satellite images (probably easy to obtain from Google Earth) 2. geographic information about ancient remains that have already been discovered

nwrim commented 3 years ago

Using deep learning and Google Street View to estimate the demographic makeup of neighborhoods across the United States. Gebru et al., 2017, PNAS.

The article is a bit old (2017), but it shows how we can (cleverly) use deep learning for urban studies without too much complication and still draw interesting results. (Also, the senior author is Fei-Fei Li, who lectures the famous CS231n course, although I did not notice that until copying and pasting the info above.)

  1. brief summary of the article: The authors collected a gigantic dataset of Google Street View images (50M+ across 200 cities) and first applied an object recognition algorithm (Deformable Part Model; DPM) to detect cars in the Street View images. Then, they used a Convolutional Neural Network (CNN), trained on an MTurk-labeled dataset of images from car shopping sites like cars.com, to classify the detected cars into about 3,000 categories. They then used this classification data to see what kinds of cars appear in each precinct and built statistical models that predict demographic information about that precinct (a toy sketch of this count-then-predict pipeline appears after this list). They found that the detected car categories within a neighborhood predicted various demographic attributes well. I include one of the examples they gave below (from the abstract).

    if the number of sedans encountered during a drive through a city is higher than the number of pickup trucks, the city is likely to vote for a Democrat during the next presidential election (88% chance); otherwise, it is likely to vote Republican (82%)

  2. suggestion on how its method could be used to extend social science analysis: The result of the study itself could have interesting applications to social science analysis. The Census or ACS works on a yearly (or even decennial) scale, but if we can use these large-scale street-view analyses to estimate demographics, we could have more real-time, finer-time-window demographic estimates, which would be useful for many social science studies. More generally, I think this study shows how the power of computer vision/deep learning can be applied to the social sciences: you pick out an entity (or entities) of interest (cars in this case), use computer vision/deep learning to extract that information from a large-scale image/video dataset, and use the resulting metrics on a social science problem. I think this kind of setup might lead to very interesting opportunities that were previously impossible in the social science domain (and hopefully a lot of low-hanging fruit?)

  3. describing what social data you would use to pilot such a use: Using Google Street View images but detecting and classifying a different object would be an immediate but interesting next step. I also think the general methodology can apply to almost any naturalistic image/video dataset, as long as there is an entity or object related to a social science research question.
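A toy sketch of the count-then-predict pipeline from point 1, under the assumption that an upstream detector/classifier has already produced per-precinct car counts (the counts and outcome below are made up for illustration):

```python
# Toy sketch: aggregate detected car types per precinct, then fit a model from
# those counts to a demographic/electoral outcome. The detection/classification
# step is assumed to have already produced the per-precinct counts.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical per-precinct counts from an upstream detector/classifier
df = pd.DataFrame({
    "precinct": ["A", "B", "C", "D"],
    "n_sedans": [120, 40, 300, 80],
    "n_pickups": [30, 90, 60, 150],
    "voted_democrat": [1, 0, 1, 0],   # outcome taken from election returns
})

# Simple features: log counts plus the sedan/pickup ratio highlighted in the paper
X = np.column_stack([
    np.log1p(df["n_sedans"]),
    np.log1p(df["n_pickups"]),
    df["n_sedans"] / (df["n_pickups"] + 1),
])
y = df["voted_democrat"]

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(X)[:, 1])    # estimated P(Democratic) per precinct
```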

cytwill commented 3 years ago

Title: Where’s Wally Now? Deep Generative and Discriminative Embeddings for Novelty Detection

Summary: In this research, the authors propose a new framework for novelty detection on deep embeddings. Novelty detection here concerns prediction tasks in computer vision and image classification, and is defined as the problem of training on inlier data not corrupted by outliers, then making inferences on new observations to detect outliers. In their framework, the novelty detection task proceeds in two steps: a) embed the image via either a discriminative network (e.g., a CNN classifier) or a generative network (e.g., a GAN or autoencoder), then b) compute a novelty score from the reduced or normalized image embedding and decide whether the test image is novel or not (0/1). The general idea behind the novelty score is to characterize some notion of “distance” from a test image to the training inlier exemplars, since the test samples contain both inlier and outlier images. The authors try four approaches to compute the scores: local outlier factor (LOF), one-class support vector machine (1CSVM), isolation forest (IF), and elliptic envelope (EE). They also note that the complexity of the detection task can vary with the mixture of inlier and outlier samples in the test set. To make their framework general across problems of similar complexity, they propose two ways to quantify complexity: a KL-divergence-based complexity assessment and a Bayes-error-rate complexity assessment. Finally, the novelty detection (ND) algorithmic performance is characterized as a trade space between ROC AUC and the complexity of the ND problem (AUC = f(complexity)). They evaluate the framework on both single-inlier, multiple-outlier and multiple-inlier, single-outlier problems using the CIFAR-10 and IN-125 datasets. The results suggest that one of their generative methods outperforms other recently proposed methods.
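A minimal sketch of the scoring step, using scikit-learn's implementations of the four methods named above on precomputed image embeddings (the deep embedding step is assumed to have happened upstream; the random arrays only stand in for real embeddings):

```python
# Minimal sketch: novelty scores from the four methods mentioned above, applied
# to precomputed image embeddings (the deep embedding step is assumed upstream).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
inlier_emb = rng.normal(0.0, 1.0, size=(500, 64))  # stand-in for training (inlier) embeddings
test_emb = rng.normal(0.5, 1.5, size=(100, 64))    # stand-in for test embeddings (inliers + outliers)

detectors = {
    "LOF": LocalOutlierFactor(novelty=True),   # novelty=True enables scoring unseen samples
    "1CSVM": OneClassSVM(gamma="scale"),
    "IF": IsolationForest(random_state=0),
    "EE": EllipticEnvelope(support_fraction=1.0),
}

for name, det in detectors.items():
    det.fit(inlier_emb)                        # train on inliers only
    scores = det.decision_function(test_emb)   # higher = more inlier-like
    is_novel = det.predict(test_emb) == -1     # -1 flags a novel/outlier sample
    print(name, is_novel.mean().round(2))
```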

Extension to Social Research: The framework proposed in this paper could have extensive applications in social science research. Firstly, we might use the framework to detect the proportion of novel posts among users’ social media posts and see how this relates to users’ popularity or network position, or to the extremism of their online comments, since novelty here could also be interpreted as capturing polarized outliers (e.g., via the EE measure). Secondly, the novelty-score computation methods open the possibility of generating a spectrum of novelty, to better measure the relationships between novelty and other socio-psychological indicators. Thirdly, these measurements could potentially be applied to other embeddings, such as text or network embeddings.

New dataset exploration: This novelty detection methodology could be applied to any labeled image dataset, where users might treat one or several (but not all) classes as inliers and the rest as outliers for training and testing. For image data without labels, unsupervised clustering and visualization could help identify which samples are inliers. Alternatively, defining a distance threshold around a small number of manually chosen inliers could also delimit the inlier set for training and thus make novelty/outlier detection feasible.

pcuppernull commented 3 years ago

Zheng et al. 2021. Deep Co-Attention Network for Multi-View Subspace Learning.

Summary:

Deep learning models that leverage multimodal data are often inhibited by opaque interpretability. In this paper, the authors propose a deep adversarial co-attention model for multi-view subspace learning, called ANTS, which extracts information in an adversarial setting to provide interpretations of the predictions produced by the model. In effect, ANTS extracts both the shared information across images and the view-specific information in an adversarial setting, providing more comprehensive information to the user. This is done by using a co-attention encoder module to identify the shared information across images, projecting the common information back to the original input space with a decoder module, and using the residual between the original and reconstructed inputs as the “view-specific” information. In the case of multiple images (for instance, different angles of photos in a facial recognition setting), this framework can be useful for identifying which aspects of which images led to a prediction.
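A heavily simplified sketch of the shared-versus-view-specific decomposition described above; it omits ANTS's co-attention and adversarial components entirely, and all module sizes are invented:

```python
# Simplified sketch: a joint encoder produces a shared code, per-view decoders
# reconstruct each view from it, and the residual is taken as the view-specific
# information. This omits ANTS's co-attention and adversarial training.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, dim_a, dim_b, dim_shared=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_a + dim_b, 128), nn.ReLU(),
                                 nn.Linear(128, dim_shared))

    def forward(self, a, b):
        return self.net(torch.cat([a, b], dim=-1))   # shared code across the two views

dim_a, dim_b = 100, 80
encoder = SharedEncoder(dim_a, dim_b)
decoder_a = nn.Linear(32, dim_a)   # projects the shared code back to view A's space
decoder_b = nn.Linear(32, dim_b)

view_a = torch.randn(16, dim_a)    # stand-ins for two views of the same objects
view_b = torch.randn(16, dim_b)

shared = encoder(view_a, view_b)
recon_a, recon_b = decoder_a(shared), decoder_b(shared)
specific_a = view_a - recon_a      # residual = view-specific information for view A
specific_b = view_b - recon_b
```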

Social science extension:

Interpretability of models has typically been paramount in social science research. It is likely that the difficulties in interpreting many deep learning models have slowed their adoption by social scientists -- ANTS represents a significant leap forward in model interpretability.

Research Proposal:

In 2016, the Central Intelligence Agency (CIA) declassified the daily intelligence reports delivered to the President of the United States from 1961-1977. These “Presidential Daily Brief” documents include text, maps, and various redactions. A researcher may be interested in understanding what types of information are redacted from these documents, as redactions indicate particularly sensitive pieces of intelligence that may shed light upon state priorities. Using ANTS, a researcher could use both maps as image data and the document text in a multimodal setting to predict whether text on a particular topic has been redacted -- often, the heading of a section remains (like “South Vietnam”), but the body of the section is redacted. ANTS could then help the researcher recover what pieces of text or parts of the maps were most influential in making the prediction, which could point the researcher to clues in the unredacted text and maps as to why certain text was redacted.

ajahn6 commented 3 years ago

Multimodal Meme Dataset (MultiOFF) for Identifying Offensive Content in Image and Text

Summary: Suryawanshi et al. take a multimodal approach to identifying offensive content in memes. Meme classification is difficult because of memes' typically multimodal nature (image and text), compounded by the propensity for meaning to be obscure and conveyed through humor and sarcasm. Suryawanshi et al. build a dataset of 743 memes from around the 2016 US presidential election and employ volunteers to code them for different kinds of offensive content, including hate speech, violence, and targeted or untargeted offense. The text of each meme is extracted, and its word embeddings are fed through an LSTM to obtain an embedding for the meme's text. The image is fed through a CNN using the pretrained VGG16 model. The text and image embeddings are concatenated and a classifier is trained on top (a sketch of this fusion setup is included below). Suryawanshi et al. found that the multimodal model outperformed most text-only and image-only classifiers on recall, but still only achieved a maximum score of 0.66. The authors suggest that an ensemble model could improve precision and recall. They also suggest that feature embeddings incorporating information such as tags, more accurately labeled data (there were some difficulties with human classification of memes as offensive or inoffensive), and a larger dataset could improve categorization going forward.
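Here is a hedged sketch of the fusion step described above (LSTM text embedding plus pretrained VGG16 image features, concatenated into a small classifier); the vocabulary size, dimensions, and two-class output are illustrative assumptions rather than the paper's exact configuration:

```python
# Sketch of the late-fusion setup described above: LSTM text embedding +
# pretrained VGG16 image features, concatenated and fed to a small classifier.
# Dimensions, vocabulary size, and labels are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

class MemeClassifier(nn.Module):
    def __init__(self, vocab_size=5000, emb_dim=100, lstm_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, lstm_dim, batch_first=True)
        vgg = models.vgg16(weights="IMAGENET1K_V1")
        self.image_encoder = nn.Sequential(vgg.features, nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.classifier = nn.Sequential(
            nn.Linear(lstm_dim + 512, 64), nn.ReLU(),
            nn.Linear(64, 2),                       # offensive / not offensive
        )

    def forward(self, token_ids, image):
        _, (h, _) = self.lstm(self.embed(token_ids))
        text_vec = h[-1]                            # final LSTM hidden state as the text embedding
        img_vec = self.image_encoder(image)         # 512-d VGG16 feature vector
        return self.classifier(torch.cat([text_vec, img_vec], dim=-1))

model = MemeClassifier()
logits = model(torch.randint(0, 5000, (4, 20)),     # 4 memes, 20 tokens of extracted text each
               torch.randn(4, 3, 224, 224))          # the 4 corresponding meme images
```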

Social science extension: Social media discourse, especially in image-intensive spaces such as Tumblr, Instagram, or 4chan among others, is substantially mediated through multimodal meme formats. Limiting discourse analysis in these spaces to only text or only image content analysis risks excluding important loci of communication. This paper demonstrates some of the promises of multimodal processing for memes, as well as identifying shortcomings and potential areas of improvement. Incorporating feature embeddings for memes that include information on their associated communities or networks of origin, comment data, perplexity, etc. could be useful in a number of social media analysis applications, including tracking the flow of discourse structures and ideas-as-memes in different communities, correlating styles of discourse with propensity towards violence or certain ideologies, and incorporating memes into sentiment analysis with regards to certain topics.

New dataset exploration: A multimodal model could be trained on a labeled dataset of memes from delineated political groups on Facebook (right wing and left wing): train on memes posted by older individuals, then test separately on held-out memes from older posters and on memes from younger posters (still classified by political leaning). We could then compare prediction accuracy to see how well a model trained on posts from older individuals categorizes posts from younger individuals. This would give a quantifiable metric of the relative difference in discourse patterns between older and younger posters on Facebook within each political leaning, showing how much or how little of a generational divide there is among different political camps in their social media usage.

k-partha commented 3 years ago

UNITER: UNiversal Image-TExt Representation Learning (Microsoft AI Research)

The authors introduce UNITER (UNiversal Image-TExt Representation), which creates joint image-text embeddings, the bedrock for most Vision-and-Language (V+L) tasks. These tasks involve processing multimodal inputs, i.e., simultaneously processing visual and textual information, to understand their joint representation. An example would be images with related text captions, which could come from blogs, albums, social media posts, etc. The model is learned through large-scale pre-training over four image-text datasets, with the aim of creating a more generalizable model than previous Vision-and-Language models, one that can power heterogeneous downstream V+L tasks with joint multimodal embeddings. The authors design four pre-training tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). UNITER achieves state-of-the-art performance across six V+L tasks over nine datasets.
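To make the joint-embedding idea concrete, here is an illustrative sketch of feeding projected image-region features and text token embeddings through a single Transformer encoder; this is only the general idea, not UNITER's actual architecture or pre-training tasks:

```python
# Illustrative sketch of a joint image-text encoder: project image-region
# features and word embeddings into a shared space and run one Transformer
# encoder over the combined sequence. Not UNITER's actual architecture.
import torch
import torch.nn as nn

d_model = 256
text_embed = nn.Embedding(10000, d_model)          # toy vocabulary
region_proj = nn.Linear(2048, d_model)             # e.g., detector region features -> shared space
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=4,
)

token_ids = torch.randint(0, 10000, (2, 12))       # 2 captions, 12 tokens each
region_feats = torch.randn(2, 36, 2048)            # 2 images, 36 detected regions each

seq = torch.cat([text_embed(token_ids), region_proj(region_feats)], dim=1)
joint = encoder(seq)                                # contextualized joint image-text representation
pooled = joint.mean(dim=1)                          # one vector per image-caption pair
```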

Social science extension: A significant proportion of social media posts are image based and often come with the context of text either accompanying them or printed directly in the images themselves. The combination of text and image is key to the meaning of the post as a whole - an extreme example being GIFs. By and large, we have yet to devise methods that can capture the meanings of such communications. A joint image-text representation is a promising start in this endeavour. This could be a valuable tool in encoding social media posts and other forms of multimodal social communication for a wide variety of downstream social science research tasks.

New dataset: Tweets are a classic example of social media posts that combine images and text. This model could play a valuable role in my project wherein I aim to classify user personality based on their tweets. Instead of separately encoding text and images from tweets, using this model would be more efficient - both in terms of complexity and in terms of information loss (as a joint representation would contain more than the sum of its parts). The additional information could improve the overall ability of my downstream classifiers to classify Tweets.

bakerwho commented 3 years ago

Eun Seo Jo & Timnit Gebru. (2020) Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning. FAT 2020.

Summary In this important paper, Jo and Gebru outline the increasing need for socioculturally grounded strategies for building inclusive datasets in machine learning. Especially given the context of bias in facial recognition, employability prediction, recidivism prediction, and many other machine learning applications, an important body of work has focused on debiasing algorithms from a mathematical or operational perspective. Jo and Gebru trace the problem back to the dataset, drawing on library science and other fields to offer simple and actionable steps toward inclusivity and representation.

Social science extension This paper is already a kind of social science extension. Thinking of their past work on Datasheets for Datasets, it would be an interesting project to use the frameworks proposed by such works to actually audit the many publicly available datasets. This is, of course, a non-trivial task, whether done algorithmically or by humans. Often, the documentation surrounding a dataset has less to do with how it was collected and more to do with licensing, access, size, and format. Some background digging, or web scraping, may be required for such an endeavour.

New dataset It would certainly be possible to audit some well-known algorithms and benchmarks for their sociocultural inclusivity on the basis of this work. The GLUE benchmark used in natural language understanding is one such readily accessible work.

jsoll1 commented 3 years ago

Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks. https://arxiv.org/pdf/1312.6082.pdf

I picked this reading because it is closely related to my class final project, which works with detection algorithms on Street View images.

Summary: This paper presents an approach for reading multi-digit numbers directly from Street View images. The authors train a deep convolutional neural network using the DistBelief infrastructure, with their best performance coming from a model with eleven hidden layers. They achieve greater than 95% accuracy both on this task and on reCAPTCHA images, which is impressive given that reCAPTCHA is one of our best tools for telling whether an operator is human or a computer.
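A minimal sketch roughly in the spirit of the paper's output design, a shared convolutional trunk with separate softmax heads for the sequence length and each digit position; the layer sizes are illustrative and nothing here reproduces the paper's eleven-layer model or DistBelief-scale training:

```python
# Sketch of a multi-digit reader: one convolutional trunk with separate heads
# for the sequence length and each digit position. Illustrative sizes only.
import torch
import torch.nn as nn

class MultiDigitNet(nn.Module):
    def __init__(self, max_digits=5):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 256), nn.ReLU(),
        )
        self.length_head = nn.Linear(256, max_digits + 1)           # predicts how many digits
        self.digit_heads = nn.ModuleList(
            [nn.Linear(256, 10) for _ in range(max_digits)]          # one 10-way head per position
        )

    def forward(self, x):
        h = self.trunk(x)
        return self.length_head(h), [head(h) for head in self.digit_heads]

model = MultiDigitNet()
length_logits, digit_logits = model(torch.randn(8, 3, 64, 64))       # 8 cropped street-number images
```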

Social Science Extension: Better number detection algorithms could be useful in a number of different social science contexts. For example, they allow better recognition of the contents of internet memes, which often degrade in quality over time. Furthermore, numbers in Street View pictures can give us more accurate information about qualities such as time of day. This is useful both for increasing the range of information we extract from our images and for increasing the number of images that are admissible.

New Dataset:

I'm interested in seeing whether this approach to better number detection in Street View can be applied to memes, which, as they are screenshotted repeatedly, suffer substantial compression loss and lower image quality.

hesongrun commented 3 years ago

Persuading Investors: a Video-based Study https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3583898

Summary: This paper studies the pitches that startup companies give to venture capitalists. The research questions the authors care about are: (1) Do delivery features, such as facial expression, tone of voice, and diction, matter for economic decision-making? (2) Through what economic mechanisms, and do they lead to better investment decisions? The main empirical methodology of the paper is to use video as data. The authors use a machine-learning-based framework to process the videos and construct their variables, adopting a 3-V structure: visual, vocal, and verbal. The data they observe are the pitch videos, the investment decisions, and the long-term development of the startups. They have four findings: (1) persuasion delivery matters for investment decisions; (2) delivery-driven decisions are associated with lower investment success; (3) delivery features are used differently to judge men and women; and (4) inaccurate beliefs, rather than personal preferences, explain most of the pattern.
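As a rough illustration of the feature-construction step, here is a hedged sketch of assembling a 3-V feature vector for one pitch video: vocal features from the audio track via librosa, with placeholders for the visual and verbal channels. The file paths and placeholder functions are assumptions, not the authors' pipeline:

```python
# Hedged sketch of a 3-V feature vector for one pitch video: vocal features
# from the audio track via librosa, with placeholders for the visual and
# verbal channels. Paths and placeholder functions are assumptions.
import numpy as np
import librosa

def vocal_features(audio_path):
    y, sr = librosa.load(audio_path, sr=None)
    f0 = librosa.yin(y, fmin=librosa.note_to_hz("C2"),
                     fmax=librosa.note_to_hz("C7"), sr=sr)
    rms = librosa.feature.rms(y=y)[0]
    # Pitch level/variability and loudness as simple "tone of voice" proxies
    return np.array([np.nanmean(f0), np.nanstd(f0), rms.mean(), rms.std()])

def visual_features(video_path):
    # Placeholder: e.g., average positive-facial-expression score per frame,
    # from a separate face/emotion model (not implemented here).
    return np.zeros(3)

def verbal_features(transcript):
    # Placeholder: e.g., sentiment or word-category scores from the transcript.
    return np.zeros(3)

features = np.concatenate([
    vocal_features("pitch_audio.wav"),      # assumed pre-extracted audio track
    visual_features("pitch_video.mp4"),
    verbal_features("pitch transcript text ..."),
])
```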

Social Science Extensions There are many other settings in which people pitch their ideas to others. For example, executives pitch strategies to boards of directors, and sell-side analysts pitch stocks to investors. It would be interesting to see whether there are commonalities across these settings. What are the key factors in persuasion? Do they lead to better outcomes?

New Dataset I am interested in seeing the effects of promotional videos on a company's IPO or other security issuance activities. Since pitch delivery affects venture capital decisions, it is likely that roadshow delivery also has a large impact on investors in the secondary market.

william-wei-zhu commented 3 years ago

Title: Attentive CutMix: An Enhanced Data Augmentation Approach for Deep Learning Based Image Classification

Summary: this paper introduces a data augmentation technique called "Attentive CutMix". It uses attention maps from a pretrained model to identify the most informative patches of one image and paste them onto another, mixing the labels accordingly, in order to improve generalization during training. The researchers tested this technique on the CIFAR-10 and CIFAR-100 datasets and found that it significantly outperformed the baseline as well as other augmentation techniques, including Mixup and CutMix.
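A hedged sketch of the patch-pasting step: use a (precomputed) attention grid to pick the most informative patches of one image, paste them onto another, and mix the labels in proportion to the pasted area. The grid size, top-k, and the source of the attention map are illustrative; see the paper for the exact procedure:

```python
# Hedged sketch of an attentive CutMix-style augmentation. The attention grid
# would normally come from a pretrained model; a random grid stands in here.
import torch

def attentive_cutmix(img_a, label_a, img_b, label_b, attn_grid, k=6):
    """img_*: (C, H, W); attn_grid: (G, G) importance map for img_b."""
    C, H, W = img_b.shape
    G = attn_grid.shape[0]
    ph, pw = H // G, W // G                       # patch size implied by the grid
    mixed = img_a.clone()
    top = torch.topk(attn_grid.flatten(), k).indices
    for idx in top:                               # paste the k most attended patches of B onto A
        r, c = divmod(idx.item(), G)
        mixed[:, r*ph:(r+1)*ph, c*pw:(c+1)*pw] = img_b[:, r*ph:(r+1)*ph, c*pw:(c+1)*pw]
    lam = k / (G * G)                             # fraction of the image taken from B
    mixed_label = (1 - lam) * label_a + lam * label_b
    return mixed, mixed_label

# Toy usage with a random stand-in for the pretrained model's attention map
img_a, img_b = torch.rand(3, 224, 224), torch.rand(3, 224, 224)
label_a, label_b = torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0])  # one-hot labels
mixed_img, mixed_label = attentive_cutmix(img_a, label_a, img_b, label_b,
                                          attn_grid=torch.rand(7, 7))
```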

application to social sciences: this technique seems very useful to minimize overfitting in image recognition, for a variety of social science research objectives.

new data: I am interested in applying this technique to urban imagery (road images, traffic, buildings, etc.).