Thinking-with-Deep-Learning-Spring-2022 / Readings-Responses

You can post your reading responses in this repository.

Week 7 - Possible Readings #13

Open lkcao opened 2 years ago

lkcao commented 2 years ago

Post a link for a "possibility" reading of your own on the topic of Sound and Image Learning [for week 7], accompanied by a 300-400 word reflection that: 1) briefly summarizes the article (e.g., as we do with the first “possibility” reading each week in the syllabus), 2) suggests how its method could be used to extend social science analysis, 3) describes what social data you would use to pilot such a use with enough detail that someone could move forward with implementation.

linhui1020 commented 2 years ago

Zhao Ren, Nicholas Cummins, Vedhas Pandit, Jing Han, Kun Qian, and Björn Schuller. 2018. Learning Image-based Representations for Heart Sound Classification. In Proceedings of the 2018 International Conference on Digital Health (DH '18). Association for Computing Machinery, New York, NY, USA, 143–147. https://doi.org/10.1145/3194658.3194671

  1. This paper uses a pre-trained image-classification CNN to classify heart sounds as normal or abnormal from scalogram images of phonocardiogram (PCG) recordings, showing that features extracted with an ImageNet-pretrained VGG16 are actually more robust than the ComParE audio feature set. The representations extracted from a fine-tuned CNN achieve 56.2% mean accuracy on the classification task, significantly higher than the 49% accuracy of the baseline (conventional audio-processing features with a support vector machine). The authors attribute this increase in accuracy to the features coming from a fine-tuned model: they are learned from the data rather than taken from a fixed, off-the-shelf feature package. During preprocessing, the authors select the first 4 s of each heart-sound recording. By feeding the resulting images into the adapted VGG16, they obtain more than 4,000 attributes.
    To achieve transfer learning, i.e., to construct a robust classifier and adapt the VGG parameters to the data, the authors try both updating only the classifier layers of the ImageNet model and retraining the whole network, and they find that adapting the entire CNN works better than simply updating the last two fully connected layers. (A minimal sketch of this fine-tuning workflow follows the list below.)

  2. This method is interesting because it shows how to effectively extract features from our data, fine-tune a model for more suitable representations, and train a CNN on a combined sound-and-image dataset. Beyond digital health, I wonder whether it could be used to classify whether someone is lying, for instance by comparing audio of a suspect's everyday speech with audio recorded during interrogation. I also wonder whether such a method could be used in research on individuals' emotional change, where the audio record is mapped to micro-expressions; through this, we could gain some insight into individual personality.

  3. This method could be extended to other health issues, for example COVID-19. There is a dataset on Kaggle containing cough sounds from negative and positive cases, which I used for a previous assignment. The audio patterns do show some differences: coughs from positive cases tend to be longer and deeper, while those from negative cases are shorter and crisper. This could help determine positive or negative status in countries where testing kits are in short supply, and it suggests the possibility of software that classifies cough audio as a complement to biomedical testing.
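A minimal sketch of the fine-tuning setup described in point 1, assuming scalogram images resized to the standard 224x224 ImageNet input: load a pretrained VGG16, replace its final classifier layer for a binary normal/abnormal output, and choose between fine-tuning everything or only the head. The data here are random placeholder tensors, and the hyperparameters are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn
from torchvision import models

# Pretrained VGG16 backbone (ImageNet weights).
model = models.vgg16(pretrained=True)

# Replace the final classifier layer for binary normal/abnormal output.
model.classifier[6] = nn.Linear(4096, 2)

# Option A: fine-tune the entire network (what the paper found works best).
# Option B: freeze convolutional layers and train only the classifier head.
FINE_TUNE_ALL = True
if not FINE_TUNE_ALL:
    for param in model.features.parameters():
        param.requires_grad = False

optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# Placeholder batch: 8 scalogram "images" as 224x224 RGB tensors.
# In practice these would come from a DataLoader over PCG scalograms.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 2, (8,))

model.train()
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(f"one training step done, loss = {loss.item():.3f}")
```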

javad-e commented 2 years ago

Gebru, T. & Krause, J. & Wang, Y. & Chen, D. & Deng, J. & Aiden, E. & Fei-Fei, L. (2017). Using deep learning and Google Street View to estimate the demographic makeup of neighborhoods across the United States. PNAS. Available Online: https://www.pnas.org/doi/pdf/10.1073/pnas.1700035114.

1) Consider the following example from the paper: if the number of sedans encountered during a drive through a city is higher than the number of pickup trucks, the city is highly likely (88 percent in their data) to vote for the Democratic candidate in the next presidential election. Gebru et al. propose a deep learning model based on Google Street View images to estimate the socioeconomic characteristics and political preferences of neighborhoods in the United States. The American Community Survey (ACS), for example, costs over $250 million each year to collect information on such characteristics, and the authors propose gradually replacing costly surveys like the ACS with deep learning models. The dataset consists of 50 million Google Street View images from 200 cities, analyzed with a convolutional neural network. The focus of the study is on the cars appearing in the images, which are classified into 2,657 categories based on characteristics such as make, model, and year. The authors then add further data, including the approximate price of each category. Although the paper focuses mainly on estimating political preferences, the researchers are also able to estimate other variables, such as income and race, from the vehicles observed in the images.

2) I found it very interesting that so much information could be learned just by analyzing the cars in Google Street View. I believe both the method and the dataset could be valuable to social scientists. There are a number of other studies using street view images; for example, Naik et al. (2017) use a similar dataset to predict and explain physical urban change. Numerous other questions could be answered with this kind of data by focusing on different elements such as cars, parking lots, buildings, stores, and colors.

3) If the focus is only on cars, one does not have to start with Google Street View images at all. As a pilot project, we could instead install a camera and examine the passing cars to test our hypothesis. Moreover, Google Earth satellite images are easier to access and analyze, so depending on the research question, we could also use high-resolution satellite images in the early stages.

borlasekn commented 2 years ago

Middya, A.I., Nag, B., & Roy, S. (2022). Deep learning based multimodal emotion recognition using model-level fusion of audio–visual modalities. Knowledge Based Systems. 244.

Link: https://www.sciencedirect.com/science/article/pii/S0950705122002593?casa_token=WNNiAR_kAQcAAAAA:lc10lJt2KWESGArOHnBJHkTWOvKXDeRtUvFmqfCM25PFxt4pnS2_NGp96EZVDL9K3oWJL8ztKt8

  1. When analyzing audio and visual data to study human communication, it is important to be able to capture human emotion. This paper explores the fusion of separate extractor networks for audio and video data for emotion recognition, in contrast to the traditional dimensional and discrete emotion models typically used in emotion detection. The models are assessed on two benchmark multimodal datasets: the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the Surrey Audio-Visual Expressed Emotion (SAVEE) database. The proposed model achieves high predictive accuracy on both datasets. The authors then conduct case studies to explore the model's ability to capture variation in the emotional states of speakers in real-world audio and visual media.
  2. This research is particularly interesting because of its applications in fields ranging from psychology to criminology. It has implications for human well-being, since it offers a way to read people's emotions beyond the words they are saying, which is often difficult for humans to do without the help of machines; one might also wonder whether a machine can pick up on elements of speech or video that humans cannot detect. The work is also important for its methods: being able to build and combine extractor networks for different types of data has implications beyond emotion detection in social science research. (A sketch of this fusion idea follows the list below.)
  3. If I were to move forward with a social science research project combining audio and video extractor networks, I would look at interviews with celebrities versus interviews with everyday individuals talking about their work. I would want to answer the questions: do celebrities speak about their own work (movies, songs, etc.) differently from the way everyday individuals talk about their jobs? What makes celebrity personas different and more intriguing in society than other personas? I would gather both audio and video interviews from both groups and apply these fused emotion detection models to measure differences in the emotion with which people talk. I would likely combine this with text analysis of the words they are saying, and also try to build a binary audio/video classifier for celebrity versus non-celebrity, so that the emotion detection features could provide insight into how that classifier performs.
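A minimal sketch of model-level fusion in the spirit of the paper, not its exact architecture: two small extractor networks (one over audio features, one over video frames) whose embeddings are concatenated and passed to a shared emotion classifier. All input shapes, layer sizes, and the eight-class output are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioExtractor(nn.Module):
    """Toy 1D-CNN over an MFCC-like sequence (batch, channels, time)."""
    def __init__(self, in_channels=40, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.proj = nn.Linear(64, out_dim)

    def forward(self, x):
        return self.proj(self.net(x).squeeze(-1))

class VideoExtractor(nn.Module):
    """Toy 3D-CNN over a short clip (batch, channels, frames, H, W)."""
    def __init__(self, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.proj = nn.Linear(16, out_dim)

    def forward(self, x):
        return self.proj(self.net(x).flatten(1))

class FusionClassifier(nn.Module):
    """Model-level fusion: concatenate modality embeddings, then classify."""
    def __init__(self, n_emotions=8):
        super().__init__()
        self.audio = AudioExtractor()
        self.video = VideoExtractor()
        self.head = nn.Sequential(nn.Linear(128, 64), nn.ReLU(),
                                  nn.Linear(64, n_emotions))

    def forward(self, audio, video):
        fused = torch.cat([self.audio(audio), self.video(video)], dim=1)
        return self.head(fused)

# Placeholder batch: 4 clips with 40-dim audio features over 100 frames
# and 8 RGB video frames of size 64x64.
model = FusionClassifier()
logits = model(torch.randn(4, 40, 100), torch.randn(4, 3, 8, 64, 64))
print(logits.shape)  # torch.Size([4, 8])
```
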
Emily-fyeh commented 2 years ago

Beskow, D. M., Kumar, S., & Carley, K. M. (2020). The evolution of political memes: Detecting and characterizing internet memes with multi-modal deep learning. Information Processing & Management, 57(2),102170.

  1. The paper proposes Meme-Hunter, a multi-modal deep learning classifier that distinguishes memes from non-memes. For the inputs, the authors concatenate the text vector (generated with GloVe), image vectors, and human face encodings to represent the visual and contextual elements of memes. Meme-Hunter is a joint DNN combining an LSTM-based text classifier and a CNN-based image classifier, whose last layer uses a sigmoid function to output the probability that an image is a meme (a sketch of this joint setup appears after the list below). In total, around 50,000 images are collected and split into an 80% training set, a 10% validation set, and a 10% testing set, with a balanced meme vs. non-meme ratio. The model outperforms existing uni-modal models in terms of accuracy, precision, and F1 score.

  2. Meme-Hunter is then used to analyze pictures collected on Twitter during the 2018 US midterm elections. First, the paper classifies the pictures into memes and non-memes and compares their descriptive properties: memes receive fewer likes/retweets and have shorter life spans than other forms of images. Second, the authors conduct graph learning with fixed-radius nearest neighbors, revealing family networks of memes, and identify the political conversations that use memes, tagging them with party affiliation. Furthermore, bot-hunter is used to estimate how many memes are auto-published by bots, and face detection helps recognize the Democratic and Republican candidates appearing in the memes; using the Google Vision API, memes are shown to propagate across multiple social media platforms more than other media forms. Finally, a sampled subset of election memes is used to compare against earlier meme classifiers, and the results show that, although not dramatically better than other models, Meme-Hunter still leads when accuracy, precision, F1, and recall are all taken into account.

  3. Since my group decided to explore memes in our final project, the figure preprocessing and model training parts are a perfect reference for us. I learned that when conducting OCR, turning the picture into black and white can increase the accuracy of text recognition, which we will probably adopt for our project. The analytical part of this paper also thoroughly explores the possible implications of meme data. Since the saliency maps of the memes show that the model has learned to focus on the position of text in the picture, I would like to know whether there are other human-explainable features the model has picked up. I personally would like to explore the cultural interpretation of these political memes.
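A sketch of a joint text-LSTM plus image-CNN classifier in the spirit of Meme-Hunter, ending in a sigmoid meme/non-meme probability. The paper uses GloVe vectors and face encodings; here random embeddings and a tiny CNN stand in, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class JointMemeClassifier(nn.Module):
    """Text-LSTM + image-CNN with a sigmoid meme/non-meme output."""
    def __init__(self, vocab_size=10000, emb_dim=100, hidden=64):
        super().__init__()
        # The paper's text branch uses GloVe vectors; a trainable random
        # embedding is a stand-in here.
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(nn.Linear(hidden + 32, 32), nn.ReLU(),
                                  nn.Linear(32, 1), nn.Sigmoid())

    def forward(self, tokens, image):
        _, (h, _) = self.lstm(self.embed(tokens))  # final hidden state
        text_vec = h[-1]
        img_vec = self.cnn(image)
        return self.head(torch.cat([text_vec, img_vec], dim=1))

model = JointMemeClassifier()
tokens = torch.randint(0, 10000, (4, 20))   # 4 captions, 20 tokens each
images = torch.randn(4, 3, 128, 128)        # 4 candidate meme images
print(model(tokens, images).squeeze(1))     # P(image is a meme)
```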

egemenpamukcu commented 2 years ago

Using Satellite Imagery and Deep Learning to Evaluate the Impact of Anti-Poverty Programs. Luna Yue Huang, Solomon M. Hsiang, and Marco Gonzalez-Navarro.

  1. In this paper, the authors attempt to overcome a critical bottleneck in the evaluation of anti-poverty programs. Such programs have historically relied on repeated in-person field surveys to measure program effects, which are costly in time and resources and still prone to human error. The authors present evidence that anti-poverty programs can instead be evaluated using high-resolution satellite images and deep learning. They estimate changes in household welfare in a recent anti-poverty program in Kenya, mainly using housing quality inferred from satellite imagery by a deep learning model. Referencing previous research on the relationship between housing quality and wealth, they suggest that impact evaluation can be done at scale and that changes in poverty can be tracked over time to a reasonable extent.
  2. The universe of social science variables we can estimate by using satellite imagery is virtually limitless, especially as these images improve in quality and frequency. We can extend this to natural experiments as well. For instance, we can look at the relationship between fertilizer prices and agricultural output. Fertilizer prices are often susceptible to external shocks, especially in the developing world. We can compare regions that have experienced these shocks with the ones that have not and use satellite imagery to estimate baseline and endline agricultural output and crop type. This would allow us to measure the effect of these shocks, understand the heterogeneity of effects, as well as devise programs to mitigate the adverse effects of these shocks.
  3. To carry out such a study, we would need at least two datasets: first, data on local fertilizer prices (or, if those are inaccessible, prices of commodities that drive fertilizer prices), and second, an image dataset of agricultural sites with corresponding output figures. We would use the images and labels (agricultural output) to train a CNN, then use this model to make predictions on unseen imagery. Finally, we would compare predictions before and after the price shocks to estimate an average treatment effect (see the sketch below).
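A hedged sketch of that pipeline, assuming labeled satellite tiles are available: fine-tune a pretrained ResNet to regress agricultural output from a tile, then compare its predictions on pre- and post-shock imagery of the same regions. The data below are random placeholders and the yield units are made up for illustration.

```python
import torch
import torch.nn as nn
from torchvision import models

# ResNet-18 backbone regressing a single output (e.g. crop yield per tile).
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# Placeholder training batch: labeled satellite tiles (yield in tons/ha).
tiles = torch.randn(8, 3, 224, 224)
yields = torch.rand(8, 1) * 5.0

model.train()
optimizer.zero_grad()
loss = loss_fn(model(tiles), yields)
loss.backward()
optimizer.step()

# After training: predict on pre- and post-shock imagery for the same
# regions and take the difference as the estimated change in output.
model.eval()
with torch.no_grad():
    pre = model(torch.randn(16, 3, 224, 224))
    post = model(torch.randn(16, 3, 224, 224))
    effect = (post - pre).mean().item()
print(f"estimated average change in predicted output: {effect:.3f}")
```
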
yujing-syj commented 2 years ago

Nguyen, Thanh Tam, Thanh Dat Hoang, Minh Tam Pham, Tuyet Trinh Vu, Thanh Hung Nguyen, Quyet-Thang Huynh, and Jun Jo. "Monitoring agriculture areas with satellite images and deep learning." Applied Soft Computing 95 (2020): 106565. https://doi.org/10.1016/j.asoc.2020.106565

  1. In this paper, the researchers aim to develop an autonomous, intelligent system built on satellite images to differentiate crop areas from non-crop areas. The difficulty is that the seasonal nature of crops, the complexity of spectral channels, and adverse conditions such as cloud cover and solar radiance all affect satellite image processing. The authors propose a novel multi-temporal, high-spatial-resolution classification method with a spatio-temporal-spectral deep neural network to locate paddy fields at the pixel level, both across a whole year and at each temporal instance. This study could benefit agricultural applications, since the traditional approach to land monitoring for food security (requiring field work and surveys) is time-consuming and costly; with this approach, governments could use satellite data for accurate land monitoring.

  2. I am very impressed by two methods used in this paper. The first is the preprocessing of the satellite data: because of spatio-temporal differences and adverse imaging conditions, several pre-processing routines are necessary, including spectral normalization, geometric correction, solar correction, atmospheric correction, topographic correction, and radiometric normalization, which I can refer back to whenever I need to work with satellite imagery. The second is the model itself: a deep neural network architecture that integrates spectral, spatial, and temporal information at the same time. The network has multiple modules (sub-networks): (i) an input module that feeds the imagery data to succeeding layers, (ii) a BiLSTM module that handles temporal patterns, (iii) a convolutional module that processes spatial and spectral dependencies across pixels, and (iv) an output module that returns the classification result. This multi-temporal deep neural network for rice mapping is brilliant! A model like this could be used to detect changes in social patterns from satellite imagery (a sketch of the temporal module appears after this list).

  3. If detailed time-series satellite images are available, the model could be applied to monitor the change and evolution of urban or rural areas. For our final project, we could apply this kind of model to city satellite imagery to classify different kinds of buildings and correlate them with the degree of gentrification.
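A small sketch of the temporal (BiLSTM) piece only, assuming each pixel is represented by a year-long sequence of multispectral readings; the full paper combines this with convolutional spatial/spectral modules, which are omitted here. Band counts and sequence lengths are assumptions.

```python
import torch
import torch.nn as nn

class PixelBiLSTM(nn.Module):
    """Classify each pixel's yearly spectral time series as crop / non-crop."""
    def __init__(self, n_bands=10, hidden=64, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(n_bands, hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                  # x: (pixels, timesteps, bands)
        out, _ = self.lstm(x)              # (pixels, timesteps, 2*hidden)
        return self.head(out[:, -1, :])    # use the last timestep

# Placeholder: 32 pixels, 24 satellite passes over a year, 10 spectral bands.
model = PixelBiLSTM()
series = torch.randn(32, 24, 10)
logits = model(series)
print(logits.argmax(dim=1))   # 0 = non-crop, 1 = crop (per pixel)
```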

hsinkengling commented 2 years ago

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text https://papers.nips.cc/paper/2021/hash/cb3213ada48302953cb0f166464ab356-Abstract.html

1. This paper develops the Video-Audio-Text Transformer (VATT), a transformer-based neural network that takes video, audio, and text inputs simultaneously and achieves high accuracy on an assortment of downstream tasks. Training is self-supervised: the different modalities are used to predict each other, requiring no human annotation. The technical contribution of the paper is to demonstrate the potential of the transformer framework as a versatile, general-purpose model that can fit many kinds of data. To make the model work, the authors detail its architecture, which includes a modality-specific tokenization layer, a sampling method called DropToken, and training methods such as common-space projection and contrastive learning. The model performs well on tasks including video action recognition, audio event classification, and image classification.

2. One of the main challenges for conversation analysis is the strenuous work of manually coding interactions by listening to audio and watching video. While I'm skeptical that VATT would be able to detect the most intricate details that conversation analysts code, it could surely help with coding coarser interactional events such as raising hands, speaking, or standing up. One potential application is the study of indoor mask wearing once it is no longer mandatory.

3. After obtaining CCTV data of people in indoor settings from a public source, we could train VATT on such data and fine-tune it to detect actions or states relating to masks: taking out a mask, wearing a mask, adjusting mask fit, taking off a mask, sneezing into a mask, etc. These detections could then be aggregated into mask-wearing proportions and put into a regression with COVID rates, or analyzed as action sequences that help us understand the (de-)motivations for indoor mask wearing. (A stand-in fine-tuning sketch follows.)
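Since VATT itself is not an off-the-shelf package here, this sketch substitutes a pretrained 3D-CNN video backbone from torchvision and fine-tunes it on short clips labeled with mask-related actions. The action label set, clip shapes, and hyperparameters are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

# Pretrained 3D ResNet as a stand-in video backbone (not VATT); replace the
# head with mask-related action classes.
ACTIONS = ["wearing_mask", "adjusting_mask", "removing_mask", "no_mask"]
model = r3d_18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, len(ACTIONS))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# Placeholder batch: 2 clips of 16 RGB frames at 112x112, with labels that
# would come from hand-annotated CCTV segments.
clips = torch.randn(2, 3, 16, 112, 112)
labels = torch.tensor([0, 2])

model.train()
optimizer.zero_grad()
loss = criterion(model(clips), labels)
loss.backward()
optimizer.step()
print(f"fine-tuning step complete, loss = {loss.item():.3f}")
```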

pranathiiyer commented 2 years ago

The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes

  1. This paper uses a multimodal approach to classify hate speech in memes. The authors construct memes using licensed Getty images and employ annotators to label each meme as hateful or not hateful; the final dataset comprises 10k memes. They use a standard ResNet-152 and features from the fc6 layer of Faster R-CNN as unimodal image models, and BERT as the unimodal text model. They then apply fusion methods such as taking the mean of the unimodal output scores or concatenating the unimodal features, and compare these with multimodal models such as ViLBERT. They report an accuracy of 87% and find that multimodal models do better, and that the more advanced the fusion, the better the model performs.

  2. Memes are an integral part of communication and expression on social media; they are key elements of culture that represent political, social, and many other real-world phenomena. This kind of multimodal analysis of images and text can be integral to understanding memes and their cultural implications, helping us understand a significant part of social media culture and how information spreads and penetrates across platforms. (A simple fusion sketch follows the list below.)

  3. My group and I plan to use methods like this on a large dataset of memes gathered from a few online sources. While our objective is not hate-speech detection specifically, the method can be adapted to different binary classification problems: whether an image is a meme or not, misogynistic or not, political or not, and so on. The paper also highlights the advantage of multimodal approaches over unimodal ones when dealing with memes, which could be instrumental for our project.
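A sketch in the spirit of the simpler fusion baselines described above (concatenating unimodal features), not the paper's exact configuration: frozen BERT text features and frozen ResNet-152 image features feed a small trainable classifier. Images are placeholders, and the frozen-feature setup and head sizes are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models
from transformers import BertTokenizer, BertModel

# Frozen unimodal feature extractors.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()
resnet = models.resnet152(pretrained=True)
image_encoder = nn.Sequential(*list(resnet.children())[:-1]).eval()  # 2048-d

# Small trainable head over the concatenated features.
classifier = nn.Sequential(nn.Linear(768 + 2048, 256), nn.ReLU(),
                           nn.Linear(256, 1))  # logit for "hateful"

captions = ["example meme caption", "another caption"]
images = torch.randn(2, 3, 224, 224)   # placeholder meme images

with torch.no_grad():
    enc = tokenizer(captions, return_tensors="pt", padding=True,
                    truncation=True)
    text_feats = bert(**enc).last_hidden_state[:, 0]   # [CLS] token, 768-d
    img_feats = image_encoder(images).flatten(1)       # 2048-d

logits = classifier(torch.cat([text_feats, img_feats], dim=1))
print(torch.sigmoid(logits).squeeze(1))   # P(hateful) per meme
```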

ValAlvernUChic commented 2 years ago

Dank Learning: Generating Memes Using Deep Neural Networks

  1. This paper introduces a meme generator that aims to create not just plausible memes but humorous ones. The model uses a pre-trained Inception-v3 network to produce an image embedding, which is then passed to an attention-based deep-layer LSTM model that generates a caption. To encourage diversity in the captions, the authors employ a modified beam search. To evaluate the quality of the model, they use both perplexity, a classic information-theoretic measure, and human assessment; ultimately, the model produces memes that could not be reliably differentiated from real memes. Central to the study are three encoder variants: the first uses only image embeddings as input to the text generation model; the second concatenates image embeddings with averaged text embeddings of words that describe the meme image; and the last uses the same embeddings but adds an attention mechanism to the encoder architecture. The decoder is kept largely constant, except for an attention mechanism in the last model.

  2. Memes have become one of the primary modes of communication for online communities. What makes them special is precisely what is lacking in longer-form textual commentary or discourse: a mode of communication accessible to the layperson. The paper hints that, first, it is possible to generate an image that could be considered a meme and, second, that it can be humorous. While the paper uses perplexity to probe what makes a meme a meme, it uses it only to measure whether the model is "learning to caption images of different formats with the correct style". Notably, as the authors admit, "it is a limited metric for success as it tells us nothing about whether the captions are humorous, original and varied". This opens up an opportunity for us to explore a measure by which we can decide whether a meme is a meme or not.

  3. Our group plans to use large meme datasets, first building a classifier to identify different problems in meme identification. With these memes, we could take the images used in them, auto-generate captions that describe them literally, and use these literal descriptions as negative samples to contrast with the actual meme captions. The hope is that measurable differences between the two can show quantitatively what makes a meme a meme.

ShiyangLai commented 2 years ago

Racial disparities in automated speech recognition https://www.pnas.org/doi/10.1073/pnas.1915768117

The paper examines the ability of five state-of-the-art ASR systems (developed by Amazon, Apple, Google, IBM, and Microsoft) to transcribe structured interviews conducted with 42 white speakers and 73 black speakers. The researchers found that all five ASR systems exhibited substantial racial disparities, with an average word error rate (WER) of 0.35 for black speakers compared with 0.19 for white speakers. They further trace these disparities to the underlying acoustic models used by the ASR systems, as the race gap was equally large on a subset of identical phrases spoken by black and white individuals in the corpus.

Although this paper does not propose a specific ASR architecture, it has substantial implications for future work on ASR system development. More specifically, the findings underscore the importance of diversifying the training data: since an ASR model trained on a single audio corpus can be significantly biased, extending training to other corpora recorded in disparate contexts can reduce these performance differences and help ensure that speech recognition technology is inclusive.

I am currently working as an RA on a project about police speech recognition and emotion detection. The project uses a brand-new police corpus, and no pre-trained model achieves good performance on it; we attribute this to exactly the same cause identified in this paper. We are therefore extending training to multiple corpora, and this paper makes me feel more confident about our current efforts. (A small sketch of the per-group WER measurement appears below.)
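A minimal sketch of the paper's core measurement, computing word error rate separately by speaker group, using the jiwer package; the transcripts below are made-up placeholders standing in for human reference transcripts and ASR output.

```python
import jiwer

# Placeholder (reference transcript, ASR hypothesis, speaker group) triples;
# in practice these come from human transcripts and ASR system output.
samples = [
    ("he was going to the store", "he was going to this store", "black"),
    ("the meeting starts at nine", "the meeting starts at nine", "white"),
    ("i talked to my cousin yesterday", "i talk to my cousin yesterday", "black"),
    ("we drove down to the lake", "we drove down to the lake", "white"),
]

for group in ("black", "white"):
    refs = [r for r, _, g in samples if g == group]
    hyps = [h for _, h, g in samples if g == group]
    print(f"{group}: WER = {jiwer.wer(refs, hyps):.2f}")
```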

JadeBenson commented 2 years ago

Gong, et al. “A multi-center study of COVID-19 patient prognosis using deep learning-based CT image analysis and electronic health records,” European Journal of Radiology 139, June 2021, https://doi.org/10.1016/j.ejrad.2021.109583.

Link to the full article: https://www.sciencedirect.com/science/article/pii/S0720048X21000632?casa_token=jLvrcwx_AToAAAAA:YvRgFE4HtpYesB6RIV8-wnOgL8Q_1aran89frcSKL5hqZPyJ9e2Im-2PesCNb0JEtxtjbVM

1) In this article, Gong et al. use deep learning to identify biomarkers from CT images by segmenting lung infection regions, and combine these with electronic health records (EHR) to predict COVID-19 prognosis (ICU admission or death). They compare results of this deep learning model across three cohorts from different countries and obtain AUC values ranging from 0.85 to 0.93. Such models can be useful for predicting COVID-19 prognosis so that resources are allocated appropriately. Methods details: the CT image segmentation is performed using a dense 3D network with encoding and decoding levels (the final layer is a sigmoid activation). Thresholds are then set to determine whether the segmented regions are more opaque, indicating consolidation and therefore more advanced/severe disease. These features are combined with demographic and lab-test information derived from the EHRs, and a GLM is used to predict ICU admission or death (a sketch of this final stage appears below).

2) I think this article is interesting because it demonstrates how we can combine multiple types of data through deep learning to build effective models in the service of public health. In this application, they only include patients who have tested positive for COVID-19 and predict whether they will progress to a severe outcome. I would be curious to extend this model to incorporate time and include patients at the lower end of the severity spectrum. As the pandemic has worn on, vaccination and prior infection reduce the likelihood that future infections will be serious, and this time-varying dimension could help such models stay useful. The vast majority of patients who test positive for COVID-19 do not receive CT scans, nor are they likely to reach such severe outcomes; once a CT scan has been ordered, the patient is already part of a rarefied, high-risk population. I wonder whether these models could be made flexible enough to include CT images when available, but also use other data from less-sick patients to identify risk earlier, when there is a better chance of intervening successfully. I remember audio recordings of coughs have been suggested, or perhaps a much simpler model with sociodemographics and basic health indicators would be sufficient.

3) If I were to move forward with this more generalizable model, I would first conduct a more thorough literature review, since there has been an abundance of COVID-19 publications. My current ideas would be to include testing dates in the GLM, perhaps with a spatial component as well, since these might indicate areas with more severe outbreaks and/or differences in access to and quality of medical care. I would certainly include other health indicators like BMI, chronic conditions, smoking, etc. As mentioned above, the CT scan alone might indicate worse outcomes, and audio recordings of coughs could also be tested.
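A sketch of the final modeling stage only (not the segmentation network): image-derived severity features combined with EHR covariates in a logistic GLM, fit with scikit-learn. All data below are synthetic stand-ins, and the feature names and effect sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins: CT-derived features (e.g. infected-lung fraction,
# consolidation fraction) plus EHR covariates (age, lab value, comorbidity).
ct_features = rng.uniform(0, 1, size=(n, 2))
ehr_features = np.column_stack([
    rng.normal(60, 15, n),          # age
    rng.normal(0, 1, n),            # standardized lab value
    rng.integers(0, 2, n),          # comorbidity flag
])
X = np.hstack([ct_features, ehr_features])

# Synthetic outcome: ICU admission or death, loosely driven by the features.
risk = 2.5 * ct_features[:, 0] + 0.03 * ehr_features[:, 0] - 3.5
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-risk))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
glm = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUC:", round(roc_auc_score(y_te, glm.predict_proba(X_te)[:, 1]), 3))
```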

isaduan commented 2 years ago

memeBot: Towards Automatic Image Meme Generation

Link: https://arxiv.org/pdf/2108.03886.pdf

  1. Image memes have become a widespread tool for interacting and exchanging ideas over social media, blogs, and open messengers. This paper proposes treating automatic image meme generation as a translation process and presents an approach to generate an image meme for any given sentence using an encoder-decoder architecture. For a given input sentence, a meme is generated by combining a meme template image with a text caption: the template is chosen from a set of popular candidates by a selection module, and the caption is generated by an encoder-decoder model, where the encoder maps the selected template and the input sentence into a meme embedding and the decoder decodes the caption from that embedding. The generated caption is conditioned on both the input sentence and the selected template, so the model learns the dependencies between meme captions and meme template images and generates new memes from those learned dependencies.
  2. I like the idea of framing meme generation as a translation problem! Furthermore, the model's ability to learn the dependencies between meme captions and meme template images could be leveraged for socio-cultural analysis, such as: What is the boundary of humor? Do we laugh at memes because they are out of context? What semantic relation between the image and the caption gives rise to laughter?
  3. Since our group project is about meme classification, we would love to consider how to repurpose the encoder-decoder architecture to design a classifier.
BaotongZh commented 2 years ago

A CNN-RNN Framework for Image Annotation from Visual Cues and Social Network Metadata. Link: https://ieeexplore.ieee.org/abstract/document/9412275?casa_token=4I2KXSRV770AAAAA:yTaABmRiauZ2khfYg1z-Fskfmd7ayGnT1KMj9GiVQWzPGU1eJarhtrU0C2QZSE-5Rrz-FUfA

1) In this paper, the authors show that common visual models for image classification, which use metadata to retrieve neighboring images, can be improved by incorporating semantic mappings and recurrent neural networks. They characterize the performance and variability of a variety of visual and joint models, and their models outperform state-of-the-art approaches on several metrics. They also show that semantic mappings can be highly effective in improving performance, while providing robustness to changes in metadata vocabulary and in the quality of neighborhoods. The authors jointly exploit visual features and the tags that commonly accompany images on social networks; the tags (a meaningful representation) are embedded using different semantic mappings.

2) This paper can be readily extended to social science research involving images and their associated metadata through a combined model of visual and metadata embeddings. For example, when detecting spam images on online social media (like Twitter), the visual cues by themselves are not enough to classify an image, so the metadata and its embedding become quite significant. We could also extend this kind of image-plus-metadata analysis to videos and their metadata.

3) This paper is highly relevant to our group project, which uses satellite images together with the metadata of their neighborhoods (in this case, economic data and social network data). In this setting we would perform a semi-supervised rather than a fully supervised task: we might have a satellite image of an area without any label, but we do have metadata for the area and labels and metadata for its neighbors. We could therefore use this CNN-RNN joint model to predict whether a place is going to be gentrified based on its neighbors' labels and metadata, which may make our model more robust and more accurate.

y8script commented 2 years ago

Using goal-driven deep learning models to understand sensory cortex https://www.nature.com/articles/nn.4244

  1. In this review paper, the authors discuss recent developments in building hierarchical convolutional neural networks (HCNNs) as models of single-cell and population-level neural responses in the brain's sensory systems. The basic idea is to first optimize the network parameters on a task relevant to what the brain actually does, and then compare the network to neural data. They argue that a neural network has to be effective at solving the sensory behavioral task in order to become a good model of the sensory neural system. Studies have found that the top hidden layers of categorization-optimized HCNNs predict neuronal responses in inferior temporal (IT) cortex even better than ideal-observer models, while intermediate and lower HCNN layers predict neuronal responses in V4 and V1 (well-studied visual processing regions). HCNNs may thus act as generative models of these specific cortical regions.

  2. This approach shows the possibility that neural networks, which are initially inspired by neural systems to some extent, can in turn be used as a model for actual neuronal responses. As visual processing is one of the most extensively studied regions in the human brain, and CNN also has been widely implemented and tuned, it's natural that the models for the visual system are the first successful cases. However, neural networks are also trained to conduct many other human-like tasks, and there is hope that we can draw analogies from some other neural networks to other brain regions.

  3. Similar to visual processing, we could explore whether neural networks for audio processing (e.g., speech recognition) can also serve as neuronal models of the auditory system. We could train networks on a task also performed by humans or other primates, then compare network activations with human or non-human primate neuronal responses (a sketch of this kind of encoding-model comparison appears below). However, I have to admit this is not something that can be easily implemented with existing data.
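A sketch of the standard encoding-model comparison the review describes: extract activations from an intermediate layer of a pretrained CNN for a set of stimuli and fit a ridge regression mapping them to recorded neuronal responses. Both the stimuli and the "neural data" below are random placeholders, so the held-out fit is expectedly near zero; the layer choice is an assumption.

```python
import numpy as np
import torch
from torchvision import models
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Pretrained AlexNet; take activations from an intermediate conv layer.
alexnet = models.alexnet(pretrained=True).eval()
images = torch.randn(60, 3, 224, 224)   # placeholder stimulus images

with torch.no_grad():
    acts = alexnet.features[:8](images)       # mid-level conv activations
    X = acts.flatten(1).numpy()               # (stimuli, features)

# Placeholder "neural data": responses of 20 recorded units per stimulus.
y = np.random.default_rng(0).normal(size=(60, 20))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
encoder = RidgeCV(alphas=[1.0, 10.0, 100.0]).fit(X_tr, y_tr)
print("held-out R^2 (random data, so near or below 0):",
      round(encoder.score(X_te, y_te), 3))
```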

zihe-yan commented 2 years ago

Pepino, L., Riera, P., & Ferrer, L. (2021). Emotion recognition from speech using wav2vec 2.0 embeddings. arXiv preprint arXiv:2104.03502.

  1. To tackle the problem that emotion recognition datasets are usually small, this paper proposes a transfer learning approach built on wav2vec 2.0. Besides combining the outputs of a pre-trained model, the researchers also compare results before and after fine-tuning. Evaluating on two datasets, IEMOCAP and RAVDESS, they confirm that the model outperforms the alternatives, obtaining recall of 84.1% and 72.1% respectively.

  2. Speech has long been an important part of social science research. In communication, text carries the information, but emotion conveys contextual information that the words alone may not. Such methods could be applied to recordings of interviews and political speeches; for political speeches in particular, which often contain highly rhetorical language, the text alone may not convey the full information in the speech. (A sketch of extracting wav2vec 2.0 embeddings follows the list below.)

  3. I think it would be interesting to apply this method to our group's project analyzing movie trailers, to see whether there are recurring emotional patterns among characters of a specific gender, for example whether female characters are always the ones expressing fear or anger.
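A sketch of extracting wav2vec 2.0 embeddings with the Hugging Face implementation, to be pooled and fed to a downstream emotion classifier; the waveform below is synthetic noise standing in for a trailer clip or an IEMOCAP/RAVDESS utterance, and mean-pooling is one simple choice among several.

```python
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Pretrained wav2vec 2.0 (base) from Hugging Face.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

# Placeholder waveform: 2 seconds of noise at 16 kHz.
waveform = np.random.randn(32000).astype(np.float32)

inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state     # (1, frames, 768)

# Mean-pool over time to get one fixed-length embedding per clip; this
# vector would be the input to a small emotion classifier.
clip_embedding = hidden.mean(dim=1)
print(clip_embedding.shape)   # torch.Size([1, 768])
```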

mdvadillo commented 2 years ago

Automatic Speech Emotion Recognition Using Machine Learning. Leila Kerkeni, Youssef Serrestou, Mohamed Mbarki, Kosai Raoof, Mohamed Ali Mahjoub, and Catherine Cleder (link)

  1. The authors present a comparative study of speech emotion recognition systems. They compare three machine learning algorithms (MLR, SVM, and RNN) and use them to classify seven emotions in speech after performing feature extraction: Mel-frequency cepstral coefficients and modulation spectral features are extracted from the data. Two databases (Berlin and Spanish) are used to train and test the models. The paper studies how classifiers and features affect the recognition accuracy of emotions in speech, finding that more features do not necessarily lead to better results, and that the best speech emotion recognition results are obtained with an RNN. (A sketch of the feature-plus-classifier pipeline appears after this list.)

  2. These models and results have a lot of potential applications in the social sciences, since speech is key to human communication. Understanding emotion (and, by extension, tone) can provide insights into political science (how politicians speak), psychology (tone perception and response), and sociology (communication among members of a group). The nuances carried by speech that text-based models cannot pick up can give additional insight into human relationships and behavior.

  3. The methods and topics touched on by the paper relate to our final project, where we want to analyze speeches and compare and contrast the sentiments expressed in them conditional on context and speaker.
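A small sketch of the MFCC-feature-plus-classifier pipeline the paper compares, here with an SVM on synthetic audio (the RNN variant would replace the final estimator). The one-second noise clips stand in for labeled utterances from a corpus such as Berlin EMO-DB, and the three-emotion label set is an assumption.

```python
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
EMOTIONS = ["anger", "joy", "sadness"]

def mfcc_features(y, sr=16000, n_mfcc=13):
    """Mean MFCCs over time: one fixed-length vector per utterance."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

# Placeholder "utterances": 1-second noise clips with random emotion labels.
X = np.array([mfcc_features(rng.normal(size=16000)) for _ in range(60)])
y = rng.integers(0, len(EMOTIONS), size=60)

svm = SVC(kernel="rbf")
scores = cross_val_score(svm, X, y, cv=5)
print("cross-validated accuracy (random data, ~chance):", scores.mean())
```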

yhchou0904 commented 2 years ago

Recurrent Convolutional Strategies for Face Manipulation Detection in Videos

The spread of misinformation becomes a more serious problem as people gain the ability to generate images or videos containing false information automatically or with little effort. For example, models such as Deepfake, Face2Face, and FaceSwap can make fake videos using people's faces: they transpose a person's identity or expression and produce results believable enough that the fakes are hard to spot. Most fakeness-detection techniques are based on image processing, but this paper focuses on video-based face manipulation detection. The method has two parts: face preprocessing and manipulation detection. For the first part, it uses landmark-based alignment and a spatial transformer network to align faces; it then builds CNN-based models, including a recurrent-convolutional neural network, to detect manipulation. The strategy of applying bi-directional recurrence helps examine the discontinuity between frames and thus reveal the presence of face manipulation.

Ensuring correct information and fact-checking is one of the most important goals of social science research. Manipulation detection could help us not only label manipulated content but also warn people about fakeness in real time, preventing them from absorbing wrong information; it could also help us check the reality of our data and keep us from making wrong decisions based on fake information. The idea of face alignment could also be used in other applications: for example, we could use landmark alignment and spatial transformer networks to track changes in urban development in videos of city scenes, and the bi-directional recurrent model lets us include temporal dependencies in this kind of streaming data.

As described above, we could apply this manipulation detection technique to video data that might be generated by models or algorithms, especially videos on social media, where people might not be aware they are receiving wrong information and where more videos are intended to mislead.

min-tae1 commented 2 years ago

Visual and Textual Sentiment Analysis of Daily News Social Media Images by Deep Learning https://doi.org/10.1007/978-3-030-30642-7_43

  1. This paper employs deep convolutional neural networks (DCNNs) that analyze both the visual and textual features of social media images to improve the accuracy of sentiment analysis. The framework presented in the paper includes a visual feature extractor, a textual feature extractor, and an overall sentiment classifier. Experiments conducted on the SIMPSoN dataset demonstrate the effectiveness of the approach.

  2. Memes are a crucial feature of social media and online communities that requires analysis. Users gain attention and frame issues by expressing their beliefs in memes that include text, and websites and apps now let individuals quickly add text to pictures or photos, so more and more online content involves both words and images. A deep learning method that accounts for both modes is therefore imperative for understanding what is going on in cyberspace. Images on subreddits discussing political issues could be analyzed to understand how a community feels about certain issues and how those feelings change over time; the same could be done for Facebook groups and for hashtags on Instagram and Twitter. Moreover, comparing similar communities on different platforms could yield interesting results: for instance, there might be differences between r/The_Donald and /qresearch/ on 8kun in the sentiments expressed through memes. This could also help us understand the features of social media platforms and how they affect the social movements of our time.

  3. The Reddit Pushshift dataset would be useful for understanding communities on Reddit, and scraping tweets by hashtag would help us study social movements on Twitter. Finally, comparing image sentiment between similar communities on Reddit and Twitter would reveal the role platforms play in online communities and social movements.

sudhamshow commented 2 years ago

Exploiting the Interplay between Social and Task Dimensions of Cohesion to Predict its Dynamics Leveraging Social Sciences (Best paper ICMI'21)

Summary: (Background: emergent states are behavioral/cognitive states that emerge when people collaborate, e.g., cohesion. When researchers define measurements of these emergent states, they end up with multiple dimensions of measurement. The two dimensions studied here are the social dimension, referring to the interpersonal bonds between group members, and the task dimension, corresponding to the group members' shared commitment to the task.) The authors aim to automate the measurement of cohesion across these dimensions and to study the interplay between them. Prior work had only been able to predict the overall level of cohesion or the intensity of a single dimension (social or task-based). The authors introduce a DNN architecture called Transfer Between Dimensions to study the interplay between the social and task dimensions. They use the GAME-ON dataset, which is specifically designed for the study of social and task cohesion: a multimodal dataset (audio, video, and motion-capture recordings) in which small groups of three friends interact in the context of an escape game. The authors use pretrained-model-based transfer learning to predict task cohesion dynamics: they pretrain a model to predict social cohesion using information on both individual-level and group-level social dynamics. The model uses a fully connected layer for both the individual-level and group-level inputs (applied in sequence), each followed by an LSTM layer after the ReLU output. The output of the social-cohesion model is then used to predict task cohesion dynamics using the insights learned beforehand. This interactive model performs better than all previous models as well as the baseline models the authors build for the experiment.

Application: This model could find widespread application in behaviour and organisation studies. A key research area in these fields is finding methods to promote productivity and cooperation among teams. Some studies have found that social cohesion is detrimental to the evolution of cooperation while task based cohesion promotes it. Disentangling these dimensions, studying them separately and observing how these dimensions interact (as done in the paper) can help discover more such phenomena.

Data and Implementation: Observing people in a problem-solving environment would be a great original data source for this kind of study (e.g., studying groups in breakout-room challenges). One could also use readily available data, such as video recordings from team-based game shows like Minute to Win It or Survivor.

Yaweili19 commented 2 years ago

An modeling processing method for video games based on deep reinforcement learning Runjia Tan; Jun Zhou; Haibo Du; Suchen Shang; Lei Dai https://ieeexplore.ieee.org/abstract/document/8785463?casa_token=ii4ha0nJtTEAAAAA:kq0mS3bynw0pf6XfT-jBU7oq40ixYC87_ZHelLbjMBQEqhTQpaUKmH4bc7BjCt3xVNI_fWqx

The traditional Q-learning strategy in reinforcement learning can help an agent achieve good scores on simple games with a limited number of states; if the game model is simple enough, continuous states can be discretized into finite states, allowing the agent to obtain favorable results. Traditional Q-learning, however, confronts numerous challenges in video games with visual output, such as having to store excessively complex image states. In this paper, a deep reinforcement learning (DRL) technique, the Deep Q-Network (DQN), is used to model video games from their visual output. DQN is derived from Q-learning combined with artificial neural networks. Several image-processing optimizations and neural network structures are then used in the model training procedure.

Image pre-processing of the game screen is required before the experiment: after scaling down and simplifying the details, the image is converted to grayscale and its resolution is compressed to 96x96, which still shows all the elements needed in the picture. There are three main elements in the game: the car, the road, and the empty space; the car scores points by staying on the road and gains no points on the empty space. A 3+2 configuration is selected for the network, meaning three convolutional layers followed by two fully connected layers. (A sketch of this Q-network appears below.)

This article applies deep image learning methods to simplify complex game images into relatively simple, learnable data. Social science research that involves similarly complex image learning could use this approach to achieve better model performance.
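A sketch of the "3+2" Q-network described above: three convolutional layers and two fully connected layers taking a 96x96 grayscale frame and outputting one Q-value per action. The action count and layer widths are assumptions, not the paper's exact configuration, and the full DQN training loop (replay buffer, target network) is omitted.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """3 conv + 2 fully connected layers, as in the '3+2' setup described."""
    def __init__(self, n_actions=5):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),   # 96 -> 23
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),  # 23 -> 10
            nn.Conv2d(32, 32, kernel_size=3, stride=1), nn.ReLU(),  # 10 -> 8
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, 256), nn.ReLU(),
            nn.Linear(256, n_actions),      # one Q-value per action
        )

    def forward(self, x):
        return self.fc(self.conv(x))

# Placeholder input: a batch of 4 grayscale game frames at 96x96.
q_net = QNetwork()
frames = torch.rand(4, 1, 96, 96)
q_values = q_net(frames)
print(q_values.shape)          # torch.Size([4, 5])
print(q_values.argmax(dim=1))  # greedy action per frame
```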

chentian418 commented 2 years ago

Facial expressions of authenticity: Emotion variability increases judgments of trustworthiness and leadership https://www.sciencedirect.com/science/article/pii/S0010027718302671

  1. This paper offers the first investigation of how variability in facial emotion affects social evaluations. Over time, target individuals displayed either high-variability or low-variability distributions of positive (happy) and/or negative (angry/fearful/sad) facial expressions, while the overall averages of those expressions were held constant across conditions. The authors found that high variability led to consistently more positive perceptions of authenticity and, thereby, to higher judgments of perceived happiness, trustworthiness, leadership, and team-member desirability. Overall, people do not merely average or summarize facial expressions to arrive at a judgment; they also draw inferences from the variability of those expressions.
  2. As the study notes, in certain situations high variability might instead be a negative social cue, if the person seems unable to control their emotional expressions (thus appearing unstable or "unhinged"). Alternatively, future work might explore how cues of dominance or leadership combine with variability to influence other social judgments.
  3. The former argument could be explored in future studies, perhaps using methods to amplify variability-related social cues (e.g., affective voices or other nonverbal behavior paired with faces), and thus examining these effects with multimodal paradigms. The dataset could be videos of conference calls.
thaophuongtran commented 2 years ago

KnowMeme: A Knowledge-enriched Graph Neural Network Solution to Offensive Meme Detection https://ieeexplore.ieee.org/abstract/document/9582340

1) In this paper, the authors develop KnowMeme, a knowledge-enriched graph neural network solution that improves the detection of hateful memes on social media by drawing on facts from human commonsense knowledge. According to the evaluation results, their method significantly outperforms the baseline methods in accurately detecting offensive memes. KnowMeme is designed to address two main problems in classifying memes and detecting the implicit relationship between their visual and textual contents: first, the challenge of effectively incorporating human commonsense knowledge into the model so that it captures the implicit meanings of meme contents; and second, the challenge of identifying the cross-modal, knowledge-based relations between objects in the visual and textual content that jointly insinuate offensive messages.

2) This incorporation of knowledge into a deep learning solution to improve classification by capturing implicit relationships is interesting and could lead to a wide range of applications in social science analysis. It would work best in scenarios where implicit bias or context exists and cannot be captured by existing models; examples include, but are not limited to, detecting satire in text or audio data, identifying relationships between characters in an image, and examining tension in a recorded meeting.
3) For our final project we are working with a meme dataset to conduct meme detection, extracting the text from each image via OCR for the textual content and using the image itself for the visual content (a small OCR sketch appears below). As we explore the meme data, we run into the same challenge the paper notes: existing solutions often ignore the implicit relationship between the visual and textual contents of a meme. We could attempt to address this using the KnowMeme approach from this paper and measure the improvement in performance.
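A minimal sketch of the OCR extraction step mentioned above, assuming pytesseract and the Tesseract binary are installed; the file path is hypothetical, and the grayscale-plus-binarization preprocessing is a common heuristic rather than the paper's pipeline.

```python
from PIL import Image
import pytesseract

# Requires the Tesseract binary to be installed on the system.
def extract_meme_text(path):
    """OCR the caption text from a meme image file."""
    img = Image.open(path).convert("L")              # grayscale often helps OCR
    bw = img.point(lambda p: 255 if p > 128 else 0)  # simple binarization
    return pytesseract.image_to_string(bw).strip()

# Hypothetical file path; replace with images from the meme dataset.
print(extract_meme_text("meme_0001.png"))
```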

Hongkai040 commented 2 years ago

Responsible AI: Gender bias assessment in emotion recognition

https://arxiv.org/pdf/2103.11436.pdf

This paper studies gender bias in emotion recognition from facial images, as part of the broader goal of building responsible AI. The dataset consists of videos in which people display six emotions: happiness, surprise, sadness, disgust, anger, and contempt. The authors train six distinct neural networks, using popular architectures such as CNNs, LSTMs, and ResNet, and analyze the gender bias embedded in the models according to three definitions of fairness. They find that some models, like SENetLSTM, are more biased than others, and that this pattern holds for both true positive and false positive rates. They also find disparities in how well different emotions are recognized for male versus female faces: surprise is classified better for males, upset and sad are recognized better for females, and happy is recognized almost identically for both genders.

This paper focuses on gender bias in emotion recognition from facial images. Combined with the paper I summarized last week, which discussed gender bias when using BERT for sentiment analysis, we would have a powerful toolkit for identifying gender bias in most sentiment analysis scenarios. For example, we could use this set of tools to analyze gender biases and stereotypes in online communities where both visual and textual messages may contain gender-related information, enabling a more thorough analysis than was previously possible.

For my own work, I might use these methods to analyze gender bias in emotion expression in movies. The study would consist of two parts. The first is detecting gender bias in emotion expression in movie scripts, which could be done using the method I proposed last week. The second is detecting gender bias in emotion expression in the video itself: we can extract all the frames from a film and use established image recognition methods to identify the people and their genders in those frames, then use one of the least biased image classification models, according to the results of this paper, to identify the sentiment of those movie characters (a sketch of the per-group bias metrics appears below). By examining gender bias in emotion expression in both text and image data, the results obtained would be safer and more sound.
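A sketch of the bias-assessment step itself: computing true positive and false positive rates separately by gender for one emotion class, using scikit-learn. The labels, predictions, and gender annotations below are synthetic placeholders standing in for per-frame model output.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)

# Synthetic per-frame results for one emotion (e.g. "happy"): true label,
# model prediction, and the gender annotation for each face.
y_true = rng.integers(0, 2, 1000)
y_pred = np.where(rng.uniform(size=1000) < 0.8, y_true,
                  rng.integers(0, 2, 1000))       # noisy predictions
gender = rng.choice(["female", "male"], size=1000)

for g in ("female", "male"):
    mask = gender == g
    tn, fp, fn, tp = confusion_matrix(y_true[mask], y_pred[mask]).ravel()
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    print(f"{g}: TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```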