Thinking-with-Deep-Learning-Spring-2022 / Readings-Responses

You can post your reading responses in this repository.

Week 9 - Possible Readings #15

Open lkcao opened 2 years ago

lkcao commented 2 years ago

Post a link for a "possibility" reading of your own on the topic of Digital Doubles & You in the Loop [for week 9], accompanied by a 300-400 word reflection that: 1) briefly summarizes the article (e.g., as we do with the first “possibility” reading each week in the syllabus), 2) suggests how its method could be used to extend social science analysis, 3) describes what social data you would use to pilot such a use with enough detail that someone could move forward with implementation.

isaduan commented 2 years ago

Machine behaviour

Link: https://www.nature.com/articles/s41586-019-1138-y

  1. Machines powered by artificial intelligence increasingly mediate our social, cultural, economic and political interactions. Understanding the behaviour of artificial intelligence systems is essential to our ability to control their actions, reap their benefits and minimize their harms. This paper argues that this necessitates a broad scientific research agenda to study machine behaviour that incorporates and expands upon the discipline of computer science and includes insights from across the sciences. We first outline a set of questions that are fundamental to this emerging field and then explore the technical, legal and institutional constraints on the study of machine behaviour.
  2. Because I am really interested in AI ethics & policy, I find the framework they propose really inspiring. For example, they divide the types of questions into development of behaviour, function, evolution, and mechanisms. These are all social science questions! Currently, the scientists who most commonly study the behaviour of machines are the computer scientists, roboticists and engineers who have created the machines in the first place. These scientists may be expert mathematicians and engineers; however, they are typically not trained behaviourists. I think this is an exciting, emerging field for social sciences.
  3. I am very interested in thinking of a framework to study the behavior of large language models like GPT-3. How can we test and make sense of their behavior in response to different stimuli?
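One way to pilot such a behavioural study is to treat prompts as experimental stimuli and compare a model's completions across minimally different conditions. The sketch below uses the Hugging Face `transformers` pipeline with GPT-2 as a stand-in for larger models; the model choice and the prompt pairs are illustrative assumptions.

```python
# Hedged sketch: probe a generative language model with paired stimuli that
# differ in one attribute, then compare the completions. GPT-2 stands in for
# larger models like GPT-3; the prompts are illustrative assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt_pairs = [
    ("The nurse said that he", "The nurse said that she"),
    ("People from the city are", "People from the countryside are"),
]

for a, b in prompt_pairs:
    out_a = generator(a, max_new_tokens=20, do_sample=True, num_return_sequences=3)
    out_b = generator(b, max_new_tokens=20, do_sample=True, num_return_sequences=3)
    print(a, "->", [o["generated_text"] for o in out_a])
    print(b, "->", [o["generated_text"] for o in out_b])
    # Downstream, one would code the continuations (sentiment, stereotype
    # content, refusal rate, etc.) and treat the prompt manipulation as the
    # experimental stimulus, as in a behavioural experiment.
```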
JadeBenson commented 2 years ago

1) I chose “Collaboration Robots with Artificial Intelligence (AI) as Digital Double of Person for Communication in Public Life” by Evgeniy Bryndin for this week’s possibility reading (https://pdfs.semanticscholar.org/462c/2d5e9992158ff60b1f9b90b7ab603d378d60.pdf). This is a relatively basic article that is part introductory survey of AI, ML, and deep learning techniques and part speculation about how these tools could be used to create digital twins that function as contributors to society. The author predicts a future labor market that includes cognitive robots able to perform menial tasks as well as professional work. Digital twins already allow businesses to test products with less risk and to restructure their markets; the path for these doubles to become more integrated as social creators is just beginning.

2) I think this article is interesting for how expansively it imagines the future use of this technology. It’s as if sci-fi is transforming into reality. The implications of creating robots that can function as their own social subjects are seemingly infinite. If these digital doubles progress to a point where they are able to creatively produce on their own based on their interaction with the social world, we will have effectively created new social objects for study. How will they see themselves and the world? How should we interact with them? Will we be able to use them to study the world in the same ways as when they were inputs and architectures that could be easily modified, or is using them akin to unethical scientific experimentation? Digital twins might not only be used for social science research but produce their own research and analysis as well.

3) If we progress to the stage where digital doubles are able to produce their own professional work and act as social subjects, my first curiosity is simply to ask them what they think about us and to have them perform their own social science research on us. From their perspective, what are the problems that most urgently need to be addressed, and how could we effectively tackle them? How could we use their plasticity to improve our own existence? These sorts of questions are exciting, and yet I do worry about what will happen when/if technology advances to this point, how drastically it will change our own society, and how we will protect the beings we created. We’ve written such beautiful sci-fi about this topic, and I hope we will learn from it about how to proceed thoughtfully from here.

pranathiiyer commented 2 years ago

Multimodal Sentiment Analysis To Explore the Structure of Emotions https://dl.acm.org/doi/pdf/10.1145/3219819.3219853

  1. This paper proposes a multimodal approach to sentiment analysis. It uses neural networks and natural language processing, drawing on visual and textual cues, to build a Deep Sentiment Model that can identify the emotion expressed by a user on Tumblr. The authors fine-tune a pretrained Inception model for the emotion-inference task, use GloVe embeddings (which are fed into an RNN) to extract the semantics of the text, and finally add a dense layer that combines information from both modalities. A final softmax layer gives a probability distribution over a set of candidate emotion word tags (a sketch of this fusion architecture follows this list).
  2. Most social media data today is multimodal: memes, GIFs, and text are all instrumental to communication on these platforms. While understanding text and images separately already offers significant information about social data, understanding them in tandem lets us extract key information from the modalities that dominate online communication today. For instance, most handles on Twitter and Instagram and many subreddits on Reddit are flooded with memes and GIFs that are truly meaningful only when text and image are interpreted together. A model like this can help us analyze social science problems that involve such data.
  3. My team and I are currently working on a dataset of memes, and we are trying to explore multimodal methods for understanding what makes a meme a meme. This method could be a potential approach that we could adapt to our dataset.
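A minimal sketch of the late-fusion architecture described in point 1, with assumptions: ResNet-18 stands in for Inception, a plain embedding layer stands in for GloVe initialization, and the dimensions are illustrative.

```python
# Hedged sketch of the Deep Sentiment Model's late-fusion idea: an image
# encoder and a text RNN feed a shared dense layer that outputs emotion-tag
# probabilities. ResNet-18 stands in for Inception and the embedding layer
# for GloVe; both substitutions are assumptions made for brevity.
import torch
import torch.nn as nn
from torchvision import models

class DeepSentimentSketch(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, n_emotions=15):
        super().__init__()
        cnn = models.resnet18(weights=None)        # load pretrained weights in practice
        cnn.fc = nn.Identity()                     # keep the 512-d image features
        self.image_encoder = cnn
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # init from GloVe in practice
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(512 + hidden_dim, 256), nn.ReLU(),
            nn.Linear(256, n_emotions),            # softmax applied via the loss function
        )

    def forward(self, images, token_ids):
        img_feat = self.image_encoder(images)               # (B, 512)
        _, h = self.rnn(self.embedding(token_ids))          # h: (1, B, hidden_dim)
        fused = torch.cat([img_feat, h.squeeze(0)], dim=1)  # fuse the two modalities
        return self.classifier(fused)                       # logits over emotion tags

model = DeepSentimentSketch(vocab_size=20000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 20000, (2, 30)))
```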
borlasekn commented 2 years ago

Xie, L., Feng, X., Zhang, C., Dong, Y., Huang, J., and Liu, K. (2022). Identification of Urban Functional Areas Based on the Multimodal Deep Learning Fusion of High-Resolution Remote Sensing Images and Social Perception Data. Buildings. 12(5). 556. https://doi.org/10.3390/buildings12050556

  1. This paper proposes a multimodal deep learning approach to identifying urban functional areas by combining remote sensing images with social perception data. These modes of data, however, differ in both form and source. Because existing methods are limited in building a comprehensive picture of the characteristics important to both forms of data, the authors propose a multimodal deep learning method that uses an attention mechanism to fully exploit the features present in each dataset. They extract features sequentially along two dimensions (channel and spatial) and achieve a recognition accuracy of 93% for functional areas (a simplified attention sketch follows this list).
  2. This paper can be used to extend social science analysis, particularly where spatial data is combined with opinion-based data. This applies in many situations, especially economic development. Many development problems draw on very different data sources, so the ability to build models that capture the complexity of these issues, in order to propose actionable solutions, is very important. These models can also be used to study patterns of gentrification and development, such as the building of hospitals and schools. As more spatial data becomes available for developing areas, these models can help identify best practices as well.
  3. I would like to use these types of models to combine spatial data with data from schools to identify the best areas in which to build new schools. This problem is particularly important in urban areas, where planners must take into account not only social perception data (on who would attend the schools) but also spatial/zoning data and economic data about funding. Taking this model a step further and using three dimensions could identify urban functional areas for educational spaces.
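The sequential channel-then-spatial attention described above can be sketched roughly as below; this is a generic CBAM-style module under assumed dimensions, not the paper's exact architecture.

```python
# Minimal sketch (not the paper's exact architecture) of sequential
# channel-then-spatial attention applied to fused feature maps from the
# remote-sensing and social-perception branches.
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                              # x: (B, C, H, W) fused features
        # Channel attention: reweight each feature channel by global importance.
        ch_weights = torch.sigmoid(self.channel_mlp(x.mean(dim=(2, 3))))[:, :, None, None]
        x = x * ch_weights
        # Spatial attention: reweight each location using channel-wise statistics.
        stats = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial_conv(stats))

# Example: remote-sensing and social-perception feature maps concatenated along
# the channel axis before attention and a functional-area classifier head.
fused = torch.cat([torch.randn(4, 64, 32, 32), torch.randn(4, 64, 32, 32)], dim=1)
attended = ChannelSpatialAttention(channels=128)(fused)
```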
javad-e commented 2 years ago

Suel, E., Bhatt, S., Brauer, M., Flaxman, S., & Ezzati, M. (2021). Multimodal deep learning from satellite and street-level imagery for measuring income, overcrowding, and environmental deprivation in urban areas. Remote Sensing of Environment. https://doi.org/10.1016/j.rse.2021.112339

Suel et al. combine satellite images and street-view images to approximate several economic variables. Combining the two sources improves performance because each captures different information. For each tile in their map, the researchers have five images from two categories: four street-view images taken from different angles and one satellite image. The images are used to predict average income, population density, and environmental deprivation in London. Compared to the best unimodal alternatives, the multimodal model increases accuracy by 20, 10, and 9 percent for income, overcrowding, and living environment, respectively. The researchers also propose a U-Net architecture to predict results at higher resolution and to make use of sparse street-level imagery. The U-Net approach is novel and could be very useful in this context; however, it does not perform as well as the former architecture.
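A rough sketch of the five-images-per-tile setup is below; the shared street-view encoder, the averaging of the four views, and all dimensions are assumptions rather than the architecture of Suel et al.

```python
# Hedged sketch: four street-view images (shared encoder, averaged) fused with
# one satellite image per tile to regress an outcome such as average income.
# Encoder choice, view averaging, and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

def make_encoder():
    cnn = models.resnet18(weights=None)   # load pretrained weights in practice
    cnn.fc = nn.Identity()                # 512-d embeddings
    return cnn

class TileRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.street_encoder = make_encoder()     # shared across the four angles
        self.satellite_encoder = make_encoder()
        self.head = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, street_views, satellite):  # street_views: (B, 4, 3, H, W)
        b, n = street_views.shape[:2]
        street = self.street_encoder(street_views.flatten(0, 1)).view(b, n, -1).mean(1)
        sat = self.satellite_encoder(satellite)
        return self.head(torch.cat([street, sat], dim=1))   # predicted outcome per tile

pred = TileRegressor()(torch.randn(2, 4, 3, 224, 224), torch.randn(2, 3, 224, 224))
```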

Using deep learning to analyze satellite images is becoming more and more popular in urban studies. Compared to alternative methods of answering such research questions, this approach is less expensive, more accessible for developing regions, and better at identifying hidden patterns. Suel et al. combine satellite images with street-view images, which has many applications in the social sciences, but satellite images can also be combined with other types of data. For example, in our final project we combine satellite images with tabular economic data and points-of-interest network data.

The images used in this study are publicly available through Google. However, the analysis is conducted at a pixel resolution of 3 m². One could run a similar project using satellite and street-level images at a lower resolution. Moreover, instead of combining street images from four different angles, one could focus on covering only one side of the street.

thaophuongtran commented 2 years ago

Venugopalan, J., Tong, L., Hassanzadeh, H. R., & Wang, M. Multimodal deep learning models for early detection of Alzheimer’s disease stage. Scientific Reports. https://www.nature.com/articles/s41598-020-74399-w.pdf

1) briefly summarizes the article (e.g., as we do with the first “possibility” reading each week in the syllabus),

In this study, the authors use multi-modal Alzheimer’s disease data to advance Alzheimer’s disease stage prediction, using deep learning to combine imaging, EHR, and genomic SNP data for the classification of patients into control, MCI, and Alzheimer’s disease groups. The model uses stacked de-noising auto-encoders for the EHR and SNP data and novel 3D convolutional neural networks for the MRI imaging data. The networks are trained separately for each data modality and then combined using different classification layers, including decision trees, random forests, support vector machines, and k-nearest neighbors (a simplified sketch of this two-stage recipe appears at the end of this response). The data come from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database, particularly the ADNI37 dataset, which contains SNP (808 patients), MRI imaging (503 patients), and clinical and neurological test data (2,004 patients). To address the lack of well-defined methods for interpreting deep models, the authors developed novel perturbation- and clustering-based approaches for finding the top features contributing to a decision. They found that integrating multi-modality data outperformed single-modality models in terms of accuracy, precision, recall, and mean F1 scores. In addition, with their novel approach they identified the hippocampus and amygdala brain areas and the Rey Auditory Verbal Learning Test (RAVLT) as top distinguishing features.

2) suggests how its method could be used to extend social science analysis,

Their framework can be applied not only to the identification and analysis of other diseases but also to economic research, studying audio files, text content, and images to provide real-time, accurate predictions of output, productivity, and other measures. For example, satellite images and tabular data have been used to predict economic activity and gross output in the U.S. Beyond economics, this can be used in a plethora of applications, such as customer service, cyberbullying detection, and fraud detection.

3) describes what social data you would use to pilot such a use with enough detail that someone could move forward with implementation.

There is a dataset by Francesca Gasparini, Giulia Rizzi, Aurora Saibene, and Elisabetta Fersini accompanying their paper "Benchmark dataset of memes with text transcriptions for automatic detection of multi-modal misogynistic content", which includes 400 misogynistic memes and 400 non-misogynistic memes. The framework can be applied to this multimodal data (image and text) to automatically detect misogynistic content.
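A minimal sketch of the two-stage recipe summarized above, under simplifying assumptions: one denoising autoencoder per tabular modality, latent codes concatenated, and a conventional classifier on top (the 3D CNN imaging branch and the real training loop are omitted).

```python
# Hedged sketch of the two-stage setup: a denoising autoencoder learns a
# compact representation of one tabular modality (e.g., EHR or SNP features),
# and the latent codes from several modalities are concatenated and passed to
# a shallow classifier such as a random forest. Dimensions are assumptions.
import torch
import torch.nn as nn
from sklearn.ensemble import RandomForestClassifier

class DenoisingAE(nn.Module):
    def __init__(self, n_features, latent_dim=32, noise_std=0.1):
        super().__init__()
        self.noise_std = noise_std
        self.encoder = nn.Sequential(nn.Linear(n_features, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, n_features))

    def forward(self, x):
        corrupted = x + self.noise_std * torch.randn_like(x)   # denoising objective
        z = self.encoder(corrupted)
        return self.decoder(z), z                              # reconstruction, latent code

# After training one autoencoder per modality (reconstruction loss omitted here),
# fuse the latent codes and fit a conventional classifier on top.
ehr_latent = torch.randn(200, 32)      # stand-ins for learned latent features
snp_latent = torch.randn(200, 32)
labels = torch.randint(0, 3, (200,))   # control / MCI / AD
fused = torch.cat([ehr_latent, snp_latent], dim=1).numpy()
clf = RandomForestClassifier().fit(fused, labels.numpy())
```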

linhui1020 commented 2 years ago

https://yadda.icm.edu.pl/baztech/element/bwmeta1.element.baztech-eec12137-e005-4a86-ab1b-9fc0da6ebb79

1) briefly summarizes the article (e.g., as we do with the first “possibility” reading each week in the syllabus) This study employs digital doubles for control and distribution in power systems. A digital double is a mapping of a physical object onto a digitalized one that follows a development pattern similar to the real object’s. In the power system, the authors take the number of state variables in the digital double to be equal to the number of independent energy-storage elements in the system. Using the digital-twin system, the power system can reach the intended temperature with minimal cost and energy by inputting parameters such as the temperature rate and the maximum and minimum temperatures. Digital doubles make it possible to save energy costs and to design algorithms for managing renewable energy.

2) suggests how its method could be used to extend social science analysis Since digital doubles can simulate the operation of a power system, could they be used for crime analysis? For example, we could create a digital double of a community or neighborhood by inputting past community-level features (e.g., population, crime rate, and other variables) and updating it with real-time data through the Internet of Things, so that we could track the development of a neighborhood and police could target interventions.

3) describes what social data you would use to pilot such a use with enough detail that someone could move forward with implementation. I wonder whether we can employ digital doubles to replace traditional causal-inference methods (such as difference-in-differences). These methods require a comparable neighborhood or city, which we usually construct through synthetic control. If we had a digital double, would it be a perfect comparison group for such an investigation? In addition, I would recommend referring to previous studies of policy effectiveness for reproducibility and testing.

BaotongZh commented 2 years ago

RMDL: Random Multimodel Deep Learning for Classification Link: https://arxiv.org/abs/1805.01890

1) The continually increasing number of complex datasets each year necessitates ever-improving machine learning methods for robust and accurate categorization of these data. This paper introduces Random Multimodel Deep Learning (RMDL): a new ensemble deep learning approach for classification. Deep learning models have achieved state-of-the-art results across many domains. RMDL addresses the problem of finding the best deep learning structure and architecture while simultaneously improving robustness and accuracy through ensembles of deep learning architectures. RMDL can accept a variety of input data, including text, video, images, and symbols. The paper describes RMDL and reports test results for image and text data including MNIST, CIFAR-10, WOS, Reuters, IMDB, and 20newsgroup; these results show that RMDL produces consistently better performance than standard methods over a broad range of data types and classification problems.

2) The classification task is an important problem to address in machine learning, given the growing number and size of datasets that need sophisticated classification. This paper shows that deep learning methods can provide improvements for classification and that they provide flexibility to classify datasets by using a majority vote. The proposed approach can improve the accuracy and efficiency of models and can be used across a wide range of data types and applications.

3) This paper is extremely helpful for our project. We are now trying to combine image data with object-detection data to improve our model's performance. The object-detection data is tabular, including the number of houses in an area, the area of each house, and the length of that area, while the image data is just the RGB tensor. RMDL gives us good guidance for choosing deep learning structures and building ensembles of deep learning architectures.
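An illustrative sketch of the ensemble-and-vote idea is below; it randomizes only the width and depth of fully connected networks and uses synthetic data, whereas real RMDL also randomizes across DNN/CNN/RNN architectures.

```python
# Hedged sketch of the RMDL idea: train several randomly configured neural
# classifiers and combine their predictions by majority vote. Real RMDL also
# randomizes across DNN/CNN/RNN families; this toy version varies only the
# number and width of dense layers on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=40, n_informative=10, n_classes=3)
rng = np.random.default_rng(0)

ensemble = []
for _ in range(5):
    layers = tuple(int(w) for w in rng.integers(32, 256, size=int(rng.integers(1, 4))))
    ensemble.append(MLPClassifier(hidden_layer_sizes=layers, max_iter=300).fit(X, y))

votes = np.stack([m.predict(X) for m in ensemble])                 # (n_models, n_samples)
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("ensemble training accuracy:", (majority == y).mean())
```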

sabinahartnett commented 2 years ago

Rumor detection on social media using hierarchically aggregated feature via graph neural networks Link: https://link.springer.com/article/10.1007/s10489-022-03592-3

  1. This paper combines network and text analysis to detect rumor proliferation on social media sites. Its novelty lies in Hierarchically Aggregated Graph Neural Networks (HAGNN), a framework focused on capturing different granularities of high-level representations of text content and fusing them with the rumor propagation structure. By combining a semi-supervised Graph Convolutional Network (GCN) approach with Gated Graph Neural Networks (GGNN), their rumor-detection model can capture contextual word relationships in documents as well as inductively learn new words. In short, the paper proposes a dual-grained feature-aggregation graph neural network that operates over both GCN and GNN components (a bare-bones graph-convolution sketch follows this list).
  2. This task is especially important as the line between (the traditional) 'reputable' news sources and crowd-sourced content creation blurs. (to be clear, there is a lot of value and fact-based reporting that happens on social media sites but) these 'rumors' are often able to proliferate and branch (spur the creation of additional content) before they are detected and intervention is possible. This model has a significant increase in detection accuracy that is promising for the future of rumor detection.
  3. As I brought up in the previous point, I think early intervention is the most important potential application of these models. Although the accuracy of rumor detection is above the current baseline, I would be interested in the speed with which (i.e., at which stage of rumor creation) this model is able to recognize a rumor. Similar to a game of telephone, where the original post starts as a 'rumor', it would be interesting to track the model as the rumor is reshaped by various users.
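For concreteness, here is a bare-bones sketch (assumptions throughout: toy graph, random features, plain PyTorch rather than the paper's HAGNN) of the graph-convolution step such rumor detectors build on: post embeddings are smoothed over the propagation graph before a graph-level rumor/non-rumor readout.

```python
# Assumption-level sketch of a single graph-convolution layer over a rumor
# propagation tree: node features (post embeddings) are propagated over the
# normalized reply/retweet adjacency, then pooled for a graph-level label.
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # Symmetrically normalize the adjacency (with self-loops), then propagate.
        a_hat = adj + torch.eye(adj.size(0))
        d_inv_sqrt = torch.diag(a_hat.sum(1).pow(-0.5))
        return torch.relu(self.linear(d_inv_sqrt @ a_hat @ d_inv_sqrt @ x))

# Toy propagation graph: 6 posts, each with a 16-d text embedding.
x = torch.randn(6, 16)
adj = torch.zeros(6, 6)
adj[0, 1] = adj[1, 0] = adj[0, 2] = adj[2, 0] = adj[2, 3] = adj[3, 2] = 1.0
h = SimpleGCNLayer(16, 8)(x, adj)
rumor_logits = nn.Linear(8, 2)(h.mean(0))    # graph-level rumor vs. non-rumor readout
```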
zihe-yan commented 2 years ago

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748-8763). PMLR.

  1. This paper introduces CLIP, a model that learns from raw text paired with images. Its significance is in addressing the problem of limited supervision: before CLIP, most training data in computer vision came with limited, predetermined categories, which could bias the resulting models. The authors first perform a simple pre-training task of predicting which caption goes with which image, learning state-of-the-art representations from scratch. After pre-training, natural language is used to enable zero-shot transfer of the model to downstream tasks. The paper shows that CLIP is both more efficient and more flexible across datasets.

  2. It's an attractive idea that we are no longer bound to predetermined classification categories. The introduction of CLIP can make more sources on the Internet available for research purposes. For example, in facial recognition, the model could be applied to tasks such as celebrity identification and hopefully reach higher accuracy by drawing more data from the Internet.

  3. One great thing about the model is that it does not require a highly cleaned dataset and is efficient to implement. So if I wanted to implement a celebrity-identification task, I would need only two sets of raw data, both obtainable from the web: (1) people's descriptions of the celebrities (probably some related comments) and (2) photos of those celebrities.
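The zero-shot usage pattern looks roughly like the sketch below, using the released CLIP weights via Hugging Face; the image path and candidate text prompts are placeholders for the celebrity-identification setup described above.

```python
# Zero-shot classification with CLIP: score an image against candidate text
# prompts. The image path and prompts are placeholders; a celebrity-ID pilot
# would swap in candidate descriptions scraped from the web.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")                      # placeholder image path
candidate_texts = [
    "a photo of a politician",
    "a photo of an athlete",
    "a photo of a musician",
]

inputs = processor(text=candidate_texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)     # probability over the text labels
```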

yhchou0904 commented 2 years ago

Visual Dialog

The paper proposes a new artificial intelligence task: Visual Dialog. The task integrates several standard machine learning tasks, including understanding visual content, converting content from vision to language, text-based question answering, and conversational modeling and chatbots. The main task is to make an agent answer arbitrary questions about specific visual content, i.e., an image. The authors introduce a family of neural encoder-decoder models for this task, with three encoders (late fusion, hierarchical recurrent encoder, and memory network) and two decoders (generative and discriminative). In addition, the project proposes an evaluation protocol tailored to the visual dialog task for judging the answers a model gives. In sum, the paper presents the first visual chatbot and opens a new field for further work on visual intelligence.

This visual dialog task combines many types of traditional machine learning tasks, from image content and text learning to conversation generation. Although the social science research we are used to seeing generally does not involve such complicated tasks, and we usually conduct research with data we have already obtained and arranged into an accessible format, visual dialog offers a new possibility for collecting different kinds of data. Furthermore, it might allow us to bring more realistic situations into lab experiments and avoid oversimplifying real circumstances just for the feasibility of the experiment.

As described in the paper, data suited to this task combines visual content, dialog history, and questions related to the images. So, for example, it might be possible to use data from social media, where people often discuss the content of pictures.

ShiyangLai commented 2 years ago

A Review on Explainability in Multimodal Deep Neural Nets https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9391727

(1) This paper extensively reviews the literature to present a comprehensive survey of, and commentary on, explainability in multimodal deep neural nets, especially for vision-and-language tasks. It covers several topics on multimodal AI and its applications across generic domains, including the significance of the area, datasets, the fundamental building blocks of the methods and techniques, challenges, applications, and future trends. (2) The authors give a quite comprehensive review of multimodal applications. They also explain the possible options for multimodal data fusion in detail, which is the key component of multimodal learning. With these techniques, we are able to combine information in different formats for a single task and interpret the modalities independently. For example, we could try to identify racial biases learned by a multimodal model by interpreting the biased parts of the voice stream and the image stream separately. (3) To do so, we could feed YouTube videos to a multimodal classifier that predicts the popularity of the YouTuber. After training, we could use XAI techniques to reveal how the model makes its judgments and test whether its predictions depend on race, gender, voice features, accent, and so on. Since all of these biases are interwoven in the final biased output, it will be interesting to mine out the hidden mechanisms through which multimodal models learn bias.
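One simple starting point for the per-modality interpretation described in (2) and (3) is modality ablation; the sketch below is an assumption-level probe (the `model` taking image and audio tensors is hypothetical), not a method taken from the survey.

```python
# Assumption-level sketch: compare a trained multimodal classifier's output
# with each modality zeroed out, to see which stream drives a prediction.
# `model` is a hypothetical network taking (image, audio) tensors.
import torch

def modality_ablation(model, image, audio):
    model.eval()
    with torch.no_grad():
        full = model(image, audio).softmax(-1)
        no_image = model(torch.zeros_like(image), audio).softmax(-1)
        no_audio = model(image, torch.zeros_like(audio)).softmax(-1)
    return {
        "image_contribution": (full - no_image).abs().sum().item(),
        "audio_contribution": (full - no_audio).abs().sum().item(),
    }

# Large shifts when a modality is removed suggest the prediction leans on it;
# that is where an audit of race, gender, voice, or accent cues would start.
```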

min-tae1 commented 2 years ago

The paper I chose proposes classifying disaster tweets using state-of-the-art models for both text and image classification. It first aims to discover whether a tweet is informative and then classifies whether a response is required. The novel method involves feature extraction from the textual data and preprocessing of the image corpus; the researchers then train several classification models, predict their outputs, and compare their results. They argue that their proposed method outperforms other baseline models.

Multimodal deep learning methods can be employed in diverse areas such as the classification of movie genres. For social analysis, deep learning techniques incorporating both text and image could also be employed in communication studies. Newspaper articles typically involve both text and images, so a deep learning model that understands both forms of information could help analyze changes in the media. For instance, if we could successfully classify the ideologies of articles, such a method could be used to understand how newspapers change their ideologies. Furthermore, a newspaper's shift in attitude toward certain events could also be traced with a deep learning model that incorporates both text and image.

Reddit data could also be used to understand shifting ideologies within online communities. The Reddit Pushshift data contains posts that often include images, so understanding posts and classifying them by ideology would require a multimodal deep learning approach. It would also be interesting to see which posts played a major role in shifting ideologies. Posts that express different ideologies at the same time would be a further source of interest: if the text of a post shows a positive attitude toward QAnon while the image does not, it would be interesting to find out why.

yujing-syj commented 2 years ago

Gender Detection on Social Networks Using Ensemble Deep Learning https://link.springer.com/chapter/10.1007/978-3-030-63128-4_26#Abs1

  1. This paper aims to detect gender from users' posts and networks on Twitter. The problem with existing methods is that as the volume of social media has increased, the performance of traditional supervised classifiers has degraded. In this setting, Random Multimodel Deep Learning (RMDL) can be a very strong approach for classifying a user's gender from different feature spaces.

  2. The paper builds on Random Multimodel Deep Learning (RMDL) for text and document categorization, using two different feature-extraction schemes and an ensemble of deep learning algorithms to train the model. The advantage is that different kinds of deep learning algorithms can be combined for classification or prediction; here, the authors combine multiple Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs). The model is applied to tweets and predicts gender and language at the same time. It is very useful when studying text data and can also be applied to multiple data types at once.

  3. I would like to apply the model to the tabular data in our final project. We have one model on tabular economic data and another on the results of object detection, and simply merging the data is not appropriate. A better way would be to synthesize these two models using RMDL. This could be a possible solution for us.

y8script commented 2 years ago

Explain Images with Multimodal Recurrent Neural Networks https://arxiv.org/abs/1410.1090

  1. This model isn't exactly a multi-modal embedding approach; rather, it proposes an interesting integration of textual and image data for generating word descriptions based on the previous words and the image. The proposed multi-modal recurrent neural network combines a recurrent neural network for sentences with a convolutional neural network for images, and the two networks interact through a multimodal layer (a rough sketch follows this list). The model is effective at describing the contents of images with text, and it can also be used for sentence or image retrieval tasks.

  2. This paper gives an example of integrating sequential and non-sequential models in an effective way. This can be extended to other sequential-processing scenarios and to other language models that care about word order for text generation. For example, an automatic transcript generator could incorporate video frames to represent the context of a scene, which can be indicative of related words; using this information, the model can produce good transcripts even when the audio itself is ambiguous.

  3. For representing a speaker in different contexts, which is the topic of our final project, we could integrate a convolutional neural network for images with a sequential neural network that encodes document text. By providing images of the context, we can train language models that represent sentences differently based on the contextual information extracted from the images (for example, the background and the speakers' facial expressions).
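A rough sketch of the m-RNN idea from point 1, with assumed dimensions: at each step, the recurrent state over the previous words is combined with CNN image features in a multimodal layer that predicts the next word.

```python
# Hedged sketch of a multimodal RNN: the recurrent state over previous words
# is fused with precomputed CNN image features in a multimodal layer that
# predicts the next word. Dimensions and the GRU choice are assumptions.
import torch
import torch.nn as nn

class MultimodalRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, img_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.multimodal = nn.Linear(hidden_dim + img_dim, 256)   # fusion layer
        self.out = nn.Linear(256, vocab_size)

    def forward(self, word_ids, img_feat):            # img_feat: (B, 512) from a CNN
        states, _ = self.rnn(self.embed(word_ids))    # (B, T, hidden_dim)
        img = img_feat.unsqueeze(1).expand(-1, states.size(1), -1)
        fused = torch.tanh(self.multimodal(torch.cat([states, img], dim=-1)))
        return self.out(fused)                        # next-word logits at every step

logits = MultimodalRNN(vocab_size=5000)(torch.randint(0, 5000, (2, 12)), torch.randn(2, 512))
```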

Yaweili19 commented 2 years ago

Provided ViLBERT reading

The authors present ViLBERT (short for Vision-and-Language BERT), a task-independent model for learning joint representations of image content and natural language. The popular BERT architecture is extended to a multi-modal two-stream model, with separate streams processing visual and textual inputs that interact through co-attentional transformer layers. They pretrain the model using two proxy tasks on the large, automatically collected Conceptual Captions dataset, and then make only minor changes to the base architecture to transfer it to multiple established vision-and-language tasks: visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval. They see significant improvements over existing task-specific models, reaching state-of-the-art results on all four tasks. The work represents a shift away from learning vision-and-language groundings solely as part of a task.

Inspired by BERT, the authors develop analogous models and training tasks to learn representations of language and visual content from paired data, jointly representing static images and their descriptive text. ViLBERT consists of two parallel BERT-style streams operating over image regions and text segments. Each stream is a series of transformer blocks plus novel co-attentional transformer layers (Co-TRM) introduced to enable information exchange between the modalities (sketched below).
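The co-attention idea can be sketched, in simplified form with assumed dimensions, as two cross-attention operations in which each stream queries the other:

```python
# Simplified sketch of ViLBERT-style co-attention (dimensions assumed): image
# states attend over text states and vice versa, with residual connections.
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.img_attends_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_attends_img = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_states, text_states):
        img_out, _ = self.img_attends_text(img_states, text_states, text_states)
        text_out, _ = self.text_attends_img(text_states, img_states, img_states)
        return img_out + img_states, text_out + text_states   # residual connections

img = torch.randn(2, 36, 256)    # e.g., 36 detected image regions per example
txt = torch.randn(2, 20, 256)    # e.g., 20 wordpiece tokens per example
img_new, txt_new = CoAttentionBlock()(img, txt)
```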

The model clearly has broad generality and can be used in a variety of research settings, much like the BERT model it is based on. As long as natural language and images can be encoded in an appropriate form, it can be applied to the social science research we care about. Perhaps more worth considering is which research questions or databases would use both types of data, and whether combining the two would yield better results than studying them separately.

Hongkai040 commented 2 years ago

paper: UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection https://arxiv.org/pdf/2203.12745.pdf

In the paper, the authors focus on finding relevant moments and highlights in videos according to natural language queries. They propose a framework called Unified Multi-modal Transformers (UMT), capable of joint moment retrieval and highlight detection while also easily reducing to either individual problem. The paper's contribution is the first visual-audio learning scheme for either joint optimization or the individual moment-retrieval task, tackled with a novel query generator and query decoder.

The purposes of moment retrieval and highlight detection are quite obvious: they save time when we want to find a specific clip in a video. We can even see an embarrassing application of similar techniques on the iPhone: Moments! However, I think this technique has the potential to be repurposed for social science studies, since there are now innumerable videos on the Internet. Unlike text, where mature ways of querying and extracting specific parts of the data exist, extracting useful information from videos is more difficult. How can we quickly find the parts of a video we want to study, at scale? UMT may provide a new approach. We could use it to extract clips of, say, pedestrian behavior from street CCTV for psychology or sociology studies, or to analyze videos on popular media platforms.

I can think of two scenarios! One is that we could use this approach to study group behaviour by quickly identifying groups of people on streets in CCTV records. Again, I feel there is a high probability of people in New York running red lights (chuckle). How do they behave in response to the people around them? Since there are usually CCTV cameras at crossroads, we could use UMT to extract and analyze useful clips of those moments (the precondition, of course, is that we have access to those videos in the first place, which could be an issue). Another scenario is studying online social cultures. For example, we could use the technique to efficiently extract clips of the same famous person (perhaps filtering first by title) to study how they are perceived on social platforms.

Emily-fyeh commented 2 years ago

Kiela, D., Firooz, H., Mohan, A., Goswami, V., Singh, A., Ringshia, P., & Testuggine, D. (2020). The hateful memes challenge: Detecting hate speech in multimodal memes. Advances in Neural Information Processing Systems, 33, 2611-2624.

This work proposes a new challenge set for multimodal classification, focusing on detecting hate speech in memes. Difficult examples, called benign confounders, were added to the dataset so that it is hard to rely on unimodal signals and only multimodal models can succeed. “The [hateful speech meme] task requires subtle reasoning, yet is straightforward to evaluate as a binary classification problem.” The authors compare baseline performance numbers for unimodal models with those of multimodal models of varying sophistication, and they find that state-of-the-art methods perform poorly compared to humans.

The authors first provide their definition of hateful memes, excluding cases that attack individuals/famous people and terrorists. Second, they reconstructed the memes using Getty Images, replacing the original images without losing information and avoiding potential noise from optical character recognition. Third, they had the memes annotated by three humans according to the level of hatefulness. Fourth, benign confounders are the “contrastive” or “counterfactual” examples that “are a minimum replacement image or replacement text that flips the label for a given multimodal meme from hateful to non-hateful”. By comparing two image encoders, each combined with a textual encoder, against simple model baselines, the study finds that the multimodal models do better, and among them, the more advanced the fusion, the better the model performs.

Though a large gap with human annotators remains, I think the overall framework of the hateful-meme task can also be used to capture other nuances in text.

mdvadillo commented 2 years ago

TABERT: Pretraining for Joint Understanding of Textual and Tabular Data https://arxiv.org/pdf/2005.08314.pdf

TaBERT adds to previous pretrained natural-language models by allowing them to be trained on NL text together with tabular data, instead of free-form NL text alone. This expands the set of tasks we can perform on the data to include things such as semantic parsing over structured data. TaBERT is a pretrained model that learns representations of natural language sentences and tabular data, and it can be used to extract and represent features.

These models can be used to integrate other labels and categorical and numerical variables in the analysis of a text corpus. That can allow us to test on multiple labels, do prediction, and get a new understanding of the data presented.

This tool, and the analysis it allows, is similar to the one we are using for our final project. We have tabular data on presidential speeches, with many categorical variables that add descriptions/characteristics to each speech, and the speeches' transcripts themselves are stored in a column (variable) within the tabular data. I also tried applying the model for my homework this week and was able to train it on the corpus of transcripts plus another five categorical variables and one numerical variable. The model predicts the speaker based on their speech mannerisms and the additional categorical data.
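For anyone piloting a similar speaker-prediction setup without the TaBERT codebase, the general text-plus-tabular pattern can be sketched as below; this is not the TaBERT API, and the speeches, covariates, and labels shown are placeholder assumptions.

```python
# Hedged stand-in for the text-plus-tabular pattern (not the TaBERT API):
# embed the transcript column with a pretrained language model, concatenate
# the embedding with the tabular covariates, and fit a downstream classifier.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed_text(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state      # (B, T, 768)
    return hidden.mean(dim=1).numpy()                    # mean-pooled document vectors

speeches = ["My fellow Americans ...", "Four score and seven years ago ..."]  # placeholders
tabular = np.array([[1, 0, 1980], [0, 1, 1863]])          # e.g., party dummies, year
speaker_ids = np.array([0, 1])                             # placeholder labels

features = np.hstack([embed_text(speeches), tabular])
clf = LogisticRegression(max_iter=1000).fit(features, speaker_ids)
```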

ValAlvernUChic commented 2 years ago

DISARM: Detecting the Victims Targeted by Harmful Memes

This paper uses multimodal deep learning on memes to identify whether a meme is harmful and whether it was made with the intention of harming its target. To do this, the authors propose DISARM (Detecting vIctimS targeted by hArmful Memes), a framework that uses NER and person identification to detect all the entities a meme refers to. They run three test setups that "correspond to entities that are a) all seen while training, b) not seen as a harmful target on training, and c) not seen at all on training". The results suggest that DISARM is interpretable and generalizable and outperforms several other models at decreasing the relative error rate for harmful-target identification.

The results from this paper can be applied to social science questions about how memes could be used insidiously to facilitate cyberbullying. Taking this question further might involve trying to establish the range of hateful intentions. Beyond social science, this would be especially applicable to internet policy that aims to regulate harmful memes.

There are datasets of hateful memes available online. These could be measured against other exogenous or objective measures, such as the level of hateful comments on the same topic. This could let us see whether there are discrepancies between how a meme can be hateful and how text can be hateful.