Week 5 - Possible Readings

Thinking-with-Deep-Learning-Spring-2022 / Readings-Responses

You can post your reading responses in this repository.

0 stars 0 forks source link

Week 5 - Possible Readings #11

Open lkcao opened 2 years ago

lkcao commented 2 years ago

Post a link for a "possibility" reading of your own on the topic of Deep Learning with Text [for week 5], accompanied by a 300-400 word reflection that: 1) briefly summarizes the article (e.g., as we do with the first “possibility” reading each week in the syllabus), 2) suggests how its method could be used to extend social science analysis, 3) describes what social data you would use to pilot such a use with enough detail that someone could move forward with implementation.

borlasekn commented 2 years ago

Agüero-Torales, M.M., Abreu Salas, J.I., López-Herrera, A.G. (2021). Deep learning and multilingual sentiment analysis on social media data: An overview. Applied Soft Computing. 107.

LINK: https://www.sciencedirect.com/science/article/pii/S1568494621002969

(1) This paper reviews and addresses the shifts in using Multilingual Sentiment Analysis research on social media data by highlighting a focus in the linguistic ideas of code-switching and cross-lingualism. These ideas have previously been difficult to apply Sentiment Analysis models on because these models are built with the English language. Despite the shifts towards highlighting the more complex, or harder to model, forms and methods of multilingualism, research lacks the usage of complex architectures and Deep Learning approaches. This paper then attempts to show how Deep Learning approaches can be effectively applied to multi-lingual aspect based sentiment analysis. They present the idea of producing language-independent Sentiment Analysis models that are capable of handling data containing multiple languages or containing code switching between languages. (2) Because of this language-independent model possibility, it becomes possible to capture cultural displays of sentiment in various settings. This means that in a world where many languages are presented within diverse societies, the sentiment of larger groups of people, not limited by language, becomes possible. Their finding that there has been a lack of complex techniques used in creating these Multilingual Sentiment Analysis models suggests that there is more discovery possible, especially within Deep Learning that would allow people to learn even more from the multilingual data as they build these models than previously possible. They do present some work that has used CNNs (such as Word-Character CNNs), GANs, LSTMs, DANs, and BiLSTMs, but in the overall literature of Multilingual Sentiment Analysis, the applications of these methods is limited. These neural network models, if used more, could reveal more about the underlying structures of how language use impacts sentiment as well as presenting the overall sentiment of the text of interest. (3) One area of interest that could be addressed with Deep Learning through Multilingual Sentiment Analysis might be looking at social media posts connected to a particular law or bill that was being passed. One would be able to look beyond English and see other languages as resources rich, because these models can potentially capture more from the data than Multilingual Sentiment Analysis models used without Deep Learning.

ValAlvernUChic commented 2 years ago

Link: https://dl.acm.org/doi/pdf/10.1145/3321128

1)This paper extends prior studies on the computational linguistic analysis of Singlish - a low-resource creole based on English and amalgamated with multiple other languages including, Malay, Mandarin, Tamil etc. Singlish has historically been a difficult language to analyze on a computational scale due to textual unavailability which has made dependency parsing and POS methods largely ineffective on it. For example: English - "Why are we always going to coffeeshops for our dates?", Basic Singlish - "When we date we always eat at the coffeeshop (one), Advanced Singlish - "Dey (tamil), wo men (mandarin) paktor (cantonese) always makan(malay) at kopitiam (malay/hokkien) one (just an expression). This study expands on previously created treebanks and used neural stacking models to integrate English syntactic knowledge to improve POS tagging and dependency for Singlish, resulting in Singlish POS tagging and dependency parsing accuracies increasing to 91.16% and 85.57% respectively. 2) Methodological insights from the study could be extended to other low-resource creole languages for cultural analysis. In Singapore especially, greater capabilities in this could enable important discursive and socio-cultural analyses of community behaviour and broad rhetoric which has often been limited to qualitative interpretations. While this type of close reading analysis is important (I often prefer it), it limits researchers to a small corpus. Extended to other creole languages, it could enable similar cultural analysis within their language domains. While English is our first language, the use of it online is often interspersed with Singlish which makes traditional methods ineffective but this new insight especially pertinent. Multilingual BERT does have Singlish capabilities but studies using it have relied on this study's treebanks to optimize its dependency parsing and POS tagging accuracies. 3) One place I have been interested in is HardwareZone which used to be a forum to discuss cars but morphed into a 4chan-like community where racist, xenophobic, and misogynistic rhetoric live and thrive.

isaduan commented 2 years ago

TimeBERT: Enhancing Pre-Trained Language Representations with Temporal Information

Link: https://arxiv.org/pdf/2204.13032.pdf

1) This paper introduces TimeBERT, a novel language representation model trained on a temporal collection of news articles via two new pre-training tasks, which harness two distinct temporal signals to construct time-aware language representation. Time is an important aspect of text documents: for example, in temporal information retrieval, where the temporal information of queries or documents need to be identified for relevance estimation. Event-related tasks like event ordering, which aims to order events by their occurrence time, also need to determine the temporal information of events. The proposed TimeBERT consistently outperforms BERT and other existing pre-trained models, with substantial gains on different downstream NLP tasks or applications for which time is of importance.

2) Social world is all about time! Events happened at particular times. Actions are performed, consequences lingered, and social structures transformed, all through time. As social scientists we are especially interested in tracing the occurrences, the consequences, the changes! I can imagine two immediate downstream applications - Event Occurrence Time Estimation and Document Timestamp Estimation - can be useful in (1) extracting and generating time labels for data from texts; (2) exploring the developments of social events and identifying the relations between events.

3) I am interested in identifying how people remember and talk about the past. For example, I am working on a Chinese foreign policy corpus on China-Japan relations, what pasts with Japan do Chinese scholars emphasize, why? Is it the war, the normalization of the bilateral relations? When is the past fading away, the present pragmatic interests, e.g. in trades, dominate? TimeBERT can be used to detect temporal information in those documents. A key limitation here, however, is that TimeBERT performance is still not high (~35% accuracy for event occurrence) and may work worse for the Chinese languages.

yujing-syj commented 2 years ago

Developing a Twitter-based traffic event detection model using deep learning architectures https://doi.org/10.1016/j.eswa.2018.10.017

The paper is trying to build a deep learning model to detect the traffic incidents and monitor traffic conditions. Instead of using the traditional method, the bag-of-words, to numerical vectors, researchers use word-embedding model which also keeps the semantic relationship between words. After that, CNN and RNN are used on the top of word-embedding model to detect the traffic accidents, classifying the vectors into three categories: 1) non-traffic, 2) traffic incident, and 3) traffic information and condition. The results show that this new method increase the accuracy of the classification compared to state-of-the-art methods. With the geocode extracted by the model, the location information could be automatically sent to users and police to decrease the potential traffic issues.
This paper has very complete pipeline of classifying and extracting information from social media data, which is very popular used in social science study. Instead of using the existing traffic dictionary, the pros of twitter are that itself could provide a more suitable dictionary to detect the traffic related information. This could be a beneficial for me if I want to classify the topic with social media data. In this paper, the authors also combine the CNN and LSTM together to do the classification. However, the result isn’t always appealing compared to the simple CNN or LSTM model when they deployed the word2vec or FastText. For my future study applying deep learning, testing on different combination of the model is essential but that doesn’t mean the complex model is the best.
Since our group once considered about using the satellite image to detect the traffic accident, we can also use social media real time data to collect the real time information or comments about the road condition. This is a good way to combine the text in our model if we want to stick to this topic. The method could also be applied to report the real time crime cases. With deep learning, the most convenient application is that with the text data only, we could develop a complete system to alert to the society about some accidents or emergency that could facilitate the urban management.

javad-e commented 2 years ago

Link: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9342742

Sarcasm Detection in Newspaper Headlines The paper begins by saying “no one EVER uses sarcasm in online comments or headlines”. Most humans detect this to be a sarcastic statement. (disclosure: I had to read the next line of the paper to make sure the authors were not serious!) The goal of this project is to build a sarcasm detector for newspaper headlines. The researchers use a dataset of ~27000 headlines. The proposed approach includes an embedding matrix, a convolutional layer, a max-pooling layer, and a dropout layer. Finally, the outcome of the mentioned stages is passed to a bidirectional LSTM. This simple neural network design achieves an accuracy rate of 86 percent. I should mention that this paper is based on a school project and the analysis is not as deep as expected. Moreover, there have also been other approaches such as the ones proposed by Kumar et al. (2019) and Zhang et al. (2016). There had been other attempts in detecting sarcasm, however, unlike the mentioned examples, they usually relied on contextual data, too. Given the new headlines about AI misclassifications, I thought this is an interesting case to consider. Nuances, such as sarcasm in text, are sometimes even too difficult for humans to detect and misclassification of these nuances could have extremely serious consequences. I believe the success of such a simple model is promising for more independent AI in the future. Today, we are, rightly so, worried about issues such as AI’s failure in object detection or the risk of adversarial attacks. However, similar focused projects and high-quality data could resolve some of these concerns. The data used in this project is publicly available. Other researchers have implemented sarcasm detection algorithms using Twitter data. One could also think of similar analyses being conducted using data from advertisements, online news articles, and Reddit/Facebook posts. Examining comments on review platforms is another great source. Finding sarcastic reviews is perhaps both easier and more accurate because sarcastic statements would have a sentiment opposed to the associated rating.

yhchou0904 commented 2 years ago

Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora The method Labeled LDA is a revised version of LDA (Latent Dirichlet Allocation) that deals with the task of topic models. LDA model generates summaries of topics in terms of a discrete probability distribution over words for topics. Though LDA could help us realize the topic in a document, it could not perform well on multi-labeled corpora. At the same time, credit attribution occurs because lots of texts and pages tagged by users have multiple tags. To address this problem, L-LDA adds a constraint to the original LDA by defining the one-to-one correspondence between LDA’s latent topics and user tags, and L-LDA also combines the strength of Multinomial Naive Bayes. L-LDA could not only solve the credit attribution problem of multiply labeled documents but also improve the interpretability of model results. In experiments, L-LDA also outperforms SVMs in different multi-label classification tasks. The L-LDA could be used as LDA in topic model tasks, especially in multi-labeled documents. For example, when using LDA, since it is an unsupervised method, we could only analyze the result of clustering but sometimes are not able to fit the labels given by users. Also, we could use L-LDA to extract snippets of a document that could describe the contents from the perspective of a particular tag. With these snippets, we could further understand how the model learns the pattern from texts for different tags, which could improve the interpretation. To sum up, the L-LDA could not only help us to explore multi-label documents from different kinds of content but also increase the interpretability of the model, which is one of the most important aspects social scientists would like to focus on. L-LDA could be applied to various data resources, including news, web posts, overviews, or any other content with multiple labels.

zihe-yan commented 2 years ago

Cross-Lingual Word Embedding Refinement by L1 Norm Optimisation

Link to the article: https://arxiv.org/abs/2104.04916

Summary: This article presents a way of optimizing the cross-lingual word embeddings. Following the reasoning that the L2 norm has flaws, which is that the model is sensitive to outliers, this paper tries to propose a post-processing step to reduce the loss according to the L1 norm. As a post-processing step, this paper applied the tools to different baseline models such as MUSE, VECMAP, etc, and performed both supervised and unsupervised tasks. By conducting experiments cross-linguistically, models refined using this paper’s method proved to be generating a higher accuracy score.
Suggestion for social analysis: In a broader sense, such cross-linguistic word embedding can be used in a comparative social media analysis. I hear more and more people complain that they no longer feel happy when they read social media posts, regardless of their language. One reason is probably that social media today is actually becoming countless echo chambers. People argue with each other using the same term although this term is understood differently. Word embeddings are good at capturing the contextual elements in the sentence. Therefore, it will be useful to be applied in the analysis of social media texts. How can we compare different echo chambers? How is a political term given its meaning in this context, not in the other context? For example, when people are using the word democracy, do they apply it in the same context across languages?
Data: I’m particularly interested in how people are talking about politics online. I think one way to implement this is to have different language social media text data centering on specific political terms. For example, the word democracy may mean differently in the context of mainland China and the United States. Also, political terms such as election, liberalism, left-right may be drastically different in not only meaning but also sentiments in authoritarian and democratic countries

pranathiiyer commented 2 years ago

Link: https://link.springer.com/chapter/10.1007/978-3-319-76941-7_11

The paper uses deep learning based models and transfer learning to assess cyberbullying across three platforms--Formspring, Twitter, and Wikipedia. They performed experiments using large, diverse, manually annotated, and publicly available datasets for cyberbullying detection in social media.They try four models --CNN, LSTM, BLSTM, and BLSTM with attention, and three methods for initialising their embeddings random, GloVe, and SSWE. They then use transfer learning to see if the knowledge leaned by these models can be used to improve cyberbullying detection on other datasets. They find that these models coupled with transfer learning beat state of the art results for all three datasets.
I think this paper is itself a great example of how these computational models can be used in the socials sciences. This can be further incorporated in identifying fake news, deep fakes, and misinformation which can be closely linked with cyberbullying. Deep learning models of text analysis, especially those with bidirectional abilities can be extremely useful to identify contexts for nuanced information online such as news, and click bait information. I also believe this can be used to analyse some kind of algorithmic bias in online advertisements by using deep learning models.
I particularly wish to use deep learning models to analyse patterns and biases in advertisements-- both in newspapers and online. There is immense work that has been done around social media and advertisements in the space, but I do believe that newspaper advertisements can also be major avenues for bias and discrimination. While cyberbullying is one construct, I believe it can easily be extended to some of these other problems such as discrimination that we witness online.

BaotongZh commented 2 years ago

The Natural Language Decathlon: Multitask Learning as Question Answering

Link: https://arxiv.org/abs/1806.08730

1) This paper introduces the Natural Language Decathlon(decaNLP), a new benchmark for measuring the performance of NLP models across ten tasks that appear disparate until unified as question answering. It is a model that encourages a single model to simultaneously optimize for ten tasks: question answering, machine translation, document summarization, semantic parsing, sentiment analysis, natural language inference, semantic role labeling, relation extraction, goal-oriented dialogue, and pronoun resolution. The authors present a new multitask question answering network (MQAN) that jointly learns all tasks in decaNLP without any task-specific modules or parameters. Despite not having any task-specific modules, the authors trained MQAN on all decaNLP tasks jointly, and showed that anti-curriculum learning gave further improvements.

2) The decaNLP can be applied to test any NLP models we would train for text learning. Also, since the data and the data sources in social science are complex, a single system to perform different natural language tasks is crucial for transfer learning and continual learning. Therefore, decaNLP help the models related to the social science analysis perform more systematically in a single form, which means that a single model is able to deal with various task.

3) When we do our text analysis, we are not restricted ourself on a single task(i.e. sentiment analysis) but more tasks related to the text. By having a question-answering model, we could incorporate different NLP task into one model and benefits from it. In the future, we may create a model that performs well across all of decaNLP in multiple languages and build a truly multilingual model that would be able to capitalize on its understanding of many languages to perform tasks in multiple languages despite having only trained on that task in a single language. For example, we could train a single model on all the texts from the liner of a movie in different languages. Then, by using a question-answering strategy in decaNLP, we could test the accuracy of our model.

linhui1020 commented 2 years ago

Priyanga Gunarathne, Huaxia Rui, Abraham Seidmann (2021) Racial Bias in Customer Service: Evidence from Twitter. Information Systems Research 33(1):43-54. https://doi.org/10.1287/isre.2021.1058

This paper tries to investigate racial discrimination in terms of response rate difference from airline agencies on social media such as Twitter. Their research uses the deep learning model to predict the demographic information (e.g. gender and race) from their media posts. Most of the case, it is hard to get demographic information of social media users due to the privacy concern and restrictions of social media platform. The authors explicitly and fully practice a pipeline of training a deep learning model from employing AMT to do human annotations based on collected profile photos to uses word embedding model that transform social media posts into vector for model training. Human annotations is a very important and necessary step, as we just learned in class, for the improvement of prediction accuracy. The authors use CNN and Max-over-time Pooling for the architecture. They achieves 81.07% and more than 90% precision in predicting race and gender of users based on their historical social media posts.
The extension of the methodology of this paper could be used for any social science research online that requires the demographic information of users. For example, studying the information diffusion and community emergence online. Whether same gender online users would more easily to support with the opinions of each other, or racial-based linguistic analysis and etc. Beyond this, their process of training the model is not the ultimate goal of this research, instead, is a complementary but necessary step to get certain information of participants. When scholar is doing research and need specific information, more than gender and race, they could employ similar pipeline, especially for a large-scale study. In addition, how the author preprocess textual data (word embedding instead of one-hot encode) and select the activation function also give much insights in realizing the high prediction accuracy.
I am particular interested in implementing such methodology for Reddit data to identify the demographic information of users. Unlike twitter, Reddit does not include any useful personality data except users' post and comments. Can we build an unsupervised prediction model by clustering for example to predict the gender and race of users? In reddit, users may more willing to show their true opinions(e.g. their preference of politics, their attitude toward vaccination) without the pressure of being known about their personality. Human annotation may no longer required for this unsupervised model and it would be much more challenging. One possible thing might start to focus on some subreddit that the majority of participants are female or male. But this would still be a supervising mode...

Emily-fyeh commented 2 years ago

Do You Trust in Aspect-Based Sentiment Analysis? Testing and Explaining Model Behaviors The article introduces Aspect-Based Sentiment Analysis, whose aim is to classify the sentiments of a text concerning given aspects. The text being processed can be a few sentences or a full-length document, and the aspects could contain several words. The service provides an approximate explanation of its labeling, therefore, users are able to “infer the reliability of a prediction”, that is to interpret the results. The article describes the design of the package and the pipeline in detail, a review process is arranged between the prediction and post-processing step, and a method named “Professor” is designed to review the hidden states and output, in order to identify the correct/suspicious predictions. In the instances illustrated in the article and its Github page, the Aspect-Based Method is usually used to label the customer reviews. Since there could be many different reasons for reviewers to give a positive or negative comment, aspect-based sentiment analysis can focus on the given aspect to further break down the information in reviews. Apart from management study, this method can also be used in social media analysis, particularly when analyzing topics that contain many different subjects/entities. For example, social media post during elections. Since a media post, editorials, or op-eds may possibly contain positive and negative comments on different candidates or policy choices, pinning a certain candidate or policy would be extremely important to know how netizens’ opinions are presented. Actually, I am currently using this library for my tweet analysis on the Hong Kong Anti-Extradition Law Movement in 2019. I aim to track the distribution of pro/anti-governmental tweets through the process of the movement. By applying this method, I can clearly classify the entities and sentiments in tweets: The tweets having negative sentiments toward Hong Kong Police should be on the same side as having positive sentiments toward the protesters.

Yaweili19 commented 2 years ago

Deep Learning for Identification of Alcohol-Related Content on Social Media (Reddit and Twitter): Exploratory Analysis of Alcohol-Related Outcomes https://www.jmir.org/2021/9/e27314/

This article aims to identify alcohol-related content on social media (Twitter and Reddit) by thematic structures. The authors implemented a bidirectional encoder representation from transformers (BERT) neural network trained from Reddit posts from alchohol-related subreddits and from control subreddits. They then use this trained model to identify unlabelled tweets and to define alcohol-related hashtags. The authors then move on to use their model to predict alcohol-related outcomes from alcohol-related hashtaged tweets and to use geotagged tweets for spatial analysis, finding some significant results.

Obviously, this novel natural language processing approach could be easily extended to text from other sources. Its greatest contribution is that, we could apply a model trained from a labeled corpus (or ones being easy to label) into labelling documents from another corpus / source. If the two text source, or corpus, aren't much different in nature, we could expect our results to be mostly reliable, as demonstrated by the authors in their regression.

This study fine-tuned a BERT neural network as a binary classifier to predict Reddit post titles as belonging to either alcohol-related communities or a random subreddit. Next, they applied the Reddit-trained network to a smaller set of random, unlabeled Twitter posts to identify 24 hashtags that were significantly associated with alcohol content. However, the accuracy of this labelling remains rather undiscussed in that paper. The authors used a workaround of demonstrating the correlation of labelled alcohol tweets with drinking statistics. What we do with our data for model evaluation requires some creativity.

ShiyangLai commented 2 years ago

A Deep Learning Model for Detecting Mental Illness From User Content on Social Media LINK: https://www.nature.com/articles/s41598-020-68764-y

This paper deals with the problem of using observational data. The paper addresses challenges related to the problem before developing a practical method. First, the authors formalize the causal quantity of interest as the effect of a writer's intent and establish the assumptions necessary to identify this from observational data. They also propose an estimator for this setting and prove that its bias is bounded. Based on the result, they introduce TextCause— an algorithm to estimate the causal effects of linguistic properties. It leverages distant supervision to improve noisy proxies’ quality; and BERT, the pre-trained language model, to adjust for the text.
This method can be extremely helpful for researchers who are interested in estimating the causal effects of linguistic properties. For example, does writing a complaint politely lead to a faster response time? How much will a positive product review increase sales? More specifically, it provides a solution for observational research to express causal quantities in terms of linguistic properties people can observe. It also shows an example of using the method to calculate ATE of Amazon reviews on products' sales.
Personally, I am interested in understanding the intricate bias generation mechanism of online reviews, which requires a very rigorous causal inference framework. It is possible to identify biased reviews via the word embedding model (like described in this week's orient reading) or some other types of techniques. Then, I can investigate how bias within reviews can be influenced by a number of features at different levels (individual demographic features, social group features, and culture features). To do this, the method proposed in this research can be employed.

y8script commented 2 years ago

Target-Dependent Sentiment Classification With BERT

Link: https://ieeexplore.ieee.org/document/8864964

This paper introduces an extension to the classical Bidirectional Encoder Representations from Transformers (BERT) model that aims at improving the performance of target-dependent sentiment text classification. In contrast to sentence-level sentiment classification, target-dependent sentiment classification is a more precise task in terms that it examines the sentiment of a positioned target term within a sentence, assigning sentiment to possibly multiple objects in a sentence rather than determining the sentiment of the whole sentence. This allows the model to detect conflict sentiments and sentiments towards different objects in a sentence. By modifying the BERT architecture and directing the target terms (rather than a [CLS] label) to the fully connected layer, the models achieve state-of-art performance on several NLP classification tasks.
This extended model suggests the possibility for more precise sentiment extractions from the corpus. Instead of classifying the sentiment of the whole sentence, we can distinguish between multiple objects in a sentence. Even when a sentence conveys mixed feelings about many objects, the model may still capture these sentiments pretty well. For social science researchers, this means that instead of calculating the sentiment of each sentence that is relevant to a topic, we can directly analyze the sentiments assigned to each topic/term.
We can revisit some of the previous results of sentiment analysis that only classify sentences as positive, negative, or neutral. One example is to examine the 'conflict' social media posts on a topic (e.g., government reaction to COVID). We can examine the hypothesis that the conflict posts may reflect deeper thoughts on the issue compared to pure positive or negative posts. The prediction is that conflict posts have higher loads on deep or even academic aspects of the issue (extracted from LSA or similar approaches).

min-tae1 commented 2 years ago

https://doi.org/10.1016/j.chb.2018.12.029 Summary Sentiment analysis of the textual dialogue between individuals is gaining more importance as most online communications are conducted in such a form. This paper nonetheless offers a way to improve the accuracy of sentiment analysis with a deep learning approach that combines both semantic and sentiment-based representations. Employing semi-automated techniques to gather large-scale training data with diverse ways of expressing emotions to train the model led to a much-improved result compared to other methods. Application for Social Science Analysis Sentiment analysis would be well appreciated in understanding internet communities that openly display radical political motives. /qresearch/ in 8kun and other communities have become too big to ignore, as their political motives have proven to have real consequences in our society and democracy. It still is a mystery why people are participating in those movements. Sentiment analysis may offer an interpretation of the affective drive of the users. Another subject of interest would be the community’s reaction to attempts aiming to discourage those users. News outlets offer fact-checks to show claims by those communities are false. Politicians also criticize those claims, while also employing rhetorics that aim to discourage those users and bring harmony to a polarized society. So far, those attempts remain in vain, and we still have little understanding of why that is the case. Analyzing the sentiments of the users within the community may be helpful in understanding why those measures do not work out, by illuminating the immediate emotional reaction toward those attempts. Data Reddit pushshift has data on subreddits with radical political motives, such as r/theDonald. Also, I am gathering data on South Korean conspiracy communities that claimed that the election was rigged and that the Chinese Communist Party infiltrated South Korean Cyberspace to spread disinformation favorable to the CCP. Analyzing sentiments of those communities may be helpful in appreciating their emotional motives for participating in the community and how they react to measures that try to dismantle the community.

chentian418 commented 2 years ago

I still have two questions after skimming through Application of Deep Learning Approaches for Sentiment Analysis: 1) besides capturing contextual meaning of words, what are other advantages of using BERT over traditional word embedding models like Word2vec and GloVe for sentiment analysis? 2) Do you think it's necessary to implement sentiment analysis by building contexually sensitive dictionaries which captures words related to sentiments using the text representation models the book mentions?

Paper: https://www.nber.org/papers/w29344 Business News and Business Cycles

This paper proposes an approach to measuring the state of the economy via textual analysis of business news. From the full text of 800,000 Wall Street Journal articles for 1984–2017, the authors estimate a topic model that summarizes business news into interpretable topical themes and quantifies the proportion of news attention allocated to each theme over time.
This paper has significant economic meaning: it characterize the topical structure in business news. Specifically, WSJ news decomposes into easily interpretable topics with intuitive time series patterns. We could use the text of business news to summarize wide ranging facets of the state of the economy.
Based on the existing research, I want to extend the research to extract the median factor out of not only economic news underlying WSJ text, but also in broader sense of news, such as new coming from Dow Jones Newswires. In this way, we could examine the robustness of the macro signals to the business news.

Hongkai040 commented 2 years ago

https://idp.springer.com/authorize/casa?redirect_uri=https://link.springer.com/article/10.1007/s12559-021-09881-2&casa_token=iHxcjAwx2XoAAAAA:iOrfUBwxVxBwUx9rZIdpf4NsQhtHqGEInglp6nKNfuA0rq-_hL0tdbqUHwPo2jjDKfMzd8tVvSUMjGSB

This paper not only analyzed the gender bias induced by BERT in down steam tasks but also proposed solutions to reduce gender bias. For the investigating gender bias part, they trained very simple multi level perceptron-regressors to exploit BERT embeddings. They trained five regressors in total for five different downstream tasks and observed that the regressors consistently assign high sentiment intensity scores to either of the genders. Then they proposed an algorithm to find the sentiment passing direction and remove those gender-rich features from the BERT model.

The algorithm could be used to create a less biased BERT Model for many social research questions have to do with gender. It can reduce the biases introduced by the BERT model. What’s more, this algorithm could also be adapted and used to study other types of social biases like race, ethnicity, religion, etc.

I would probably use this method to explore the sentiments toward movie characters using movie comments. I can use this method and BERT model to study the difference of sentiment toward male and female characters. Since this method could help me control the bias introduced by BERT, I can better estimate biases from commenters. In my scenario, I will first use other methods to identify whether a comment is related to the male characters in the movie or female characters. So comments are divided into two categories. We can manually label some comments sentiment polarity or use the rating information to get sentiment labels. Since the training set is noisy, human checkers are needed to test the validity of the model. Then, I can use this method to create less biased model to get the sentient polarity labels for those comments then analyze the sentiment disparity with those gender-related comments.

thaophuongtran commented 2 years ago

Document Processing: Methods for Semantic Text Similarity Analysis Abdul Wahab Qurashi; Violeta Holmes; Anju P. Johnson Publisher: IEEE https://ieeexplore.ieee.org/abstract/document/9194665

Summary: The paper applied different techniques for semantic text similarity measurements in documents used for safety-critical systems, particularly in the documents on railway safety. Given the unstructured nature of the documents, there is an extensive preprocess and cleaning stage of the data. Jaccard and Cosine similarity metrics were utilized in this natural language processing techniques.

Application for Social Science Analysis: This paper has significant application in social science analysis as it introduced an automated systems for document analysis. While the traditional approach such as one-hot-vectors are not suitable for extensive dictionary of words, distributed representation or word embedding, such as word2vec and Skip-gram, applies a distributional hypothesis where similar words are likely to appear in the same places, which help evaluate text similarity between documents. The application is far beyond documents on railway safety and can be used for documents comparison within and across many other fields: economics, history, psychology, for better search engine, patent application, and authenticity check.

Data: The U.S. Patent and Trademark Office (USPTO) has one of the largest repositories of scientific and commercial information through their open data portal. During the patent search and examination process, it is critical to determine if an invention has been described before. One way to assist this objective is to measure the semantic similarity between phrases. It would interesting to apply this frame and develop for the USPTO patents data.

hsinkengling commented 2 years ago

TopicNet: Semantic Graph-Guided Topic Discovery

https://proceedings.neurips.cc/paper/2021/file/0537fb40a68c18da59a35c2bfe1ca554-Paper.pdf

The article introduces TopicNet, a variation of hierarchical topic modeling that incorporates prior knowledge about semantic hierarchy into the models. TopicNet does this by representing each topic as an embedding, and projecting the topics into a shared embedding space. Neural layers constrain topics onto predefined concepts. The authors tested TopicNet with common benchmark datasets and found that TopicNet performed near most state-of-the-art models. It is also able to capture, as intended, topic words based on predefined topic hierarchies.
I have encountered many incidences, in my own research and others, where a purely unsupervised topic modeling just wasn't able to cluster the documents according to the research question. This could help researchers create classification labels for entities within documents (eg. identifying key political actors) or force important differences on messy data (eg. different conceptions of sexuality on lyrics data).
I think what's exciting about this approach is its potential to fit complex theoretical constructs onto empirical data. For example, philosophers might define logic as having a structure of "premise", "assumption" and "conclusion". Communication scholars also like to define news as having 5W1H or a reverse pyramid structure covering a list of information in descending order of important. For each of these cases, given transcripts of structured debates or news articles, one could fit these theoretical structures onto the data using TopicNet and facilitate coding and discovery of patterns within these data.

mdvadillo commented 2 years ago

Bootstrapping for Text Learning https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.29.2001&rep=rep1&type=pdf

this paper proposes using bootstrapping to label data that will later be used in text learning models. The idea is to first initialize the labeling process by hand-picking a small subset of keywords, and applying this keywords to the unlabeled data (resulting in labels). Then the information from the labeled examples is used to retrain the model, and then use the retrained model to relabel data. The accuracy of the results approaches the accuracy obtained if a human had labeled the data.
this technique can be applied to label data that has been scraped and does not contain a clear label. It would shorten the amount of time it takes to label the whole dataset, while allowing researches to manually label only a small subset of the data.
Labeling news articles: suppose you wanted to label news articles based on their content and get a keyword describing its general topic and also a label of political leaning for the article. This could be useful if you were trying to study how biased news articles can be depending on topics, study whether they incorporate the other side's point of view, etc. We can implement the labeling technique described in the paper by first collecting the set of news articles that would form our corpus, then manually labeling a small subset of the articles and picking the keywords we find most valuable, train a labeling model on the labeled data, and then utilize the bootstrap procedure to iterate over the unlabeled articles in the corpus and label them.