FUB-HCC / seminar_critical-social-media-analysis


Assignments for session 6 #23

Open simonsimson opened 3 years ago

simonsimson commented 3 years ago

1 Reading assignment

2 Cluster Analysis

Submit on GitHub (reply to this issue) by 9 Dec, 12:00 (noon)

satorus commented 3 years ago

1 Reading Assignment While reading the paper I was really surprised by the similarities in the findings of the grounded theory and topic modelling research. I expected at least some topics to differ decisively between the two, or to be found by only one of the methods. Instead, each topic had a match in the other method, showing that topic modeling is in fact a valid way to analyze this kind of textual data, even though it omits the human factor, cannot rely on knowledge about the background of a topic, and cannot identify nuances like irony. I agree that neither method can be a complete replacement for the other, and one should probably think about using topic modeling and grounded theory in parallel to get a more complete picture. Still, topic modeling can be a good starting point for an initial view and categorization of the data, especially for large datasets, as it takes far less time than doing grounded theory manually.

2 Cluster analysis My first cluster consists of comments mainly about Naomi Seibt as a person, telling how beautiful and smart she is and why she is a great person everyone should listen to. These comments mostly disregard the topic of climate change at hand and just go on about the person, highlighting the personality cult present in the climate change debates and how many people do not really care about the "content" but just worship a person and their opinions. The second cluster actually consists of comments having a real discussion about fossil fuels, why they have run their course, and about alternatives. Although there are also comments defending these kinds of fuels, many people tell Naomi why she is wrong and argue about new technologies that can replace fossil fuels and help dampen the effects of the very real climate change at hand. This shows that even under very one-sided videos, meaningful discussion can happen in the comments, with both sides present. Seeing the results of the clustering, I do think a purely quantitative approach could work, as the results are very good and contain clusters (topics) I was not able to identify by hand, mainly because of the size of the dataset and the sorting done by YouTube.

Link to Notebook

ChristyLau commented 3 years ago

Worked with @yaozheng600

1 Reading Assignment

This article studies the process and the results of two methods for analyzing textual data. On both counts, it shows a surprising similarity between the two methods, one from interpretive social science and one from statistical machine learning. Still, each method has its own strengths that cannot be replaced, and I agree with the idea of using the two as complements to each other.

As the paper also mentions, the machine learning method can deal with massive datasets, which is time-consuming with existing researcher practices. On the other hand, there are many aspects and human intuitions that the computer cannot detect by itself. From my experience on a content-classification project (e.g. the language of each chapter, the page numbers of tables, maps, figures, etc.), we used both machine learning and manual work to improve the classification accuracy, alternating between the two three times. The intervention of the researchers helps control the direction of improvement, and they can handle the new aspects surfaced by the computer from a more professional angle. At the same time, machine learning saves researchers much time and offers a lot of brand-new ideas on the topic.

Another point discussed in the paper is that both the computer and the human may ignore or mistake some topics. This problem might be partly fixed by alternating between the two methods, but that idea still needs to be proven. In short, neither computers nor professional scientists can replace the other; finding a way for both to work in harmony is the path to a better future.

2 Clustering Assignment

I chose labels 5 and 6 of 10 clusters, covering 185 comments in total. The clustering result looks good: the boundaries between clusters are clear and there are almost no outliers. In label 5, almost every comment contains the same keywords, like 'airport' and 'Berlin', which fits the topic of "the airport comparison between Berlin and Beijing". In label 6, the keywords are less obvious, but we can still identify the attitude of the comments: all of them express anger at the video's speaker.
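This kind of per-label inspection can be sketched in a few lines of pandas (a minimal sketch; `comments` and `labels` are toy stand-ins for the notebook's real comment texts and cluster assignments):

```python
import pandas as pd

# Toy stand-ins: in the notebook, `comments` would be the YouTube comment
# texts and `labels` the cluster assignments from the k-medoids fit.
comments = ["Berlin airport is a joke", "Beijing built one in four years",
            "How can she even say that", "This speaker makes me so angry"]
labels = [5, 5, 6, 6]

df = pd.DataFrame({"comment": comments, "label": labels})

# Cluster sizes at a glance
print(df["label"].value_counts().sort_index())

# Read everything that landed in one cluster, e.g. label 5
for text in df.loc[df["label"] == 5, "comment"]:
    print("-", text)
```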

For the second question: I would suggest a purely quantitative approach to optimizing the clustering pipeline, especially when the dataset is huge. In big-data analysis it lets us quickly find patterns in the data and draw conclusions, which cannot be done with traditional methods. But if the dataset is too small, I would rather analyze it manually, because then every outlier affects the accuracy of the result and renders it meaningless.

link to our notebook

isaschm commented 3 years ago

Reading Assignment: "Neither topic modeling nor grounded theory are applied purely by rote" (p. 1398). Both methods need a degree of understanding of the underlying principles and techniques, and the results extracted with either depend on the level of experience of the practitioner. That no two social scientists would extract the same topics, or even the same number of topics, from a dataset seems quite intuitive. I am beginning to see how much the results of ML research also depend on levels of experience: maybe one person was exposed to a new algorithm through a colleague or a conference; maybe another has a deeper understanding of linear algebra and statistics. I liked how the authors of the paper took this understanding and built their methodology around it. Instead of making claims about abstract analytical techniques, they also applied the expectations people have of grounded theory to a statistical model.

Clustering: [screenshot: cluster visualization, 2020-12-06]

I chose the clusters assigned labels '14' and '15' when setting the number of clusters to 20. I chose them because 15 (purple in the picture) sits slightly apart from the main "body" and is adjacent only to cluster 14 (yellow). Comments in cluster 15 are mainly concerned with 'rationality', as in which belief or belief system is more "rational". Within that topic there is not one clear political view or belief system being questioned: the cluster medoid is a comment critical of global corporations, whereas the rest question veganism, climate change research, etc. Cluster 14, in contrast, is concerned with 'reality' and discussing what is 'real', which is presumably where the adjacency stems from. My experiences in the second part of the assignment, as well as the reading assignment, suggest that it is not optimal to only optimize quantitatively. I would optimize in a way that benefits the question I am trying to answer; I wouldn't know what a quantitatively 'optimal' result would be.
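Inspecting a cluster via its medoid, as done here, can be sketched as follows (a minimal sketch assuming scikit-learn-extra's `KMedoids`; the toy corpus and the TF-IDF step are illustrative assumptions, not necessarily the notebook's exact pipeline):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn_extra.cluster import KMedoids  # pip install scikit-learn-extra

comments = ["corporations are never rational", "is veganism rational at all",
            "climate research is not rational", "what is real anymore",
            "nothing in this video is real"]

X = TfidfVectorizer().fit_transform(comments).toarray()
km = KMedoids(n_clusters=2, random_state=0).fit(X)

# A medoid is an actual comment, so each cluster can be skimmed via its medoid
for label, idx in enumerate(km.medoid_indices_):
    print(f"cluster {label} medoid: {comments[idx]}")
```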

Link to Notebook

SabinaP17 commented 3 years ago

1. Reading assignment

I believe that Baumer et al.'s study (2017) offers a good overview of the main characteristics of grounded theory and (statistical) topic modelling, as well as a thorough comparison between these two approaches to data analysis. Before reading this study I was not very familiar with them, so I found the whole article interesting and informative. What surprised me most about the results was that both grounded theory and topic modelling managed to capture similar findings, although they are based on different approaches (one qualitative, the other quantitative). While each method has its advantages and disadvantages, used together in a mixed approach they could provide a faster and more thorough analysis, in which the researchers' subjective qualitative work is backed up by a second, more objective, data-driven way of observing the data. I therefore support Baumer et al.'s argument in the conclusion that this study could provide an argument for the future use of mixed-methods approaches when analyzing all types of data.

2. Cluster Analysis I worked together with @xixuanzh.

iraari commented 3 years ago

1. Reading assignment Grounded theory has been critiqued because some findings were said to be based more on researchers' political/social agenda than on empirical data. I think the same applies to quantitative methods. Their results also depend on 1) what part of the data is taken into account, 2) how it was collected and processed, and 3) what theory the data is supposed to confirm or refute. Besides the fact that different methods (e.g. LDA, LSA, NMF) give slightly different results, which also depend on the number of topics and other hyperparameters, I would add as a drawback of topic modeling the lack of generally accepted methods for evaluating the results (e.g. perplexity, topic coherence). As far as I know, there is currently no way to estimate how accurately the topics reflect the meanings of the texts. This concern is confirmed at the end of the article, which concludes that in this study topic models identify patterns that, at some level, align with those found by human researchers. But it should be asked: how well could we understand, interpret, and trust the results of the statistical model if it were not for the parallel research conducted by humans?

2. Cluster Analysis According to the silhouette value, the optimal number of clusters is 14; the smallest cluster has 52 comments and the largest 132. These sizes allowed me to quickly read the comments and check whether the detected topics correspond to reality. The "print medoids" feature worked very well as a filter. I was able to find clusters that are not related to the topic of climate change, such as comments about the host and guest or junk comments about nothing; it would be easy to exclude them from the more meaningful comments if I wanted to. However, when it comes to more complex topics, it is not that easy to "guess" by what criterion the comments fell into a particular cluster. For example, in the eighth cluster, the concept of the scientific community is presented as an argument in various issues: it is described according to commentators' political views; it is accused of bias and conspiracy; and, in addition, the guest is discussed and to what extent she is a "real scientist". Therefore, I think the quantitative approach cannot be regarded as the only one needed, and comments should be analyzed by humans as well (at least at the current stage of technology development).
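A sweep like the one behind this silhouette value can be sketched as follows (a minimal sketch assuming scikit-learn-extra's `KMedoids` and TF-IDF features; the notebook's actual embeddings and range of k are assumptions here):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids  # pip install scikit-learn-extra

# Toy corpus; the notebook would use the real comment embeddings instead.
comments = ["the host is great", "love this host", "the guest is so rude",
            "what a rude guest", "CO2 is rising fast", "CO2 levels keep rising"]
X = TfidfVectorizer().fit_transform(comments).toarray()

# Fit k-medoids for each candidate k and keep the average silhouette
scores = {}
for k in range(2, 5):
    labels = KMedoids(n_clusters=k, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, "-> best k:", best_k)
```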

3. Notebook URL

raupy commented 3 years ago

Reading Big data makes it very necessary to combine quantitative methods with machine learning techniques. However, we cannot rely on algorithms 100% and will probably always (or at least for a long time to come) need a certain degree of human supervision. Therefore, I agree that neither method can be a complete replacement for the other and that they should complement one another. It is like any topic in the real world: every extreme can have negative consequences, so one should try not to be too stubborn, consider other ideas and opinions, and then combine them all to get the best result.

Analysis

[screenshot: interesting clusters]

I set the number of clusters to 20 and then chose to look more closely at clusters 2 and 4 (2 and 1 on the map), because they look like cute little islands where you would want to go on holiday. Island 2 I named 'CO2Discussion', as the people there are discussing CO2 emissions and whether climate change is human-made or not. Island 4 is called 'thumbsUpAndHappy', because the people there mostly communicate via short messages with lots of thumbs-up and other emojis to express their feelings about the video and the content of the speech. I was also interested in the other clusters on the "mainland", because they are all very near each other and I was wondering what was discussed there. I chose to look at cluster 17 and named it 'factsOrNot?', because some people say that human-made climate change is a fact and that Espendiller is lying, while others say it is a fact that it is not human-made and just hysteria. One comment I found particularly interesting and very representative of this cluster; the commenter is very pleased with the speech and the AfD politician: "Fakten gegen Ideologie! Endlich kehrt die Vernunft zurück in den Bundestag, GOTT SEI DANK!!!!" ("Facts against ideology! Finally reason returns to the Bundestag, THANK GOD!!!!") What are facts, and who is following an ideology here? ;)

All in all, I think the quality of the clusters is pretty good; at least for most of the comments I can understand why they are similar and could belong to the same cluster. The question 'Would you suggest a purely quantitative approach to optimizing the clustering pipeline? Why or why not?' is hard for me to answer. I don't know. A purely quantitative approach is faster because it is fully automatable, but the results may be less accurate, so it probably depends on your objective.

Notebook

mrtobie commented 3 years ago

1. Reading

The article gives a brief overview of the opportunities that arise when machine learning and grounded theory methods are used together. I liked the idea of a hybrid solution that combines the strengths of both methods: on the one hand, machine learning can handle huge sets of data, and on the other, a grounded theory method can make the analysis (or the results to be analyzed) easier to understand. While I do believe the authors have a point in looking for similarities (which was the main topic for most of the paper), it would be even more interesting to look deeper into where these methods differ from each other. Clearly, both methods have proven their strengths and weaknesses in research. I would find it very interesting to see where one method can "help" with the weaknesses of the other; a deeper analysis of the differences between the two would therefore be useful.

2. Cluster Analysis

When optimizing the number of clusters for k-medoids, the numbers suggested that a value of 2 would be ideal. Since that does not make handling the data any easier, this peak was discarded; nevertheless, it could be an indication that most comments could be assigned to one topic (and therefore one cluster). The next peak was at 63 clusters, which I chose for further analysis.

The two clusters I took a deeper look into were 22 and 35. I chose cluster 22 because its points on the plot were very close together, so I thought it might be a good sample to look at. But looking into the data, I saw that under the label "Same" there were many comments that just contained links, or other one-word comments. There was no general topic to be seen at all.

The second cluster, number 35, was chosen because of its label, which looked quite promising ("Wait what happens in 2031?"). I thought I might get a more content-related cluster, since the label has a clear topic. This time I was disappointed with the result: although there were several comments that clearly belong together ("[…] I don't even know what NRPing is" and "[…] See what I mean about NRPing. You are always NRPing away aren't you."), most of the comments were way off ("Well I like bacon, checkmate vegan") or just other questions ("[..] how are you Denying evidence?").

Based on these experiences, I can't tell whether a purely quantitative pipeline would be a good choice. For my data it clearly failed, but when it works, it can help to identify interesting topics. It would therefore be good to start with that sort of pipeline to see whether it works for the specific data. If it does, why not use it? If not, that should be easy to tell, and the pipeline can be modified.

Notebook

budmil commented 3 years ago
1. While reading the text, at some point I got an idea of what the conclusion would be, and it turned out to be exactly that: both methods should be combined. We cannot rely on algorithms 100%, even if they improve massively in the near future (for example, learning to recognize satire or be aware of broader context). And even though computational analysis helps a lot in research, human supervision of the process is absolutely necessary: having a researcher dive deeper into the data helps him/her better understand the challenge being faced. What I would also include in the research practice, besides checking the top answers for a certain topic, is 'having a look' at a random sample of the data as well. In my opinion, being more familiar with the research insights can only be useful.

2. I chose clusters 79 and 7. I picked those two because they stand out in a way (see images below). Cluster 79 is apparently full of comments mocking one comment claiming that they are spraying viagra via chemtrails. Cluster 7 is some web links and some one-word comments. I found it interesting how isolated they looked and wanted to see what they were about. Generally, in my opinion, quantitative analysis has to be accompanied by qualitative analysis; as we saw in the article from part 1, the two should be combined. Computational calculation helps a lot and manages jobs that people could not do, while the human approach gives the final touch to a research project.

[cluster visualization screenshots]

And here is the NOTEBOOK.

adrigru commented 3 years ago

1 Reading assignment In times of big data, it is necessary to incorporate quantitative methods with the assistance of machine learning techniques. However, as good as the statistical approaches are, I think there is still a need for a human to control the process; therefore, I believe a mix of both methods would bring the best results. Only the combination of human abstract thinking and high-level understanding with the high scalability of machine learning techniques can offer the best of both worlds, so I'd be interested in methods based on such a mixed approach. In practice, especially on social media platforms, algorithms process lots of data to make sure content does not violate the guidelines; nevertheless, many humans are still involved in judging complex cases where it is not clear whether the content is allowed.

2 Cluster analysis In my first cluster, it is difficult to identify the main topic of the discussion. The tone of the comments varies between blaming the Chinese or Indian economies and denying human contribution to the rapid growth of greenhouse gases; I couldn't say there is much similarity between the individual comments besides poor English grammar. The second cluster, on the other hand, contains comments related to providing evidence. Within this cluster, users demand proof for the loose claims of, presumably, global-warming sceptics. Many point out that the arguments used in the discussion are not valid or proven; others post sources to support their reasoning, though these sources are mostly not scientific.

In general, the quantitative approach assists in finding patterns within the dataset. In our case, it helps to see what the main topics within the discussion are and how popular they are. On the other hand, a purely quantitative approach is not the ideal solution, since the actual semantics of the comments are difficult for the machine to comprehend.

Notebook

Moritzw commented 3 years ago

Share one personal insight in a commentary of 150 words as a reply to this issue (e.g. an aspect you found interesting, a point you disagree with, a perspective you want to explore further)

The article briefly described grounded theory and statistical topic modeling, applied both methods to the same dataset, and compared the results. Surprisingly, both methods captured similar topics, differing only slightly. The methods have different strengths: grounded theory is more researcher-dependent and therefore more time-intensive, but it handles data containing sarcasm or irony better. Statistical topic modeling, on the other hand, can work with vastly larger datasets in less time, since much of the work is done by computer algorithms. In the end, the researchers propose a mixed approach to analyzing data. I agree with this conclusion, though it does not solve the problem that with either method some topics can be ignored or mistaken, and that different teams of researchers using the same data and the same methods can reach different conclusions in their analysis.

What is the content of the clusters? What is the quality of the clusters?

I chose to look into clusters 7 and 0, since both were set apart from the main body, and cluster 0 was additionally tightly grouped. Cluster 7 mainly discusses antifa and whether they started the fires. A lot of discussion happens there: some claim to have evidence in the form of videos or pictures, while others claim the opposite. The main topic of the video, that the fires may be the result of climate change, is not present in these comments. Cluster 0 contains comments regarding Joe Biden. Most don't just express an opinion on Biden but also comment on the American election and whether one should vote for him; only a few comment on Biden's appearance and commentary in the video.

Would you suggest a purely quantitative approach to optimizing the clustering pipeline? Why or why not?

I would suggest a mainly quantitative approach when analyzing large datasets. The algorithm seems to work fine, and in large datasets topics are easier to identify if pre-sorted with such a quantitative method. But the algorithm's output should at least be spot-checked, especially for clusters that are intermixed, to get better results.

Notepad File

Alioio commented 3 years ago

1 Reading assignment

The paper gives an introduction to the research methods of grounded theory and statistical topic modelling, and compares the two approaches by applying both to the same dataset. It shows that applying both approaches yields similar topics, with slight differences, and it lays out the strengths and weaknesses of each. While grounded theory is more time-consuming, as it requires much more qualitative work, it has advantages where human intuition is required (e.g. detecting irony or sarcasm).

Quantitative data analysis with machine learning methods, on the other hand, saves time when a huge amount of data needs to be analyzed. By making use of computational calculation a lot of time can be saved, and human intuition can then be brought into the research with qualitative methods.

2. Cluster Analysis

I decided to take a look at clusters 1 and 4. Cluster 4 I found interesting because it was very isolated from the other clusters in the visualization. Looking at its content, I saw that this cluster mainly contains comments pointing to the link that explains the visualization the politician is talking about. This surprised me, because in my very first manual analysis of the comments I had already singled these out as one distinct type of comment.

Cluster 1 contained comments in which users talked about how stupid the audience (the other parties) is and how they are not capable of understanding the visualization. I was very surprised that the clustering algorithm seems capable of grouping comments with similar context together. But going through the comments in other clusters, I would not say that all clusters are so clearly sorted by topic. Nevertheless, I would consider the output of this model a good basis for further manual analysis.

[cluster visualization screenshot]

NOTEBOOK.

adrianapintod commented 3 years ago

1. Reading assignment

The article describes two different ways to analyze textual data: "grounded theory", a qualitative method, and "topic modeling", a quantitative approach, both applied to the same dataset to compare how much they diverge from or converge with each other. It is interesting to see that these two approaches gave similar results and were often highly aligned with each other, despite involving different iterative processes: one depends fully on human supervision, while the other is an unsupervised process executed by a machine. Both methods have their advantages and drawbacks. In the particular case of grounded theory, the inevitable bias of the researchers doing the analysis still concerns me; topic modeling, in turn, lacks the ability to interpret sentiments and/or intentions, although this issue might be mitigated, or at least improved, by adding a sentiment-analysis system. I find topic modeling a very useful method, particularly because of the little time it requires to read and process texts and data (compared to the grounded theory method). At this time the two approaches seem to complement each other, but it seems quite promising that other computational text techniques might lead to surprising results.

2. Cluster analysis

I chose clusters with an average number of members, to facilitate the process of identifying the main topic. In the first cluster, the comments shared similar characteristics in general: support for Naomi and rejection of Greta. People mentioned the idea of a TV debate between Naomi and Greta and which of them would win, as well as the idea that Greta would not be able to answer and win such a debate without a script to follow. The included comments contain a few of the same words, such as "debate", "script", and "not anti-Greta". In the second cluster, interestingly, none of the comments were debate-related (according to Google Translate). The feature they had in common was instead the written language (other than English): four comments in German and one in Dutch.

This quantitative method would be useful for finding and classifying the most representative topics debated around a set of texts, such as YouTube video comments or tweets. Finding such clusters is a challenging task that would take a considerable amount of time to do manually, especially when the dataset is large.

Link to Notebook

alexkhrustalev commented 3 years ago

Reading assignment

The article "Comparing Grounded Theory and Topic Modeling: Extreme Divergence or Unlikely Convergence?" by Baumer, Mimno, Guha, Quan, and Gay presents the differences between the social science and natural language processing approaches, applied to survey answers in text format. The results of the analysis are groups formed based on the survey answers. In my opinion, both the grounded theory method and statistical topic modelling reached very good results, as the groups do not overlap within either study. However, for me, as a big fan of computational methods, automation, and saving time, the natural language processing approach is more attractive, since the time spent on the social science research was up to 40 times more than the time spent on the statistical topic modelling. Thus, since both methods handled their tasks equally well, the computational method is preferable to me.

Cluster Analysis

The first cluster has only two comments. They are about Trump and Biden and contain the opinion that neither is a good leader. The second cluster has 21 comments; it contains only comments that support Biden and defend his opinions and answers. While the clusters are quite good overall, some comments do not seem to have any relation to the main topic. In my opinion, it would be possible to use a purely quantitative approach: although the results are not ideal, they are still very good and reveal more topics than one could find manually. Some NLP techniques could probably be used to validate the results and remove wrongly assigned clusters. Overall, such an analysis gives a lot of information and creates a good understanding of how the topics divide.

Link to the Notebook

anastasiia-todoshchuk commented 3 years ago

1. Reading assignment

While reading the paper, before I reached the results section, I kept wondering whether the results of the grounded theory research and of the statistical topic modelling would be the same. I assumed that the social science approach would yield topics that are more elegant and diffuse, whereas the machine learning approach would give more straightforward and specific topics. As it turned out, my assumption was both right and wrong. Half of the topics produced by statistical topic modelling were classified by the general mood of the feedback: positive, negative, no reaction. The other topics from the machine learning method represent specific types of feelings or actions. The topics generated via grounded theory are divided into unrelated groups: triggers, morality, friends' reactions, and so on. Thus, statistical topic modelling made more specific clusters (at least half of them are specific), while the social science approach produced the smarter groups.

2. Cluster Analysis

I worked together with @Einnmann.

Cloudz333 commented 3 years ago

Reading assignment:

I found the key concepts of the study very well expressed and structured, so I could understand the methods used even though I was not familiar with them. I was also able to relate the research question to a more general, much-discussed topic, "Can AI replace human intelligence?", and since I am quite aware of that debate, I expected exactly these results: as in many other fields, one does not exclude the other; rather, they are complementary.

What I found interesting is how the computational approach can be combined with the interpretative approach in this specific use case. Grounded Theory requires complex contextual knowledge and a relatively high amount of time to identify patterns in the data, while with topic modeling it is possible to obtain immediate results aligned with those found by human researchers. That’s why I believe that topic modeling can be a really powerful tool for exploratory analysis.

Cluster Analysis:

I decided to keep the dataset from the previous assignment, but since it is very small (72 comments in total), the overall quality of the clusters is pretty low. Still, I found two interesting clusters, even if they contain a limited number of comments (5 and 10, respectively).

In the first cluster, the discussion revolves around possible hidden interests that could drive the green or oil industries. The cluster seems to be of good quality, but since the discussion is about business and money, a comment that has nothing to do with the debate was mistakenly included as well: "I've seen happy and even giddy people talking about trading in their two year old luxury Land Rover for a $120,000 electric luxury car for a three mile commute". The second discussion is about whether global warming can be beneficial for us, since humans can live better in warm weather. This cluster was detected very well, and all of its comments relate to the same topic.

After this preliminary test of the pipeline, and after reading the paper, I better understand how the algorithm works in this context, so in the project I would rather use a mixed approach: a quantitative approach for exploratory analysis, followed by a qualitative analysis.

LINK TO THE CODE

yuxin16 commented 3 years ago

1. Reading Assignment

Paper: Baumer, E., Mimno, D., Guha, S., Quan, E., & Gay, G. (2017). Comparing grounded theory and topic modeling: Extreme divergence or unlikely convergence? Journal of the Association for Information Science and Technology, 68. https://doi.org/10.1002/asi.23786

The paper compares two methods for topic generation on the same data: grounded theory, an interpretive approach from a social science perspective, and LDA topic modelling from a computer science perspective. The authors found both convergence and divergence in the topic groupings. They summarized the themes (from grounded theory) and topics (from LDA) in Table 1 and generally found a "two-to-two" mapping between the methods.

I am not surprised by the convergence between the two methods, since it is quite understandable that statistical topic modelling can analyze words based on their syntax. However, computational models still lack the ability to do semantic analysis, and therefore methods based on specific algorithms cannot condense words into high-level themes based on their semantics/meanings.

LDA is a "bag-of-words" model, which means the method only checks the occurrence frequency of words, without considering their order or relationships to other words. For example, "A likes B" is treated the same as "B likes A" by an LDA model. This characteristic can result in a misunderstanding of documents and therefore lead to unwanted topic-modelling results, as the sketch below illustrates.
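This order-blindness is easy to demonstrate with a plain bag-of-words vectorizer (a minimal sketch; scikit-learn's `CountVectorizer` stands in for the word-counting step that precedes LDA):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["A likes B", "B likes A"]

# token_pattern is loosened so single-letter tokens like "A" survive
vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
bow = vec.fit_transform(docs).toarray()

print(vec.get_feature_names_out())  # ['a' 'b' 'likes']
print(bow)                          # both rows: [1 1 1] -- word order is lost
```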

I still have some doubts about this paper, and I am not sure it has sufficient scientific rigor. The authors set the number of topics to 10 based on the overall size of the corpus as well as on testing different numbers of topics, but they don't describe what they did in those tests or on what criteria they chose 10. As far as I know, there is a measurement called "perplexity" for choosing a suitable number of topics for LDA; the authors didn't use it, and I don't know why they preferred 10 over other numbers.
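A minimal sketch of such a perplexity check, using scikit-learn's LDA (the corpus and the candidate topic counts are made up for illustration):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["cats purr and sleep all day", "dogs bark and fetch sticks",
        "cats chase mice at night", "dogs chase cats sometimes",
        "stocks rose sharply today", "markets fell on bad news"]
X = CountVectorizer(stop_words="english").fit_transform(docs)

# Lower perplexity = better fit; scored on the training data here for
# brevity, though a held-out split would be more principled.
for n in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=n, random_state=0).fit(X)
    print(n, "topics -> perplexity", round(lda.perplexity(X), 1))
```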

Furthermore, the researchers split each survey response (with two questions) into two separately answered questions and treated each answer as a single document. Thus, the total number of documents is double that used in grounded theory, and the dependency of the second question on the first cannot be revealed in the data. This may influence the topic-modelling outcome, which is not discussed in the paper. Additionally, the researchers removed non-English responses in data preprocessing, but they only ignored the non-English topic (topic 6) after the topic modelling had been done by LDA. The comparison results might be different if the data input were identical for both analysis methods.

All in all, the comparison and discussion are not insightful enough for me. For example, the researchers state that both methods involve iterative processes and require different amounts of time, which we as readers already know from the methods themselves, without analyzing any data. The discussion part is too general; it reads more like a descriptive summary of both methods, without digging into the interpretation of the data.

2. Clustering Assignment

Based on the best silhouette score, the optimal number of clusters for my data seems to be 100. Since most clusters contain fewer than 5 comments, from which it is hard to generalize common points, I chose the two clusters with the most comments (16): cluster 11 and cluster 69. Cluster 11 is something of a "BBC" group, since all its comments mention the BBC, but their content varies from complaints against the BBC to the "climate gate" it reported on. Cluster 69 can be considered the "real talk around climate change": it contains keywords such as "sea level", "CO2", "human activities", "climate change", "fossil fuel", "arctic ice", "ice melting", etc. Interestingly, if a comment doesn't include the keywords but is a reply to a comment assigned to this cluster, the keyword-free comment is also clustered into the same cluster.

This clearly shows that the pipeline does a decent job of recognizing words based on syntax, but digging into semantics still cannot be handled well by pure algorithms. The two chosen clusters are quite okay, and the many small clusters with only 1 or 2 comments could be grouped together into a "nonsense" group.

The choice of analysis approach really depends on your data and your analysis purpose. I think a purely quantitative approach works well on numerical input, or on tabular data with words as categorical variables that can easily be processed by one-hot or multi-label encoding. For text mining, and especially semantics mining, given the interpretability, transparency, and trustworthiness of today's algorithms, I would not suggest a purely quantitative approach to optimizing the clustering pipeline. Current algorithms are still not able to analyze data integrated in its contextual background; that is, they analyze the data without considering socio-technological perspectives and cannot dig out the meanings and emotions hidden behind the words.

Link to Assignment5

Rahaf66 commented 3 years ago

1. Reading assignment The paper discusses a very important and interesting point, especially for textual data, where it is not always possible to evaluate the accuracy and performance of computational approaches.

As a further exploration, I am very interested in the question of whether computational approaches to text, such as sentiment analysis (especially sentiment analysis of literary texts), will always require human interpretation. I strongly agree that, for sentiment analysis as for topic modeling, neither grounded theory nor computational approaches alone could achieve the best performance.

The most problematic issue in sentiment analysis is that there are no "best results", even when researchers compare their results, because the validation of the models depends on human rating data, which is subjective and sensitive to many factors (such as the situation and the mood of the reader). In other words, I think grounded theory itself needs to be analyzed. From this perspective, I admire the suggestions in this paper regarding the complementary role of each approach.

2. Cluster Analysis The two clusters I chose are 60, with 11 samples, and 64, with 6 samples. All the comments in the first cluster discuss climate change prevention and its relation to socialism, regarding the role and control of the government. The comments in the second cluster critique the skeptics and hold that old people deny the fact of climate change.

The quality of the clusters is good: comments on the same topic were clustered together. For some clusters, I noticed that the clustering depends on the frequency of certain words (as in the first cluster), whereas other comments were clustered based on their overall meaning (as in the second).

A purely quantitative approach to optimizing the clustering pipeline would not be enough, but a purely qualitative approach is not applicable either, especially for large datasets; the paper in the reading assignment explains this comprehensively. On the other hand, in this course we have analyzed the same data at different levels. There are differences in the results and in my understanding of the data depending on the tool used, but I think that reading the comments first (a kind of grounded theory approach) was very helpful and important.

Notebook

Aylin-00 commented 3 years ago

1. Reading Assignment The paper compared two methods of analyzing and clustering documents. One, called topic modelling, can be automated and uses only statistical methods, while the second, grounded theory, needs human supervision and language understanding.

In my eyes, the grounded theory model can be seen as an optimum for topic modelling, because social scientists are able to detect similarities between two documents on a meta-level. I would therefore assume that a topic model has good results if it is close to the grounded theory model. Nevertheless, the researchers present a combined approach as the optimum, so that qualitative and quantitative analysis complement each other.

I was surprised that both methods had very similar results on this data. I would have expected that purely statistical analysis, without paying attention to the context of words, the meta-level of human language, etc., would generate less accurate clusters. I would like to know whether that similarity is robust by testing it on a variety of data; maybe it is possible to find criteria for deciding when statistical analysis is suited for good clustering.

2. Cluster Analysis There are no obvious clusters in the data. Maybe the number of samples is too small; maybe the problem is the overall similarity of the expressed opinions: most people have a very negative view of the corona handling and use similar arguments about the lockdowns, vaccines, globalists, etc. I got the best result with 3 clusters, but I chose that number by eyeballing the 2D data, not via the elbow method; my graph showed no curve at all (see the sketch below). The similarity within the clusters was not obvious to me: some of the few science-based comments were clustered with comments about chemtrails and the new world order.
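For reference, an elbow plot of this kind can be sketched as follows (assuming scikit-learn-extra's `KMedoids` and TF-IDF features; the toy corpus is an illustrative assumption, and with data like that described above the curve would indeed stay nearly flat):

```python
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn_extra.cluster import KMedoids  # pip install scikit-learn-extra

# Toy corpus standing in for the real comments
comments = ["lockdowns are pointless", "end the lockdowns now",
            "vaccines are rushed", "I do not trust the vaccines",
            "globalists planned this", "the globalists profit",
            "the data says otherwise", "look at the actual studies"]
X = TfidfVectorizer().fit_transform(comments).toarray()

ks = range(2, 8)
inertias = [KMedoids(n_clusters=k, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia (sum of distances to nearest medoid)")
plt.show()  # a visible bend suggests a natural k; a flat line suggests none
```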

Notebook

milanbargiel commented 3 years ago

1. Reading Assignment I liked the way the authors concluded that machine learning methods and topic modelling offer new perspectives on a dataset, but that the human perspective is still needed to contextualize the information. Their results, and especially the divergent word clouds, made that clear to me. It was actually quite difficult for me to come to the same conclusions as the researchers when reading the top-25 word lists, so I guess a lot of interpretation comes into play. Furthermore, I found the potential use cases for topic models interesting: it would be fascinating to analyze personal journals or dream books with machine learning and find recurring themes of mental models. The ability of machine learning to classify huge (unreadable) amounts of text is very inspiring.

2. Cluster Analysis I chose clusters 14 & 4 due to their respective isolation from the main corpus.

cluster 14: This cluster is clearly separated from the other embeddings. It is a discussion between two users about the polls from the 2016 US presidential elections and their correctness. The users insult each other quite heavily and use capital letters to make their points. The proportion of replies is 100%.

cluster 4: This cluster comprises very short, incomplete sentences, often single words like "Why?", "Laughable", "Ireland". I could not identify a thematic relation between these embeddings. Interestingly, single links and anchor tags are also found within this cluster; they are cleaned by the algorithm and therefore may end up matching the group of short, incomplete sentences.

The results were somewhat disappointing to me; topics could not easily be deduced from the dataset. I am not sure whether a purely quantitative approach to optimizing the clustering pipeline would help: the embeddings are very close to each other and the clustering algorithm has severe problems identifying groups. Further human contextualization is needed to really make sense of the data, and a clear research question would be helpful.

Notebook

[screenshot: chosen clusters]
Francosinus commented 3 years ago

1. Reading

As the name of the paper implies, the researchers compare grounded theory with topic modeling, which was really interesting to read. Since the dataset wasn't that big, it was possible to apply both approaches and compare them. The grounded theory approach is a human-made "topic modeling": based on the data, the researchers categorize it into several topics and interpret them. The topic modeling approach does basically the same but uses the frequency of certain words in the documents to categorize the data. Both approaches gave similar results, though the grounded theory approach is a bit more precise: looking at the topic containing longer responses from the participants, it can be seen that the topic modeling approach was not certain about the topic. Overall, topic modeling still needs human interaction when it comes to interpreting the results. But if the dataset is large, it makes sense to use a mathematical approach rather than grounded theory alone, since the latter could be too time-consuming.

2.Clustering

For the clustering part, I chose two clusters with rather high density. I assigned the topic "doom" to the first cluster, since its comments are all about people saying that we are doomed. The other cluster commonly contains the word "nuclear" in its comments, which was a main topic in the video as well. Looking at the high-density clusters, the results are easily interpretable. But I also looked into other clusters whose points are more scattered, and found that it was sometimes not so easy to interpret the comments or assign a topic; this was most often the case when the comments were really long. Short comments are more likely to be clustered together (or more often have a word in common), so it might also make sense to categorize the comments into topics using topic modelling (especially the longer ones).

Notebook

JuliusBro commented 3 years ago

Reading Assignment

Personally, I would have liked some further discussion of implicit or even explicit bias in grounded theory. It is only mentioned in passing in the related work as one possible criticism, and while other criticisms are refuted, this one is not explored further. I find this especially important, as a bias or preconceived notion greatly affects the hypothesis postulated in the first steps of the approach, which in turn shapes how the data is interpreted. It seems to me that the only safeguard against this is the rest of the scientific community, who may call out these biases. That too has problems: the bias might not be as recognizable to others after the hypothesis has been selected, and the researchers conducting a review might suffer from similar biases themselves. Some more insight into how best to work around this problem of grounded theory, or even a warning, would have been helpful.

Cluster Analysis

The clusters I selected are not really representative of the dataset as a whole, as single-point clusters made up the majority. However, the clusters that did appear have some obvious and helpful commonalities. Both contain longer comments than is usual for this dataset (or maybe even for YouTube comments in general). The first cluster focuses on global/US policy, especially with regard to China, and actually contains reasonable discussion. The second cluster contains even longer comments discussing the reality of man-made climate change, using semi-scientific arguments both for and against humanity's influence.

A purely quantitative approach might not be able to find all connections or interpretations of the data; however, it helps immensely with identifying relevant clusters from which human researchers can draw the relevant conclusions.

Notebook

travela commented 3 years ago

1. Reading

The paper leaves me hopeful that no employees at the giant tech companies are actually reading my data: they too must be leveraging the countless hours that can be saved by using a computational method over a qualitative, interpretive one for use cases where the results of the two align. It surprised me that the authors spent 2.5 months on the traditional method when the computational method yielded similar findings in 2 days. However, there is still a chance that my data will be read: one insight was that the topic modeling approach also requires humans to read the data to create topic labels, albeit only on a sample basis. I felt I would not have been able to pinpoint a topic just from looking at the word lists generated by the algorithm; one still needs to skim through the data, but far less extensively.

2. Cluster Analysis

There was only one cluster that stood out not just by its colour but also geometrically; it is the one shown in the image below. When I investigated it, it turned out to be the cluster of comments with high word counts. So I then investigated the cluster at the opposite end of the point cloud, and those turned out to be exactly the single-word comments. At this point I was a bit disillusioned, as it seemed the clustering would just group comments of the same length, which is a meta property unrelated to the content. However, I then picked another cluster, number 21, which appeared to be of significant size. All of its comments had a "male" connotation, essentially referring to Donald Trump, who speaks in the video about climate change. The comments all follow the form "he does/believes/makes" or "this/the dude [verb] ..."; sometimes there is only a "him" in the comment, but the pipeline still managed to cluster those together. While this makes me more optimistic, it is still a very shallow level of understanding of the comments' meaning, and the analysis definitely still requires a significant amount of human judgement.
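A quick way to confirm this length effect is to compare average comment length per cluster (a sketch with toy stand-ins; the pipeline's real comment texts and label assignments are assumptions here):

```python
import pandas as pd

# Toy stand-ins for the pipeline's comment texts and cluster labels
df = pd.DataFrame({
    "comment": ["wow", "lol", "nice try",
                "a very long argument about the 2016 polls, statistics and bias",
                "another long rant about how the polls were wrong yet again"],
    "label": [0, 0, 0, 1, 1],
})

df["n_words"] = df["comment"].str.split().str.len()
print(df.groupby("label")["n_words"].mean())
# If mean length alone separates the clusters this cleanly, the embedding
# is likely capturing a meta property (length) rather than content.
```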

Notebook

[screenshot: cluster visualization]

DanielKirchner commented 3 years ago

1 Reading assignment

This paper shows where qualitative and quantitative research approaches to textual data can produce different (or common) insights, and it shows the advantages and disadvantages of each. I really like the takeaway, which is basically that neither field gives more "valid" answers, but each provides tools with different upsides and downsides. The most outstanding one is of course the time/quality trade-off that follows from the low throughput of human workers compared to computers: a manual review of data (see p. 1401) takes a long time and can often be frustrating and unrewarding for the researchers. Hence a mixed method combining both approaches should be the aim for textual analysis; in this way an optimal trade-off between time, human work, and quality can be achieved.

2 Cluster Analysis

The default number of clusters (100) in the notebook did not work well for me, as there were too few members in each cluster. A cluster count of 30 was more appropriate for my comment section and yielded distinct clusters, which I concluded after manually going through some of the groups for different cluster sizes. The cluster that stood out most to me was cluster 18 in the 30-cluster configuration. It contains comments related to politicians (specifically Trump) and climate change:

Trump my mentor ☝️

To right well said. they like to criticise trump but he got that right, wise man not a sheep politician.

... WE NEED A LEADER WHO HAS GUT'S .. AN AUSSIE TRUMP TO SAY ...

Since the quantitative approach (relying only on the silhouette_avg metric) did not get me the best results, I would conclude that a manual review by a human being has at least some value. In general, the approach to optimization should always depend on the question the researcher is trying to answer with the given pipeline. Since we are not dealing with a "pure" math problem here but with human-written text, a human's opinion on the output of every algorithm is very valuable.

Notebook Link

Kosmopy commented 3 years ago

Reading Assignment The authors mention the different traditions behind the two methods: they argue that NLP draws on a positivist tradition, while grounded theory comes from an interpretivist tradition. I think we should not underestimate the extent to which machine learning results can be influenced, and therefore differently interpreted, by their users, ranging from the choice of the sample and the cleaning process to the setting of hyper-parameters. It would thus be another very interesting research question to challenge the idea of the objectivity of the NLP approach by comparing the results of the same approach conducted by different research teams.

Cluster Analysis Viewing the clusters, my first impression is that they were created without knowing "the research question". Comments of the first displayed cluster seem to have a reference to language, or more precisely to German, in common: comments either mention language/German or are simply written in German, in which case their content is widely neglected. In the second cluster, words like misinformation and propaganda are predominant; the comments seem to share a topic, but that does not mean they share an argument. What would the "optimized" pipeline look like? Which purpose is it supposed to fulfil? I guess these are questions that need to be answered before thinking about the right approach.

Notebook

xixuanzh commented 3 years ago

Reading: will do it later.

Cluster Analysis with Sabina P We selected the largest cluster first and later randomly picked one other cluster to test the quality of the clustering. Cluster 4 (the largest), which contains links, short phrases, and emojis, gives a clustering result of medium quality, while cluster 20, concentrating on technological features and YouTube's platform policy, is clustered with high quality. A quantitative metric such as silhouette_avg does help us make decisions, yet we should only use it as a reference: choosing 2 as the number of clusters would lead to undergeneralizing, even though it has the highest silhouette_avg value. As when choosing an optimal solution for k-means, we should select the number of clusters based on our experience and observation of the dataset.

Link to Notebook: https://github.com/FUB-HCC/seminar_critical-social-media-analysis/blob/master/Pipeline/Assignment_5/sabina_p_and_xixuan_zhang_assignment_5/Assignment-5_Clustering.ipynb