Data4Democracy / far-right-analysis

Analysis related to the behavior of extreme far right online communities

Topic modeling for Breitbart #11

Open sjacks26 opened 7 years ago

sjacks26 commented 7 years ago

We have a substantial collection of Breitbart articles (240k+ from 2012 through mid-January 2017, plus daily scrapes since January 15). The site has sections that roughly correspond to categories, but there are only a handful of them, and some are bureaus (like London and Jerusalem). The Breitbart archive collection on data.world doesn't contain the full article text, so this task will use the data in the S3 collection (more info about that data here). The collection has one file per section.

It would be cool to do some unsupervised topic modeling (LDA, for example) to see what topics emerge from the data. There are two logical ways to do this. The first thing to try is topic modeling across all Breitbart articles. Depending on those results, the next thing to try is topic modeling on the articles within each section separately.

One of the most difficult parts of unsupervised topic modeling is getting topics that are coherent and meaningful. If you want some help figuring out whether your topics make sense, drop a line in the far-right Slack channel.
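Before asking a human, one rough automatic check is a UMass-style coherence score, which rewards topics whose top words tend to appear in the same documents. The sketch below is a plain-Python illustration under that assumption; the `umass_coherence` helper and the toy documents are hypothetical, and a score like this complements reading the topics yourself rather than replacing it.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass-style coherence for a single topic.

    top_words: topic words ordered from most to least probable.
    documents: iterable of token lists used as the reference corpus.
    Higher (closer to zero or positive) means the words co-occur more often.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing every one of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    # combinations() preserves input order, so `earlier` always outranks `later`.
    for earlier, later in combinations(top_words, 2):
        denom = doc_freq(earlier)
        if denom == 0:
            continue  # word never appears in the reference documents; skip the pair
        score += math.log((doc_freq(earlier, later) + 1) / denom)
    return score


# Hypothetical toy documents.
docs = [
    ["election", "vote"],
    ["election", "vote", "senate"],
    ["phone", "app"],
    ["phone", "app", "update"],
]

coherent = umass_coherence(["election", "vote"], docs)
mixed = umass_coherence(["election", "phone"], docs)
print(coherent > mixed)
```

Here the topic whose words share documents scores higher than the one that mixes unrelated words, which is the behavior you would hope to see in a usable topic.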

Once we have good topics for Breitbart, we can do all sorts of interesting and useful analysis. For example, we can see whether there is variation over time in the number of articles for different topics. But first, we need topic models!
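As an illustration of the over-time analysis, here is a small sketch that counts articles per topic per month. It assumes each article has already been assigned a single dominant topic by some model; the `labelled` rows and the `monthly_topic_counts` helper are hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical (publication date, dominant topic id) pairs.
labelled = [
    (date(2016, 11, 3), 0),
    (date(2016, 11, 20), 1),
    (date(2016, 11, 28), 0),
    (date(2016, 12, 5), 1),
    (date(2016, 12, 9), 1),
]


def monthly_topic_counts(rows):
    """Count articles per (YYYY-MM, topic id) pair."""
    counts = Counter()
    for published, topic in rows:
        counts[(published.strftime("%Y-%m"), topic)] += 1
    return counts


counts = monthly_topic_counts(labelled)
for (month, topic), n in sorted(counts.items()):
    print(month, topic, n)
```

Dividing each count by the month's total article count would give topic shares, which are easier to compare when the scrape volume changes over time.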

bartleyn commented 7 years ago

I think when it comes to modeling something like Breitbart, it might also be interesting to see if we can infer something about the authors as well, so I'm trying two things:

1) an Author-Topic model on the entire corpus
2) an LDA on the entire corpus

As you said, the easiest thing is to run these across the entire corpus and then let those results inform what to do next.

I was originally going to use a MATLAB implementation of both algorithms (from UCI here in case you're interested), but since Gensim recently released an implementation of an author-topic model, I may as well shift to a Python-based approach. Building a Python framework for running these models may make future analyses easier anyway.

musiciancodes commented 7 years ago

I'm interested in trying this and using Gensim/nltk.