bstewart / stm

An R Package for the Structural Topic Model

Subsetting tweets relevant to my topics #214

Open Dom-320 opened 4 years ago

Dom-320 commented 4 years ago

I ran an STM analysis via the stm package with 56 topics on my dataset (the dataset contains a number of tweets):

XSTM <- stm(out$documents, out$vocab, K=56, max.em.its=75, init.type="Spectral", seed=8458159)

and plotted it:

plot(XSTM, type="summary", xlim=c(0,.2))

Out of the 56 topics, there are 11 of them which are relevant to me. I want to subset all the tweets that are linked to these 11 topics from my original dataset. The only way I can think of is to manually get the most frequent words for all 11 topics:

plot(XSTM, type="labels", topics=c(1,2,3,4,5,6,7,8,9,10,11))

Then manually write all these key words down and subset my original dataset so that I would get only tweets that contain at least one of them, like this for example:

Dataset$Trump <- str_extract(Dataset$tweet_text, "Trump")
Dataset$Hillary <- str_extract(Dataset$tweet_text, "Hillary")
Dataset$president <- str_extract(Dataset$tweet_text, "president")
Dataset_keywords <- Dataset %>% filter_at(vars(5,6,7), any_vars(. %in% c('Trump',"Hillary","president")))

The problem is that the example above only has three words - in reality, I have 11 topics and each has around 8 most frequent words as identified by STM, which gives me around 88 terms.

Does anyone have an easier way of identifying tweets relevant to my selected topics?

bstewart commented 4 years ago

I would recommend using the topic loadings. You can use findThoughts() to do this automatically, or you can query XSTM$theta directly: it holds the topic proportions (columns) for each document (rows).
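
For illustration, the theta route could be sketched roughly as follows. This is only a sketch under assumptions, not the package's own method: the helper name docs_for_topics and the 0.5 cutoff are made up for the example, and a small toy matrix stands in for XSTM$theta so the snippet runs without a fitted model.

```r
# Hypothetical helper (not part of stm): return the row indices of `theta`
# whose combined proportion over the chosen `topics` is at least `min_share`.
docs_for_topics <- function(theta, topics, min_share = 0.5) {
  # drop = FALSE keeps a matrix even when a single topic is selected
  share <- rowSums(theta[, topics, drop = FALSE])
  which(share >= min_share)
}

# Toy stand-in for XSTM$theta: 4 documents (rows) x 3 topics (columns),
# each row summing to 1.
theta_toy <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.1, 0.8,
    0.3, 0.3, 0.4,
    0.0, 0.9, 0.1),
  ncol = 3, byrow = TRUE
)

# Documents where topics 1 and 2 together make up at least half the content
keep <- docs_for_topics(theta_toy, topics = c(1, 2), min_share = 0.5)
print(keep)
```

With the fitted model, the same call would presumably be docs_for_topics(XSTM$theta, topics = 1:11), with the result used to index the data the model was fit on. The cutoff is a judgment call and worth varying. Also note that the rows of theta line up with out$documents, which may not match Dataset row for row if preprocessing dropped any tweets, so check the docs.removed element returned by textProcessor() and prepDocuments() before indexing the original data frame.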