I ran an STM analysis using the stm package with 56 topics on my dataset (the dataset consists of tweets):
XSTM <- stm(out$documents, out$vocab, K=56, max.em.its=75, init.type="Spectral", seed=8458159)
and plotted it.
plot(XSTM, type="summary", xlim=c(0,.2))
Out of the 56 topics, 11 are relevant to me. I want to subset all the tweets linked to these 11 topics from my original dataset. The only way I can think of is to manually get the most frequent words for all 11 topics:
plot(XSTM, type="labels", topics=c(1,2,3,4,5,6,7,8,9,10,11))
Then manually write all these keywords down and subset my original dataset so that I keep only tweets that contain at least one of them, for example:

Dataset$Trump <- str_extract(Dataset$tweet_text, "Trump")
Dataset$Hillary <- str_extract(Dataset$tweet_text, "Hillary")
Dataset$president <- str_extract(Dataset$tweet_text, "president")
Dataset_keywords <- Dataset %>%
  filter_at(vars(5, 6, 7), any_vars(. %in% c("Trump", "Hillary", "president")))
The problem is that the example above uses only three words; in reality, I have 11 topics, each with around eight top words as identified by STM, which gives me roughly 88 terms.
Does anyone have an easier way of identifying tweets relevant to my selected topics?
I would recommend using the topic loadings instead of keyword matching. You can use findThoughts() to do this automatically, or you can query XSTM$theta directly, which holds the topic proportions for each document (rows are documents, columns are topics).
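As a minimal sketch, assuming the relevant topics are numbers 1 through 11 and that the rows of `Dataset` still line up with `out$documents` (i.e. no tweets were dropped during preprocessing; if some were, match on the indices stored in `out$docs.removed` first):

```r
library(stm)

relevant_topics <- 1:11

# findThoughts() pulls the documents most associated with given topics;
# n and thresh are illustrative values, tune them to your needs.
thoughts <- findThoughts(XSTM, texts = Dataset$tweet_text,
                         topics = relevant_topics, n = 20)

# XSTM$theta is a documents-by-topics matrix of topic proportions.
# Option 1: keep tweets whose single most likely topic is relevant.
top_topic <- apply(XSTM$theta, 1, which.max)
Dataset_relevant <- Dataset[top_topic %in% relevant_topics, ]

# Option 2: keep tweets where the relevant topics jointly account for
# more than some proportion of the document (0.5 is an arbitrary cutoff).
relevant_mass <- rowSums(XSTM$theta[, relevant_topics])
Dataset_relevant2 <- Dataset[relevant_mass > 0.5, ]
```

Thresholding on theta (Option 2) is usually more robust than keyword matching, since it uses the model's own estimate of how much of each tweet belongs to your topics rather than the presence of individual words.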