Open brianp1 opened 7 years ago
`LDA` function. Run ?`LDAcontrol-class` in the console to get a list of potential options for `control`. Note that you shouldn't need to change any of them for your model to run. At most adjust `k` to control the number of topics in the model.
`LDA()` won't work with a tidytext data frame. It requires a document-term matrix, so you have to convert your tidy data using `cast_dtm()` in order to estimate an LDA model.
`speech_corpus`? Has this already tokenized the text? If so, what is the output of this?
speech_td %>%
  anti_join(stop_words, by = c(term = "word"))
You should get a data frame with the stop words removed. You can look directly at `stop_words` to see what terms this includes. However, by only setting 2 topics in the model, any remaining stop words might form the dominant topic structure. You could try to avoid this by weighting the dtm with tf-idf instead of raw term frequency. To do this, change `cast_dtm` to `cast_dtm(docid, word, n, weighting = tm::weightTfIdf)` (note that `cast_dtm()` expects the document column first, then the term column, then the count column).
This uses the `weightTfIdf` function from the `tm` library to adjust the weights given how frequently a term appears across all documents.
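For what it's worth, here is a minimal sketch of the tidy-to-DTM step described above. The toy `word_counts` data and its document ids are made up for illustration, and it assumes the `tm` package is installed, since `cast_dtm()` builds a `tm` object.

```r
library(dplyr)
library(tidytext)

# Toy word counts, one row per document-word pair (made-up values)
word_counts <- tibble(
  docid = c("trump1", "trump1", "clinton1", "clinton1"),
  word  = c("the", "jobs", "the", "health"),
  n     = c(10, 3, 8, 2)
)

# Drop stop words, then cast to a document-term matrix:
# document column first, then term, then count
speech_dtm <- word_counts %>%
  anti_join(stop_words, by = "word") %>%
  cast_dtm(docid, word, n)

speech_dtm
```

The resulting `speech_dtm` is the kind of object that `LDA()` takes as input.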
Thank you. With your assistance, I was able to get my first 4 topics modeled. Not shockingly, the topics all resemble american, people, country, and president. So, I need to go back and modify my stop words and try the tf-idf. A couple more questions. So, I was updating my sentiment graphs, and this result dawned on me: And I realize all this is demonstrating to me is that I have significantly more Trump speeches than I do for any other candidate. This will also be a problem when examining the topics of the campaign. So, I have a few options.
In addition, I was wondering in what capacity I could try n-grams? It may not be incredibly beneficial for the sentiment aspect, but would it be useful for the topic modeling?
Having a different number of speeches per candidate isn't a bad thing. In your graphs, present the bars as a percentage of each candidate's speeches allocated to each emotion, rather than as a raw frequency count. This will normalize for the total number of words in each candidate's corpus and allow you to compare relative affective content between candidates.
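A minimal sketch of that normalization, assuming you already have a count of emotion words per candidate. The `emotion_counts` data frame and its numbers are made up for illustration.

```r
library(dplyr)

# Toy counts of emotion words per candidate (made-up numbers)
emotion_counts <- tibble(
  author    = c("A", "A", "B", "B"),
  sentiment = c("joy", "anger", "joy", "anger"),
  n         = c(300, 100, 30, 10)
)

# Divide each count by that candidate's total so the bars are comparable
emotion_pct <- emotion_counts %>%
  group_by(author) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup()

emotion_pct
```

Candidate A has ten times as many emotion words as B here, but once you plot `pct` instead of `n` the two profiles come out the same.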
You could use n-grams for topic modeling (not sentiment analysis, well certainly not in an easy manner that can be done before the project is due), especially if key phrases or slogans are used repeatedly (#MAGA).
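If you want to try it, a rough sketch of bigram tokenization with `unnest_tokens()` looks like this. The `speeches` data frame is a toy stand-in, not your corpus.

```r
library(dplyr)
library(tidytext)

# Two made-up "speeches"
speeches <- tibble(
  docid = c("doc1", "doc2"),
  text  = c("make america great again",
            "we will make america great again")
)

# Tokenize into two-word phrases instead of single words,
# then count each phrase within each document
bigrams <- speeches %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(docid, bigram, sort = TRUE)

bigrams
```

The counts have the same shape as your word counts, so they could go through `cast_dtm(docid, bigram, n)` in the same way.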
I'll look through the n-gram literature in a minute. So, I was doing the inverse document frequency, and I went back to try to clean it up by getting rid of general names, state names, and contractions, so I decided to create and add my own list of stop words:
mystopwords <- data_frame(word = c("texas", "smith", "cooper", "tianna", "barbara", "freia", "ruline", "miami", "reid", "caroline", "smith", "netanyahu", "michael", "gordon", "gordy", "sharansky", "don't", "that's", "they're", "we're", "mcdowell", "steve", "sanders", "milwaukee", "maine", "jackson", "indiana"))
But, when I go to filter the document to make sure these words are no longer there, I get this error message:
Error in eval(substitute(expr), envir, enclos) : corrupt 'grouped_df', contains 116895 rows, and 331040 rows in groups
I am not sure what is going on.
Also, I am trying to get an average of this count, essentially the number of times there is a pause for chanting or applauding divided by the number of speeches given. Here is where I am at so far, but I don't think I am on the right track, seeing as I want to divide the number of pauses by the total number of speeches:
speech_corpus %>%
group_by(author) %>%
filter(word == "applause" | word == "cheers")
count() %>%
kable()
speech_corpus %>%
group_by(author, docnumber) %>%
filter(word == "applause" | word == "cheers") %>%
count() %>%
mutate(app_sum = sum(n)) %>%
mean(n)
Sorry, if this is vague, I can try to be more specific.
Also, is the seed variable just a random number generator or is it something I am supposed to calculate?
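The seed is just the starting value for R's random number generator; it is not something you calculate. Pick any integer, set it once at the beginning of the script (`set.seed(1234)`), and you are done. Your results will then be reproducible from run to run.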
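A quick way to convince yourself, using only base R: resetting the same seed reproduces the same "random" draws.

```r
# Same seed, same sequence of random numbers
set.seed(1234)
first_draw <- runif(3)

set.seed(1234)
second_draw <- runif(3)

identical(first_draw, second_draw)
```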
What exactly is the code you are using to merge `mystopwords` with your corpus?
For the last question, I think this code will work (it does in my head at least):
speech_corpus %>%
  group_by(author, docnumber) %>%
  filter(word == "applause" | word == "cheers") %>%
  count() %>%
  group_by(author) %>%
  mutate(n_per_speech = n / n())
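If what you want is a single average per candidate, another way to sketch it is to collapse to one row per author with `summarize()`. This version also counts speeches with no applause at all in the denominator, which is an assumption about what you want. The toy `speech_corpus` below is made up.

```r
library(dplyr)

# Toy tokenized corpus: one row per word (made-up values)
speech_corpus <- tibble(
  author    = c("A", "A", "A", "A", "B", "B"),
  docnumber = c(1, 1, 2, 3, 1, 1),
  word      = c("applause", "jobs", "cheers", "taxes", "applause", "applause")
)

pauses_per_speech <- speech_corpus %>%
  group_by(author) %>%
  summarize(
    n_pauses          = sum(word %in% c("applause", "cheers")),  # total pauses
    n_speeches        = n_distinct(docnumber),                   # all speeches
    pauses_per_speech = n_pauses / n_speeches
  )

pauses_per_speech
```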
Thank you for the assistance. So, I no longer get the error. I think the problem was a broken pipe. Regarding the binding, I realized that I had performed the anti-join right before turning the vector into the dtm format, whereas for the tf-idf I was pushing the vector through without it. However, I do have another error. So, here is the code that I am using:
speech_corpus <- bind_rows(Trump_Corpus, Clinton_Corpus, Sanders_Corpus, Repub_Corpus)
mystopwords <- data_frame(word = c("texas", "smith", "cooper", "tianna", "barbara", "freia", "ruline", "miami", "reid", "caroline", "smith", "netanyahu", "michael", "gordon", "gordy", "sharansky", "don't", "that's", "they're", "we're", "mcdowell", "steve", "sanders", "milwaukee", "maine", "jackson", "indiana", "iowa", "september", "dr", "al", "gabby", "jack", "ben", "vermont"))
mystopwords <- bind_rows(stop_words, mystopwords)
speech_td <- speech_corpus %>%
  group_by(author, docnumber) %>%
  filter(word != "applause") %>%
  count(word) %>%
  select(author, word, n, docnumber) %>%
  mutate(docid = paste0(author, docnumber)) %>%
  anti_join(mystopwords)
speech_td
inverse_doc_freq <- speech_td %>%
  bind_tf_idf(word, docid, n) %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word))))
inverse_doc_freq
ggplot(inverse_doc_freq[1:25,], aes(word, tf_idf, fill = author)) +
  geom_bar(alpha = 0.8, stat = "identity", scales = "free") +
  coord_flip()
inverse_doc_freq %>%
  group_by(author) %>%
  top_n(20)%>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = author)) +
  geom_bar(stat = "identity") +
  facet_wrap(~author, scales = "free") +
  coord_flip()
However, I keep getting the contractions in the Sanders result. Is it a problem with the apostrophes? I even tried adding spaces before and after the words to see if that would make a difference but it didn't.
Also, I am working on cleaning up my web scraping process and I just can't seem to get it to work. I think at this point I have just been staring at it too long.
Here is the code:
get_Trump_speeches <- function(x, y){
  mutate(trump_text_url = str_c("http://www.presidency.ucsb.edu/ws/index.php?pid=", x))
  df1 <- read_html(trump_text_url) %>%
    html_nodes("p") %>%
    html_text()
  df2 <- read_html(trump_text_url)%>%
    html_node(".docdate")%>%
    html_text()
  speech <- data_frame(text = df1) %>%
    mutate(author = "Trump",
           docnumber = y,
           parnumber = row_number(),
           date = df2) %>%
    separate(date, into = c("date2", "year"), sep = ",") %>%
    separate(date2, into = c("month", "day"), sep = " ")
  speech <- unnest_tokens(speech, word, text, token = "words")
  return(speech)
}
x = c("119182",
"119181",
"119188",
"119187",
"119186",
"119185",
"119184",
"119183",
"119174",
"119172",
"119180",
"119173",
"119170",
"119169",
"119168",
"119167",
"119166",
"119179",
"119202",
"119201",
"119200",
"119203",
"119191",
"119189",
"119192",
"119207",
"119208",
"119209",
"119190",
"119206",
"119206",
"119193",
"119205",
"119178",
"119204",
"119194",
"119195",
"119177",
"119197",
"119199",
"119198",
"119196",
'119176',
"119175",
"119165",
"119503",
"117935",
"117791",
"117815",
"117790",
"117775",
"117813",
"116597")
y = c("1",
"2",
"3",
"4",
"5",
"6",
"7",
"8",
"9",
"10",
"11",
"12",
"13",
"14",
"15",
"16",
"17",
"18",
"19",
"20",
"21",
"22",
"23",
"24",
"25",
"26",
"27",
"28",
"29",
"30",
"31",
"32",
"33",
"34",
"35",
"36",
"37",
"38",
"39",
"40",
"41",
"42",
"43",
"44",
"45",
"46",
"47",
"48",
"49",
"50",
"51",
"52",
"53")
map2(x, y, get_Trump_speeches)
Also, I just wanted to say thank you sooo much for all your assistance today. You really have taught me a lot, and I am truly appreciative for the continued guidance.
On the first issue, `tidytext` doesn't do anything with contractions. "can't" is a valid token in the eyes of `tidytext`. You'd have to manually remove the contraction, but that loses some important meaning. "We can do this!" is a positive affirmation. "We can't do this!" is a negative affirmation.
If you stick to topic modeling or predicting candidate based on their text, contractions are not a problem. If you want to do sentiment analysis, check out the `replace_contraction` function in `qdap`. It looks like it replaces common contractions with the full term, i.e. "can't" becomes "cannot", "don't" becomes "do not", etc. I've never used it before, but it might prove useful.
mutate(trump_text_url = str_c("http://www.presidency.ucsb.edu/ws/index.php?pid=", x))
This is your problem. `mutate` only works on data frames, but you are just trying to create a string object. Change it to `trump_text_url = str_c("http://www.presidency.ucsb.edu/ws/index.php?pid=", x)` and the function should work.
Also, you're making it inefficient by creating a separate vector for `y`, the document number. Just create it using the `map_df` function, like this:
library(tidyverse)
library(rvest)
library(stringr)
library(tidytext)
get_Trump_speeches <- function(x){
  trump_text_url <- str_c("http://www.presidency.ucsb.edu/ws/index.php?pid=", x)
  df1 <- read_html(trump_text_url) %>%
    html_nodes("p") %>%
    html_text()
  df2 <- read_html(trump_text_url)%>%
    html_node(".docdate")%>%
    html_text()
  speech <- data_frame(text = df1) %>%
    mutate(author = "Trump",
           parnumber = row_number(),
           date = df2) %>%
    separate(date, into = c("date2", "year"), sep = ",") %>%
    separate(date2, into = c("month", "day"), sep = " ")
  speech <- unnest_tokens(speech, word, text, token = "words")
  return(speech)
}
# does it work for a single speech?
get_Trump_speeches(119182)
# okay let's do it for all speeches
## store ids as a numeric vector because they are numbers
x = c(119182,
119181,
119188,
119187,
119186,
119185,
119184,
119183,
119174,
119172,
119180,
119173,
119170,
119169,
119168,
119167,
119166,
119179,
119202,
119201,
119200,
119203,
119191,
119189,
119192,
119207,
119208,
119209,
119190,
119206,
119206,
119193,
119205,
119178,
119204,
119194,
119195,
119177,
119197,
119199,
119198,
119196,
119176,
119175,
119165,
119503,
117935,
117791,
117815,
117790,
117775,
117813,
116597)
# now let's use map_df to iterate over all of them and create an id variable
speeches <- map_df(x, get_Trump_speeches, .id = "docnumber")
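To see what `.id` is doing without hitting the website, here is a small sketch with a stand-in function. `fake_scrape()` is hypothetical and just returns two rows per id.

```r
library(purrr)
library(dplyr)

# Hypothetical stand-in for the scraper: a tiny data frame per id
fake_scrape <- function(pid) {
  tibble(pid = pid, word = c("hello", "world"))
}

ids <- c(119182, 119181, 119188)

# .id adds a column recording which element of `ids` each row came from
speeches <- map_df(ids, fake_scrape, .id = "docnumber")

speeches
```

Because `ids` has no names, `docnumber` is just the position of each id in the vector (1, 2, 3), which is what the separate `y` vector was doing by hand.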
Hey Dr. Soltoff.
I committed and pushed up an R markdown titled Rough Draft. It is all the code on my project up until this point. I looked through it and took notes on how to improve it and where to make things more clear. If you have a chance, I would greatly appreciate feedback on it. I am not sure if you can look at it simply in the repo or if I need to make a pull request.
Thanks Brian
Is there a way to include multiple position arguments, such as jitter and dodge? Also, is there a way to have the position dodge determined by a particular variable?
speech_corpus_affin %>%
  mutate(month = factor(month, levels = month.name))%>%
  group_by(author, docnumber, month, year) %>%
  summarize(sum(score)) %>%
  ggplot(aes(month, `sum(score)`, fill= year)) +
  geom_bar(aes(width = .25), stat = "identity", alpha = .8, position = "dodge") +
  facet_wrap(~author)
Also, I was wondering if you could help me figure out how to mutate the data in order to get the percentage of sentiment at a given time. I am not sure if that makes sense; I am still working out exactly what I am trying to look at. I am trying to figure out the sentiment for each author by month as it changes over time. I'll play around with this, and if you have any ideas or suggestions, I'm all ears.
speech_corpus_affin %>%
  mutate(month = factor(month, levels = month.name))%>%
  group_by(author, docnumber, month, year) %>%
  summarize(sum(score)) %>%
  group_by(author, month, year) %>%
  mutate(sum_sent = sum(month))
I'm not sure why you would want to use dodge and jitter on the same layer. `position = "dodge"` is intended for bar charts. `position = "jitter"` is intended for scatterplots. You shouldn't need them on the same graph. Dodging simply turns a stacked bar chart into a side-by-side one, so each bar begins at the same origin point on the y-axis. I don't understand what you mean by having "position dodge determined by a particular variable".
As for the second question, what is the denominator for the percentage? Right now you have summarized it by adding all the sentiment scores - the negative and positive values cancel out. In order to create a percentage, you need a numerator and a denominator. The numerator would be the aggregated sentiment score, but I don't know conceptually what the denominator should be.
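Just to make the numerator/denominator idea concrete, here is one possible sketch. It assumes the denominator is the number of scored words in each author-month and the numerator is the number of those with a positive score. That is only one choice, and the toy `scored_words` data is made up.

```r
library(dplyr)

# Toy scored words: one row per word with a sentiment score (made-up values)
scored_words <- tibble(
  author = c("A", "A", "A", "A", "B", "B"),
  month  = c("June", "June", "June", "July", "June", "June"),
  score  = c(3, -2, 1, -4, 2, 2)
)

sentiment_share <- scored_words %>%
  group_by(author, month) %>%
  summarize(
    n_words      = n(),              # denominator: scored words in the group
    pct_positive = mean(score > 0),  # share of those words that are positive
    .groups = "drop"
  )

sentiment_share
```

With this definition the positive and negative scores no longer cancel out, because each word is counted rather than summed.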
I am having trouble with rendering the site. I thought everything was in order, and I have the markdowns, and the YAML files I took from the tutorial, yet I get this error:
Error in yaml::yaml.load(string, ...) : Parser error: while parsing a block mapping at line 1, column 1 did not find expected key at line 6, column 3
I was able to knit my markdowns earlier as well, but now this error pops up as well
This happens after running `rmarkdown::render_site()`?
This happened after I tried to update my tabs in the YAML file
Copy and paste the YAML content you tried to add here.
name:"Textual Analysis of the 2016 Presidential Campaign Speeches"
output_dir: "."
navbar:
title:"I Know Words, I Have The Best Words"
left:
- text: "Home"
href: index.html
- text: "About"
href: about.html
- text: "Sentiment Analysis"
href: Sentiment.html
- text: "Topic Modeling"
href: Topic_Modeling.html
name: "Textual Analysis of the 2016 Presidential Campaign Speeches"
output_dir: "."
navbar:
  title: "I Know Words, I Have The Best Words"
  left:
    - text: "Home"
      href: index.html
    - text: "About"
      href: about.html
    - text: "Sentiment Analysis"
      href: Sentiment.html
    - text: "Topic Modeling"
      href: Topic_Modeling.html
You need to add spaces between `name:` and `"Textual...`. Same thing for the title.
Alright, I updated that and I get the same error message when running `render_site()` as well as when attempting to knit my document.
Make sure the YAML file is saved as `_site.yml`. When I used the YAML file above, your site rendered fine (well, I got a different error related to one of your plots, but that is an entirely different issue).
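One way to check the YAML on its own, before rendering the whole site, is to parse it directly with the `yaml` package. This is just a sketch: the `site_yaml` string below is a shortened, made-up version of the config, not your actual file.

```r
library(yaml)

# Shortened stand-in for the site config (note the space after each colon)
site_yaml <- '
name: "My Site"
output_dir: "."
navbar:
  title: "My Title"
  left:
    - text: "Home"
      href: index.html
'

# If this parses without an error, the YAML syntax itself is fine
config <- yaml.load(site_yaml)

config$navbar$left[[1]]$href
```

For the file on disk, `yaml::read_yaml("_site.yml")` does the same check. If that call throws the parser error, the problem is in the YAML rather than in the R Markdown documents.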
Duh, I need to save it. Do you know which plot it is? That way I can take a look at it while this thing is rendering?
Nope, because your chunks are so large there are multiple plots in each one. Break it into smaller chunks and then when you try to knit the document it will tell you which chunk caused the error.
okie doke. will do
Not sure if this is the correct place to post a question, but here it goes: I was reviewing the topic modeling code we went over in class as I was trying to figure out how to write the code for my final project.
In this chunk of code
what is the control doing, exactly? The help file just says it controls the parameters.
Conceptually, my data is already in a tidy text format. Is it redundant for me to run it through the `cast_dtm` function just to have it in the tidy format, or do I need the `cast_dtm` function in order to pass the document-term matrix through the `LDA` function?
So, this chunk of code:
It organizes the terms with the highest beta within each topic, and this is how we determine the topics, from the words with the highest beta?
Writing my own code, I decided to adapt the code from tidy text just to get a sense of what I am doing, and I simply got two topics of stop words, even though I thought the anti-join got rid of the stop words. In addition, I was wondering how to add stop words like Hillary, since I really want to tease apart differences in policy and politics.
speech_dtm <- speech_td %>%
  anti_join(stop_words, by = c(term = "word")) %>%
  cast_dtm(word, n, docid)
speech_dtm
speech_lda <- LDA(speech_dtm, k = 2)
speech_lda_td <- tidy(speech_lda)
top_terms <- speech_lda_td %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
top_terms