cfss-old / fp-brianp

fp-brianp1 created by GitHub Classroom

Questions on Topic Modeling #1

Open brianp1 opened 7 years ago

brianp1 commented 7 years ago

Not sure if this is the correct place to post a question, but here goes: I was reviewing the topic modeling code we went over in class while trying to figure out how to write the code for my final project.

  1. In this chunk of code

    library(topicmodels)
    chapters_lda <- LDA(chapters_dtm, k = 4, control = list(seed = 1234))
    chapters_lda

    what is the control argument doing, exactly? The help file just says it controls the parameters.

  2. Conceptually, my data is already in a tidy text format. Is it redundant for me to run it through the cast_dtm function, or do I need the cast_dtm function in order to pass a document-term matrix through the LDA function?

  3. So, this chunk of code:

    top_terms <- chapters_lda_td %>%
      group_by(topic) %>%
      top_n(5, beta) %>%
      ungroup() %>%
      arrange(topic, -beta)
    top_terms

    It pulls out the terms with the highest beta within each topic, and we use those highest-beta words to figure out what each topic represents. Is that right?

  4. Writing my own code, I decided to try to format the code from tidy text just to get a sense of what I am doing, and I simply got two topics of stop words, even though I thought the anti-join got rid of the stop words. In addition, I was wondering how to add custom stop words like "hillary", since I really want to tease apart differences in policy and politics.

    
    speech_td <- speech_corpus %>%
      group_by(author, docnumber) %>%
      count(word) %>%
      select(author, word, n, docnumber)
    speech_td

speech_dtm <- speech_td %>%
  anti_join(stop_words, by = c(term = "word")) %>%
  cast_dtm(word, n, docid)
speech_dtm

speech_lda <- LDA(speech_dtm, k = 2)

speech_lda_td <- tidy(speech_lda)

top_terms <- speech_lda_td %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
top_terms

bensoltoff commented 7 years ago
  1. This would specify additional controls for the LDA function. Run `?LDAcontrol-class` in the console to get a list of potential options for `control`. Note that you shouldn't need to change any of them for your model to run. At most, adjust `k` to control the number of topics in the model.
  2. LDA() won't work with a tidytext data frame. It requires a document-term matrix, so you have to convert your tidy data using cast_dtm() in order to estimate an LDA model.
  3. This identifies the words with the strongest association with the given topic. Remember that LDA doesn't tell you what the topic actually is, it just identifies them as Topic 1, Topic 2, etc. By looking at the words most strongly associated with the topic, you can attempt to label the topic given your knowledge of the words.
  4. What is speech_corpus? Has this already tokenized the text? If so, what is the output of this?
speech_td %>%
  anti_join(stop_words, by = c(term = "word"))

You should get a data frame with the stop words removed. You can look directly at stop_words to see what terms it includes. However, with only 2 topics in the model, any remaining stop words might form the dominant topic structure. You could try to avoid this by using a tf-idf-weighted dtm instead of a term-frequency-weighted one. To do this, change cast_dtm to cast_dtm(word, n, docid, weighting = tm::weightTfIdf). This uses the weightTfIdf function from the tm library to adjust the weights according to how frequently a term appears across all documents.
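
For reference, a minimal sketch of that change, assuming speech_td has columns docid, word, and n (note that cast_dtm() expects its arguments in document, term, value order):

speech_dtm <- speech_td %>%
  anti_join(stop_words, by = "word") %>%                  # drop rows whose word is a stop word
  cast_dtm(docid, word, n, weighting = tm::weightTfIdf)   # tf-idf-weighted dtm
speech_dtm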

brianp1 commented 7 years ago

Thank you. With your assistance, I was able to get my first 4 topics modeled. Not shockingly, the topics all resemble american, people, country, and president. So, I need to go back, modify my stop words, and try the tf-idf. A couple more questions. I was updating my sentiment graphs when this result dawned on me:

[image: percent between candidates]

All this is demonstrating is that I have significantly more Trump speeches than any other candidate. This will also be a problem when examining the topics of the campaign. So, I have a few options.

  1. I can match the number of speeches between Sanders, Clinton, and Trump by randomly selecting 17 speeches from each, both for this particular analysis and for the topic modeling of the entire campaign.
  2. I can just accept the sheer number of speeches and possibly write it up as reflecting the amount of information that the voting public was subjected to.
  3. I can ignore these analyses and simply run topic modeling for each given candidate and then compare the candidates. I was planning on doing this anyway, but it would take the forefront of the topic modeling analysis.
  4. I can try to control for a politician who simply has more words by, for example, dividing by the number of documents to get something like an average sentiment per speech.

In addition, I was wondering in what capacity I could try n-grams. They may not be incredibly beneficial for the sentiment aspect, but would they be useful for the topic modeling?

bensoltoff commented 7 years ago

A differential number of speeches per candidate isn't a bad thing. In your graphs, present the bars as the percentage of each candidate's speeches allocated to each emotion, rather than a raw frequency count. This will normalize for the total number of words in each candidate's corpus and allow you to compare the relative affective content between candidates.
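
A hypothetical sketch of that normalization, assuming the tidyverse is loaded and you have a tidy frame speech_nrc with one row per author-word-sentiment match (the names are illustrative, not from your code):

library(ggplot2)

speech_nrc %>%
  count(author, sentiment) %>%
  group_by(author) %>%
  mutate(pct = n / sum(n)) %>%                      # share of each candidate's sentiment words
  ggplot(aes(sentiment, pct, fill = author)) +
  geom_bar(stat = "identity", position = "dodge")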

You could use n-grams for topic modeling (not sentiment analysis, at least not in any easy manner that can be done before the project is due), especially if key phrases or slogans are used repeatedly (#MAGA).
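
If you try it, the tokenization itself is a one-liner in tidytext (speech_raw here is a hypothetical data frame with a text column; assumes the tidyverse is loaded):

library(tidytext)

speech_bigrams <- speech_raw %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)   # two-word tokens instead of single words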

brianp1 commented 7 years ago

I'll look through the n-gram literature in a minute. So, I was doing the inverse document frequency, and I went back to try to clean it up (get rid of names, state names, and contractions), so I decided to create and add my own list of stop words:

mystopwords <- data_frame(word = c("texas", "smith", "cooper", "tianna", "barbara", "freia", "ruline", "miami", "reid", "caroline", "smith", "netanyahu", "michael", "gordon", "gordy", "sharansky", "don't", "that's", "they're", "we're", "mcdowell", "steve", "sanders", "milwaukee", "maine", "jackson", "indiana"))

But, when I go to filter the document to make sure these words are no longer there, I get this error message:

Error in eval(substitute(expr), envir, enclos) : corrupt 'grouped_df', contains 116895 rows, and 331040 rows in groups

I am not sure what is going on.

Also, I am trying to get an average of this count: essentially, the number of times there is a pause for chanting or applauding, divided by the number of speeches given. Here is where I am at so far, but I don't think I am on the right track, seeing as I want to divide the number of pauses by the total number of speeches:

speech_corpus %>%
  group_by(author) %>%
  filter(word == "applause" | word == "cheers") %>%
  count() %>%
  kable()

speech_corpus %>%
  group_by(author, docnumber) %>%
  filter(word == "applause" | word == "cheers") %>%
  count() %>%
  mutate(app_sum = sum(n))  %>%
  mean(n)

Sorry if this is vague; I can try to be more specific.

brianp1 commented 7 years ago

Also, is the seed variable just a random number generator or is it something I am supposed to calculate?

bensoltoff commented 7 years ago

The seed just initializes the random number generator so that your results are reproducible; it isn't something you calculate. Set it once at the beginning of the script (set.seed(1234)) and you are done.
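
A trivial illustration:

set.seed(1234)   # fix the RNG state once, at the top of the script
sample(10, 3)    # this draw now returns the same three numbers on every run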

What exactly is the code you are using to merge mystopwords with your corpus?

For the last question, I think this code will work (it does in my head at least):

speech_corpus %>%
  group_by(author, docnumber) %>%
  filter(word == "applause" | word == "cheers") %>%
  count() %>%
  group_by(author) %>%
  mutate(n_per_speech = n / n())
brianp1 commented 7 years ago

Thank you for the assistance. I no longer get the error; I think the problem was a broken pipe. Regarding the binding, I realized that I had performed the anti-join right before turning the vector into the dtm format, whereas for the tf_idf I was pushing the vector straight through. However, I do have another error. Here is the code that I am using:

speech_corpus <- bind_rows(Trump_Corpus, Clinton_Corpus, Sanders_Corpus, Repub_Corpus)
mystopwords <- data_frame(word = c("texas", "smith", "cooper", "tianna", "barbara", "freia", "ruline", "miami", "reid", "caroline", "smith", "netanyahu", "michael", "gordon", "gordy", "sharansky", "don't", "that's", "they're", "we're", "mcdowell", "steve", "sanders", "milwaukee", "maine", "jackson", "indiana", "iowa", "september", "dr", "al", "gabby", "jack", "ben", "vermont"))
mystopwords <- bind_rows(stop_words, mystopwords)

speech_td <- speech_corpus %>%
  group_by(author, docnumber) %>%
  filter(word != "applause") %>%
  count(word) %>%
  select(author, word, n, docnumber) %>%
  mutate(docid = paste0(author, docnumber)) %>%
  anti_join(mystopwords)
speech_td

inverse_doc_freq <- speech_td %>%
  bind_tf_idf(word, docid, n) %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word))))
inverse_doc_freq

ggplot(inverse_doc_freq[1:25,], aes(word, tf_idf, fill = author)) +
  geom_bar(alpha = 0.8, stat = "identity", scales = "free") +
  coord_flip()

inverse_doc_freq %>%
  group_by(author) %>%
  top_n(20)%>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = author)) +
  geom_bar(stat = "identity") +
  facet_wrap(~author, scales = "free") +
  coord_flip()

However, I keep getting the contractions in the Sanders results. Is it a problem with the apostrophes? I even tried adding spaces before and after the words to see if that would make a difference, but it didn't.


Also, I am working on cleaning up my web scraping process and I just can't seem to get it to work. I think at this point I have just been staring at it too long.

Here is the code:

get_Trump_speeches <- function(x, y){
  mutate(trump_text_url = str_c("http://www.presidency.ucsb.edu/ws/index.php?pid=", x))

  df1 <- read_html(trump_text_url) %>%
    html_nodes("p") %>%
    html_text()

  df2 <- read_html(trump_text_url)%>%
    html_node(".docdate")%>%
    html_text()

  speech <- data_frame(text = df1) %>%
    mutate(author = "Trump",
           docnumber = y,
           parnumber = row_number(),
           date = df2) %>%
    separate(date, into = c("date2", "year"), sep = ",") %>%
    separate(date2, into = c("month", "day"), sep = " ")
  speech <- unnest_tokens(speech, word, text, token = "words")  
  return(speech)
}  
x = c("119182",  
  "119181",
  "119188",
  "119187",
  "119186",  
  "119185",  
  "119184",  
  "119183",
  "119174",
  "119172",  
  "119180",  
  "119173",  
  "119170",  
  "119169",  
  "119168",  
  "119167",  
  "119166",  
  "119179",  
  "119202",  
  "119201",  
  "119200",  
  "119203",  
  "119191",  
  "119189",  
  "119192",  
  "119207",  
  "119208",  
  "119209", 
  "119190",  
  "119206",  
  "119206",  
  "119193",  
  "119205",  
  "119178",  
  "119204",  
  "119194",  
  "119195",  
  "119177",  
  "119197",  
  "119199",  
  "119198",  
  "119196",  
  '119176',  
  "119175",  
  "119165",  
  "119503",  
  "117935",  
  "117791",  
  "117815",  
  "117790",  
  "117775",  
  "117813",  
  "116597")
y = c("1",
      "2",
      "3",
      "4",
      "5",
      "6",
      "7",
      "8",
      "9",
      "10",
      "11",
      "12",
      "13",
      "14",
      "15",
      "16",
      "17",
      "18",
      "19",
      "20",
      "21",
      "22",
      "23",
      "24",
      "25",
      "26",
      "27",
      "28",
      "29",
      "30",
      "31",
      "32",
      "33",
      "34",
      "35",
      "36",
      "37",
      "38",
      "39",
      "40",
      "41",
      "42",
      "43",
      "44",
      "45",
      "46",
      "47",
      "48",
      "49",
      "50",
      "51",
      "52",
      "53")
map2(x, y, get_Trump_speeches)

Also, I just wanted to say thank you so much for all your assistance today. You really have taught me a lot, and I am truly appreciative of the continued guidance.

bensoltoff commented 7 years ago

On the first issue, tidytext doesn't do anything with contractions. "can't" is a valid token in the eyes of tidytext. You'd have to manually remove the contraction, but that loses some important meaning. "We can do this!" is a positive affirmation. "We can't do this!" is a negative affirmation.

If you stick to topic modeling or predicting the candidate based on their text, contractions are not a problem. If you want to do sentiment analysis, check out the replace_contraction function in qdap. It looks like it replaces common contractions with the full term, e.g. "can't" becomes "cannot", "don't" becomes "do not", etc. I've never used it before, but it might prove useful.
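
An untested sketch of that idea (replace_contraction() lives in qdap; the exact output formatting may differ):

library(qdap)
replace_contraction("We can't do this! We don't want to.")
# expected along the lines of: "We cannot do this! We do not want to."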

bensoltoff commented 7 years ago
mutate(trump_text_url = str_c("http://www.presidency.ucsb.edu/ws/index.php?pid=", x))

This is your problem. mutate() only works on data frames, but you are just trying to create a string object. Change it to trump_text_url <- str_c("http://www.presidency.ucsb.edu/ws/index.php?pid=", x) and the function should work.

Also, you're making this inefficient by creating a separate vector y for the document number. Just create the id using the map_df() function, like this:

library(tidyverse)
library(rvest)
library(stringr)
library(tidytext)

get_Trump_speeches <- function(x){
  trump_text_url <- str_c("http://www.presidency.ucsb.edu/ws/index.php?pid=", x)

  df1 <- read_html(trump_text_url) %>%
    html_nodes("p") %>%
    html_text()

  df2 <- read_html(trump_text_url)%>%
    html_node(".docdate")%>%
    html_text()

  speech <- data_frame(text = df1) %>%
    mutate(author = "Trump",
           parnumber = row_number(),
           date = df2) %>%
    separate(date, into = c("date2", "year"), sep = ",") %>%
    separate(date2, into = c("month", "day"), sep = " ")
  speech <- unnest_tokens(speech, word, text, token = "words")  
  return(speech)
}

# does it work for a single speech?
get_Trump_speeches(119182)

# okay let's do it for all speeches
## store ids as a numeric vector because they are numbers
x = c(119182,
  119181,
  119188,
  119187,
  119186,
  119185,
  119184,
  119183,
  119174,
  119172,
  119180,
  119173,
  119170,
  119169,
  119168,
  119167,
  119166,
  119179,
  119202,
  119201,
  119200,
  119203,
  119191,
  119189,
  119192,
  119207,
  119208,
  119209,
  119190,
  119206,
  119206,
  119193,
  119205,
  119178,
  119204,
  119194,
  119195,
  119177,
  119197,
  119199,
  119198,
  119196,
  119176,
  119175,
  119165,
  119503,
  117935,
  117791,
  117815,
  117790,
  117775,
  117813,
  116597)

# now let's use map_df to iterate over all of them and create an id variable
speeches <- map_df(x, get_Trump_speeches, .id = "docnumber")
brianp1 commented 7 years ago

Hey Dr. Soltoff.

I committed and pushed up an R Markdown file titled Rough Draft. It is all the code for my project up to this point. I looked through it and took notes on how to improve it and where to make things clearer. If you have a chance, I would greatly appreciate feedback on it. I am not sure if you can look at it directly in the repo or if I need to make a pull request.

Thanks, Brian

brianp1 commented 7 years ago

Is there a way to include multiple position arguments, such as jitter and dodge? Also, is there a way to have the position dodge determined by a particular variable?

speech_corpus_affin %>%
  mutate(month = factor(month, levels = month.name))%>%
  group_by(author, docnumber, month, year) %>%
  summarize(sum(score)) %>%
  ggplot(aes(month, `sum(score)`, fill= year)) +
  geom_bar(aes(width = .25), stat = "identity", alpha = .8, position = "dodge") +
  facet_wrap(~author)

Also, I was wondering if you could help me figure out the function by which I would mutate the data in order to get the percentage of sentiment at a given time. I am not sure if that makes sense, so I'll try to pin down what I am looking at: the sentiment from each author in each month, as it changes over time. I'll play around with this, and if you have any ideas or suggestions, I'm all ears.

speech_corpus_affin %>%
  mutate(month = factor(month, levels = month.name))%>%
  group_by(author, docnumber, month, year) %>%
  summarize(sum(score)) %>%
  group_by(author, month, year) %>%
  mutate(sum_sent = sum(month))
bensoltoff commented 7 years ago

I'm not sure why you would want to use dodge and jitter on the same layer. position = "dodge" is intended for bar charts; position = "jitter" is intended for scatterplots. You shouldn't need them on the same graph. Dodging the position simply splits a stacked bar chart into a dodged bar chart so that each bar begins at the same origin point on the y-axis; see the quick illustration below. I don't understand what you mean by having "position dodge determined by a particular variable".
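
A quick illustration of the two layouts (df is a hypothetical data frame with month, n, and author columns):

library(ggplot2)

ggplot(df, aes(month, n, fill = author)) +
  geom_bar(stat = "identity")                      # stacked: segments pile up within one bar

ggplot(df, aes(month, n, fill = author)) +
  geom_bar(stat = "identity", position = "dodge")  # dodged: bars sit side by side from a common baseline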

As for the second question, what is the denominator for the percentage? Right now you have summarized by adding all the sentiment scores, so the negative and positive values cancel out. In order to create a percentage, you need a numerator and a denominator. The numerator would be the aggregated sentiment score, but I don't know conceptually what the denominator should be.
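
Purely to illustrate that numerator/denominator structure (this picks one arbitrary denominator, the total magnitude of sentiment words per author-month; not a recommendation):

speech_corpus_affin %>%
  group_by(author, year, month) %>%
  summarize(net = sum(score),              # numerator: aggregated sentiment
            total = sum(abs(score))) %>%   # one possible denominator: total sentiment magnitude
  mutate(pct = net / total)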

brianp1 commented 7 years ago

I am having trouble rendering the site. I thought everything was in order (I have the markdowns and the YAML files I took from the tutorial), yet I get this error:

Error in yaml::yaml.load(string, ...) : Parser error: while parsing a block mapping at line 1, column 1
did not find expected key at line 6, column 3

I was able to knit my markdowns earlier, but now this error pops up there as well.

bensoltoff commented 7 years ago

This happens after running rmarkdown::render_site()?

brianp1 commented 7 years ago

This happened after I tried to update my tabs in the YAML file

bensoltoff commented 7 years ago

Copy and paste the YAML content you tried to add here.

brianp1 commented 7 years ago
name:"Textual Analysis of the 2016 Presidential Campaign Speeches"
output_dir: "."
navbar:
  title:"I Know Words, I Have The Best Words"
  left:
    - text: "Home"
      href: index.html
    - text: "About"
      href: about.html
    - text: "Sentiment Analysis"
      href: Sentiment.html
    - text: "Topic Modeling"
      href: Topic_Modeling.html
bensoltoff commented 7 years ago
name: "Textual Analysis of the 2016 Presidential Campaign Speeches"
output_dir: "."
navbar:
  title: "I Know Words, I Have The Best Words"
  left:
    - text: "Home"
      href: index.html
    - text: "About"
      href: about.html
    - text: "Sentiment Analysis"
      href: Sentiment.html
    - text: "Topic Modeling"
      href: Topic_Modeling.html

You need to add a space between name: and "Textual.... Same thing for the title.

brianp1 commented 7 years ago

Alright, I updated that, and I get the same error message when running render_site() as well as when attempting to knit my document.

bensoltoff commented 7 years ago

Make sure the YAML file is saved as _site.yml. When I used the YAML file above, your site rendered fine (well, I got a different error related to one of your plots but that is an entirely different issue).

brianp1 commented 7 years ago

Duh, I need to save it. Do you know which plot it is? That way I can take a look at it while this thing is rendering.

bensoltoff commented 7 years ago

Nope, because your chunks are so large that there are multiple plots in each one. Break them into smaller chunks, and then when you try to knit the document it will tell you which chunk caused the error.

brianp1 commented 7 years ago

Okie doke, will do.