Annie's Chpt 17 Notes

Line 65: What do you mean by "simulate" when it comes to text data? This chapter does not appear to follow the same workflow as previous chapters.
Line 67: The chapter does not include regression or word embeddings. Are these still to come?
Codeblock @ line 157: When removing stop words, are adjacent stop words meant to remain in the string? The original code was not removing them because of spacing issues (see the output in the book), so I added a few extra cleaning steps on the assumption that they are meant to be removed. If that is not the case, the text should explain why some stop words are allowed to remain in the data.
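For reference, here is a minimal base R sketch of one way a spacing issue like this can arise. It is an assumption about the mechanism, not the book's code, and the example text and stop-word list are hypothetical. A pattern that requires a space on both sides of each stop word consumes the shared space, so a stop word that directly follows another is skipped. A word-boundary pattern avoids this.

```r
# Hypothetical example text and stop-word list (not the book's data).
text <- "this is the story of a cat and the hat"
stop_words <- c("this", "is", "the", "of", "a", "and")
alternatives <- paste(stop_words, collapse = "|")

# Space-delimited pattern: each match consumes the spaces around it,
# so the stop word immediately after a match cannot match.
space_pattern <- paste0(" (", alternatives, ") ")
space_result <- trimws(gsub(space_pattern, " ", paste0(" ", text, " "), perl = TRUE))

# Word-boundary pattern: \\b matches a position and consumes no spaces,
# so adjacent stop words are all removed.
boundary_pattern <- paste0("\\b(", alternatives, ")\\b")
boundary_result <- gsub(boundary_pattern, "", text, perl = TRUE)
# Collapse the leftover runs of whitespace.
boundary_result <- trimws(gsub("\\s+", " ", boundary_result))

print(space_result)    # some stop words survive
print(boundary_result) # no stop words survive
```

If adjacent stop words are meant to be removed, something like the word-boundary version may be enough without the extra cleaning steps.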
Codeblock @ line 282: This was not a very informative example for n-grams, so I changed it to some text from Don Quixote.
Line 296: The explanation of "Canadians", "Canadian", and "Canada" here does not match the output of char_wordstem(c("Canadians", "Canadian", "Canada")).
Line 743: I don’t think the Dirichlet distribution can be explained theoretically in a meaningful way in this amount of time. Could you give a more applied overview of the distribution instead? Otherwise I would skip this part.
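One applied angle might be simulation. The sketch below is only a suggestion in base R, with a hypothetical helper name (rdirichlet_sketch) and arbitrary parameter values. It draws topic proportions from a Dirichlet distribution by normalising independent gamma draws, then compares a small concentration parameter (documents dominated by one topic) with a large one (documents spread evenly across topics).

```r
set.seed(853)

# Hypothetical helper: draw n vectors from Dirichlet(alpha) by
# normalising independent Gamma(alpha_i, 1) draws so each row sums to 1.
rdirichlet_sketch <- function(n, alpha) {
  k <- length(alpha)
  # byrow = TRUE lines up the recycled shape values with the columns.
  draws <- matrix(rgamma(n * k, shape = alpha), nrow = n, ncol = k, byrow = TRUE)
  draws / rowSums(draws)
}

num_topics <- 5
# Small alpha: most of each document's weight lands on one topic.
sparse_mix <- rdirichlet_sketch(1000, rep(0.1, num_topics))
# Large alpha: weight is spread fairly evenly across topics.
even_mix <- rdirichlet_sketch(1000, rep(10, num_topics))

# Average share of the single largest topic in each setting.
mean_top_share_sparse <- mean(apply(sparse_mix, 1, max))
mean_top_share_even <- mean(apply(even_mix, 1, max))
print(round(head(sparse_mix, 3), 2))
print(round(head(even_mix, 3), 2))
```

Showing a few simulated rows like this could convey what the distribution does in a topic model without the theory.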
Section 17.4.1 ("What is talked about in Canadian parliament?"): It would be useful to demonstrate the results of testing different K values with stm() in this section, and potentially a test and training set process like the one mentioned at the end of section 17.4.
In general, I think a lot of the examples in this chapter are not contextualized enough. It would be useful to give more concrete suggestions about how the cleaning steps may come in handy, or how the techniques could be used to draw conclusions about a body of text. For example, how might you analyze Table 17.1, or the TF, IDF, and TF-IDF scores themselves? Is there something that could be drawn from your work with Callie for the Topic Models section?
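As an illustration of the kind of interpretation I mean, here is a small base R sketch. The toy documents are hypothetical, and the weighting (raw counts multiplied by log10(N / df)) is an assumption that may differ from the chapter's. It shows that a word appearing in every document scores zero however often it is used, while the top-scoring term in each document is the one that distinguishes it.

```r
# Hypothetical toy documents (not the book's horoscope data).
docs <- list(
  aries = c("bold", "energy", "today", "love"),
  taurus = c("steady", "love", "today", "today"),
  gemini = c("curious", "energy", "today")
)
vocab <- sort(unique(unlist(docs)))

# Term frequency: rows are documents, columns are terms.
tf <- t(vapply(
  docs,
  function(d) as.numeric(table(factor(d, levels = vocab))),
  numeric(length(vocab))
))
colnames(tf) <- vocab

# Document frequency, then inverse document frequency (assumed log10(N / df)).
doc_freq <- colSums(tf > 0)
idf <- log10(length(docs) / doc_freq)

# TF-IDF: scale each term's column by its IDF.
tf_idf <- sweep(tf, 2, idf, `*`)

# The highest-scoring term in each document.
top_terms <- apply(tf_idf, 1, function(scores) names(which.max(scores)))
print(round(tf_idf, 2))
print(top_terms)
```

A sentence or two in the chapter reading scores in this way would help show what they are for.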
Have you considered including a section on sentiment analysis? That might be interesting with the horoscope data.

RohanAlexander / telling_stories

Annie's Chpt 17 Notes #50