suggestion: stem before some analyses

dgrtwo / tidy-text-mining

Manuscript of the book "Tidy Text Mining with R" by Julia Silge and David Robinson

http://tidytextmining.com

suggestion: stem before some analyses #21

Closed eijoac closed 7 years ago

eijoac commented 7 years ago

For some of the analyses in the book, it would be better to stem the words first. For example, in the analysis of inauguration speeches in chapter 6, it makes more sense to group together words like job/jobs, union/unions, constitution/constitutions, etc. before calculating tf-idf and plotting word frequencies over time.
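A minimal sketch of that idea, assuming the SnowballC package for stemming (tidytext does not stem by itself). The `speeches` data frame and its `president` and `text` columns are hypothetical stand-ins, not the book's inaugural-address data:

```r
library(dplyr)
library(tidytext)
library(SnowballC)

# Hypothetical toy data standing in for the inauguration speeches
speeches <- tibble(
  president = c("A", "A", "B"),
  text = c("jobs and the union",
           "a job for the unions",
           "the constitution and constitutions")
)

stemmed <- speeches %>%
  unnest_tokens(word, text) %>%            # one row per word
  mutate(stem = wordStem(word)) %>%        # Porter stemmer: jobs -> job, unions -> union
  count(president, stem, sort = TRUE) %>%  # count stems instead of raw words
  bind_tf_idf(stem, president, n)          # tf-idf computed on the stemmed terms

stemmed
```

Stems are not always dictionary words, so for plots it may be worth mapping each stem back to its most frequent original form.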

I understand that stemming is not integrated in the tidytext package for a good reason (https://github.com/juliasilge/tidytext/issues/17). Perhaps that's why you try to avoid stemming in the book?

xkuang commented 7 years ago

Sentiment analysis by word

"Error in summarize(., occurences = n(), contribution = sum(score)) : argument "by" is missing, with no default"

dgrtwo commented 7 years ago

@xkuang This isn't related to this issue, but the problem is that you have another package loaded after dplyr (most likely Hmisc) whose summarize function masks dplyr's. If you type summarize at the console you'll see which version is being used; restarting R and loading Hmisc before dplyr would fix it. See here for more!
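A small sketch of how one might check for the masking and sidestep it. The `dplyr::` prefix is an alternative to changing the load order, not something suggested above, and the `scores` data frame is a hypothetical stand-in for the sentiment-scored words:

```r
library(dplyr)

# Which package does the summarize found first on the search path come from?
# With only dplyr attached this is "dplyr"; if Hmisc was loaded later it is not.
environmentName(environment(summarize))

# Hypothetical word scores standing in for the joined sentiment data
scores <- tibble(word = c("good", "good", "bad"), score = c(3, 3, -3))

# Naming the namespace explicitly works regardless of load order
result <- scores %>%
  group_by(word) %>%
  dplyr::summarize(occurences = n(), contribution = sum(score))

result
```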

juliasilge commented 7 years ago

I'm going to close this issue, because although stemming is an important NLP task, there are other packages that implement it in R and we don't focus on it in this book. We do plan to add examples with stemming in a vignette for tidytext eventually.