dgrtwo / tidy-text-mining

Manuscript of the book "Tidy Text Mining with R" by Julia Silge and David Robinson
http://tidytextmining.com

Question about grouping text for pairwise cor and tf idf #97

Closed: GabriellaS-K closed this issue 2 years ago

GabriellaS-K commented 2 years ago

I'm trying to learn the pairwise_cor() function in the widyr package, following the example in this book, Text Mining with R. In the example, you divide each book into sections and calculate correlations within those sections. However, I would like to look at words across all books, not by comparing words between books. If I wanted to see how often words are correlated throughout the whole dataset of books, to get a sense of how correlated words are in the text generally rather than splitting it by section, how could I do this?

I have a similar question about tf-idf: would there be a way to find the most common rare words in a text overall, rather than by some grouping? I am looking at survey responses and want to know this overall, not by person.

Thank you so much for this fantastic resource

juliasilge commented 2 years ago

When it comes to tf-idf, by definition it applies to documents within a corpus, so you have to think about what you mean by "document" and "corpus" for tf-idf to be a meaningful statistic in your analysis. If you are looking at your survey responses and you want a sense of the overall most important words (not differences by person), you may just want to look at word frequencies, i.e. the most common words used across all the survey responses.
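Here is a minimal sketch of that overall-frequency approach. The `responses` data frame and its `respondent_id` and `text` columns are hypothetical placeholders standing in for your survey data:

```r
library(dplyr)
library(tidytext)

# `responses` is a hypothetical data frame with one row per survey answer,
# containing a `respondent_id` column and a `text` column.
responses %>%
  unnest_tokens(word, text) %>%           # one row per word
  anti_join(stop_words, by = "word") %>%  # drop common stop words
  count(word, sort = TRUE)                # overall word frequencies across all responses
```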

If you have survey responses, then each response would be the "feature" that you would want to use for pairwise_count() and pairwise_cor(). You would want to count up or compute the correlation of words that separate survey respondents used. You'll end up with correlations or co-occurrences like this; that analysis was done with exactly this kind of per-survey-respondent counting.
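A sketch of that per-response correlation approach, again assuming the hypothetical `responses` data frame from above; the minimum-count filter mirrors the one in the book's example and the threshold of 10 is just a placeholder:

```r
library(dplyr)
library(tidytext)
library(widyr)

word_cors <- responses %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  group_by(word) %>%
  filter(n() >= 10) %>%                            # keep words used at least 10 times overall
  pairwise_cor(word, respondent_id, sort = TRUE)   # how often words appear in the same response

word_cors
```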

GabriellaS-K commented 2 years ago

Thank you, that is very helpful!!