dgrtwo / tidy-text-mining

Manuscript of the book "Tidy Text Mining with R" by Julia Silge and David Robinson
http://tidytextmining.com

Update 04-word-combinations.Rmd #108

Closed aratikrish closed 1 year ago

aratikrish commented 1 year ago

Added na.rm = TRUE to the unite() call so that bigrams containing only one word do not end up with NAs in them.

juliasilge commented 1 year ago

Thank you so much for the heads up here @arati2020! I decided to use filter() to remove both the NA bigrams and trigrams.

aratikrish commented 1 year ago

Hi Julia, I just realized that you need neither the filter you added nor the na.rm I had suggested. You will not have any bigrams or trigrams with NA in them, because you are only extracting bigrams and trigrams. The NA issue arises only when you unnest into ngrams with n_min < n and then separate and unite.
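
To illustrate the case described above, here is a minimal sketch. It assumes a hand-built tibble (`ngrams`, hypothetical) standing in for the output of unnesting with n = 2 and n_min = 1, so that one row holds a single word. It uses only `separate()` and `unite()` from tidyr and is not taken from the book's code.

```r
library(tibble)
library(tidyr)

# Hypothetical stand-in for ngrams unnested with n = 2 and n_min = 1:
# the first row contains only one word.
ngrams <- tibble(ngram = c("sir", "sir thomas"))

# fill = "right" pads the missing second word with NA (and avoids a warning).
separated <- separate(ngrams, ngram, c("word1", "word2"),
                      sep = " ", fill = "right")

# By default, unite() pastes the NA in as the literal string "NA".
united_default <- unite(separated, ngram, word1, word2, sep = " ")

# With na.rm = TRUE, missing values are dropped before uniting.
united_na_rm <- unite(separated, ngram, word1, word2, sep = " ", na.rm = TRUE)
```

Here `united_default$ngram` contains "sir NA" for the one-word row, while `united_na_rm$ngram` contains just "sir".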

juliasilge commented 1 year ago

Hmmm, I don't think so. Here are the results at line ~50 without the filter():

library(dplyr)
library(tidyr)
library(tidytext)  # provides stop_words

bigrams_separated <- austen_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

# new bigram counts:
bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

bigram_counts
#> # A tibble: 28,975 × 3
#>    word1   word2         n
#>    <chr>   <chr>     <int>
#>  1 <NA>    <NA>      12242
#>  2 sir     thomas      266
#>  3 miss    crawford    196
#>  4 captain wentworth   143
#>  5 miss    woodhouse   143
#>  6 frank   churchill   114
#>  7 lady    russell     110
#>  8 sir     walter      108
#>  9 lady    bertram     101
#> 10 miss    fairfax      98
#> # … with 28,965 more rows

aratikrish commented 1 year ago

Ah, got it, and I agree that the filter is needed here. This is a different issue from the one I was encountering with ngrams, where only one of the words is NA; that should not arise here.
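
As a rough sketch of why the `<NA> <NA>` row in the output above survives the stop-word filters: `NA %in% x` evaluates to `FALSE`, so `!word1 %in% stop_words$word` is `TRUE` for NA rows and they are kept. The data below are hypothetical (`bigrams` and `stops` are made-up stand-ins, with the NA row assumed to come from a line too short to form a bigram), and the final `is.na()` filter is one possible form of the fix, not necessarily the exact one committed.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Hypothetical data: one NA bigram plus two ordinary bigrams.
bigrams <- tibble(bigram = c(NA, "sir thomas", "of the"))
stops <- c("of", "the")  # stand-in for stop_words$word

# separate() turns the NA bigram into NA in both word columns.
separated <- separate(bigrams, bigram, c("word1", "word2"), sep = " ")

# The stop-word filters drop "of the" but keep the NA row,
# because NA %in% stops is FALSE.
kept <- separated %>%
  filter(!word1 %in% stops) %>%
  filter(!word2 %in% stops)

# An explicit NA filter removes the remaining NA row.
cleaned <- kept %>%
  filter(!is.na(word1))
```

After the stop-word filters, `kept` still has two rows, one of them all NA; `cleaned` retains only the "sir thomas" row.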