Closed aratikrish closed 2 years ago
Thank you so much for the heads up here @arati2020! I decided to use filter() to remove both the NA bigrams and trigrams.
Hi Julia, I just realized that you do not need the filter that you added or the na.rm that I had suggested. You will not have any bigrams or trigrams with NA in them, as you are only looking for bigrams and trigrams. The NA issue arises only when you unnest into ngrams with n_min < n and then separate and unite.
Hmmm, I don't think so. Here are the results at line ~50 without the filter():
library(tidyr)

bigrams_separated <- austen_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

# new bigram counts:
bigram_counts <- bigrams_filtered %>%
  count(word1, word2, sort = TRUE)

bigram_counts
#> # A tibble: 28,975 × 3
#>    word1   word2         n
#>    <chr>   <chr>     <int>
#>  1 <NA>    <NA>      12242
#>  2 sir     thomas      266
#>  3 miss    crawford    196
#>  4 captain wentworth   143
#>  5 miss    woodhouse   143
#>  6 frank   churchill   114
#>  7 lady    russell     110
#>  8 sir     walter      108
#>  9 lady    bertram     101
#> 10 miss    fairfax      98
#> # … with 28,965 more rows
Ah, got it, and I agree that the filter is needed here. This is a different issue from the one I was encountering, where only one of the words in an ngram is NA; that should not arise here.
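For anyone following along, here is a minimal sketch of the kind of filter being discussed. It uses a small made-up tibble rather than the Austen data, and the names `toy_lines`, `toy_bigrams`, and `toy_bigrams_clean` are hypothetical. The assumption is that lines with fewer than two words (such as blank lines) are what produce the NA bigrams; whether such lines yield an NA row may depend on the installed tidytext and tokenizers versions, so the filter is written to be harmless either way.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy input: one normal line, one blank line, one single-word line
toy_lines <- tibble(
  line = 1:3,
  text = c("sir thomas bertram", "", "emma")
)

# Tokenize into bigrams; lines with fewer than two words cannot form a bigram
toy_bigrams <- toy_lines %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

# Drop any NA bigrams before separating and counting
toy_bigrams_clean <- toy_bigrams %>%
  filter(!is.na(bigram))

toy_bigrams_clean
```

Filtering on the bigram column before separate() means the NA rows never reach the count() step.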
Added na.rm = TRUE to the unite statement so that bigrams where there is only 1 word do not have NAs in them.
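As a rough illustration of the other case (n_min < n, then separate and unite), the sketch below uses a made-up one-line tibble. The names `docs`, `ngrams`, `separated`, `united_with_na`, and `united_clean` are hypothetical, and `fill = "right"` is my own addition so that separate() pads the unigrams with NA without a warning. It assumes a tidyr version in which unite() supports `na.rm`.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical toy input, not the Austen data
docs <- tibble(id = 1, text = "sir thomas bertram")

# With n_min = 1 and n = 2 the result mixes unigrams and bigrams
ngrams <- docs %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 2, n_min = 1)

# Unigrams have no second word, so word2 is NA for those rows
separated <- ngrams %>%
  separate(ngram, c("word1", "word2"), sep = " ", fill = "right")

# Without na.rm, the missing word is pasted in as the literal text "NA"
united_with_na <- separated %>%
  unite(ngram, word1, word2, sep = " ")

# With na.rm = TRUE, missing words are dropped before uniting
united_clean <- separated %>%
  unite(ngram, word1, word2, sep = " ", na.rm = TRUE)
```

Comparing `united_with_na$ngram` with `united_clean$ngram` shows the difference for the one-word rows.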