Exploring removed stop words

Have you considered incorporating an exploration of the words that get removed when you remove stop words?

It is similar to looking through the stop word list itself (which you always should), but it is a more limited and manageable approach, since you only look at the words that are actually affected.

library(tidyverse)
library(tidytext)
library(janeaustenr)

data <- tibble(text = emma) %>%
  unnest_tokens(word, text)

## This step would be added

right_join(data, stop_words, by = "word") %>%
  count(word, sort = TRUE)
#> # A tibble: 728 x 2
#>    word      n
#>    <chr> <int>
#>  1 to    15717
#>  2 the   15603
#>  3 and   14688
#>  4 of    12873
#>  5 i      9531
#>  6 a      9387
#>  7 it     7584
#>  8 her    7386
#>  9 was    7194
#> 10 she    7020
#> # ... with 718 more rows

anti_join(data, stop_words, by = "word")
#> # A tibble: 46,775 x 1
#>    word     
#>    <chr>    
#>  1 emma     
#>  2 jane     
#>  3 austen   
#>  4 volume   
#>  5 chapter  
#>  6 emma     
#>  7 woodhouse
#>  8 handsome 
#>  9 clever   
#> 10 rich     
#> # ... with 46,765 more rows
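
One caveat with the `right_join()` step above: `stop_words` combines three lexicons (`onix`, `SMART` and `snowball`), so a word listed in more than one lexicon is matched once per lexicon and its count is inflated accordingly. `right_join()` also keeps stop words that never occur in the text. A minimal sketch of an alternative, assuming the goal is simply the frequency of each removed word, uses `semi_join()` instead (the name `removed_words` is only illustrative):

```r
library(dplyr)
library(tidytext)
library(janeaustenr)

data <- tibble(text = emma) %>%
  unnest_tokens(word, text)

# semi_join() keeps the rows of `data` whose word appears in stop_words,
# without duplicating rows for words listed in several lexicons
removed_words <- semi_join(data, stop_words, by = "word") %>%
  count(word, sort = TRUE)

removed_words
```

Since `semi_join()` and `anti_join()` split `data` into two complementary sets of rows, the counts in `removed_words` should add up to exactly the number of tokens that `anti_join()` drops.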

Created on 2018-09-26 by the reprex package (v0.2.1)
