dgrtwo / tidy-text-mining

Manuscript of the book "Tidy Text Mining with R" by Julia Silge and David Robinson

Exploring removed stop words #51

Closed EmilHvitfeldt closed 4 years ago

EmilHvitfeldt commented 6 years ago

Have you considered incorporating an exploration of the words that get removed when you remove stop words?

This is similar to reading through the stop word list itself (which you always should), but it is a more limited and reasonable approach, since you only look at the words that are actually affected.


library(dplyr)
library(tidytext)
library(janeaustenr)  # provides the `emma` character vector

data <- tibble(text = emma) %>%
  unnest_tokens(word, text)

## This step would be added

right_join(data, stop_words, by = "word") %>%
  count(word, sort = TRUE)
#> # A tibble: 728 x 2
#>    word      n
#>    <chr> <int>
#>  1 to    15717
#>  2 the   15603
#>  3 and   14688
#>  4 of    12873
#>  5 i      9531
#>  6 a      9387
#>  7 it     7584
#>  8 her    7386
#>  9 was    7194
#> 10 she    7020
#> # ... with 718 more rows

anti_join(data, stop_words, by = "word")
#> # A tibble: 46,775 x 1
#>    word     
#>    <chr>    
#>  1 emma     
#>  2 jane     
#>  3 austen   
#>  4 volume   
#>  5 chapter  
#>  6 emma     
#>  7 woodhouse
#>  8 handsome 
#>  9 clever   
#> 10 rich     
#> # ... with 46,765 more rows

Created on 2018-09-26 by the reprex package (v0.2.1)
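
One caveat about the join above: `right_join()` returns a row for every match in `stop_words`, and `stop_words` lists some words under more than one lexicon, so the counts it produces can be inflated. It also keeps stop words that never occur in the text. A minimal sketch of an alternative, assuming the goal is to count only the tokens that `anti_join()` would drop, is to use `semi_join()` instead. The toy sentence below is a hypothetical stand-in for the tokenized novel.

```r
library(dplyr)
library(tidytext)

# Toy input (hypothetical); swap in the tokenized text from the example above.
data <- tibble(text = "To be sure, Emma was handsome, clever, and rich") %>%
  unnest_tokens(word, text)

# semi_join() keeps each row of `data` whose word appears in `stop_words`.
# It never duplicates a row, even when a word is listed in several lexicons,
# and it never adds stop words that are absent from the text.
removed <- data %>%
  semi_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

# The complementary set: tokens that survive stop word removal.
kept <- anti_join(data, stop_words, by = "word")

removed
kept
```

Since every token lands in exactly one of the two sets, `sum(removed$n) + nrow(kept)` should equal `nrow(data)`, which makes a quick sanity check on the split.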

juliasilge commented 4 years ago

A detailed discussion of stop words is now available in the stop words chapter of SMLTAR (Supervised Machine Learning for Text Analysis in R).