Removing stop words-> which ones have I removed?

GabriellaS-K commented 4 years ago

Hi,

Firstly, thank you for the Text Mining with R book. It has proved a fantastic resource to me; I am just starting with text analysis, so I have used it a lot.

This isn't an issue with anything; I was just wondering if anyone knows of a way to see which stop words have been removed from the data. I understand that the stop_words dataset in the tidytext package contains stop words from three lexicons, but I am interested in knowing (and amending for my data if needed) which of those words were actually found in, and removed from, my dataset.

Thanks a lot! Gabriella

juliasilge commented 4 years ago

You can use semi_join() to get that out. It keeps only the rows of the stop word list whose word also appears in your tokenized data:

library(tidyverse)
library(tidytext)

data("prideprejudice", package = "janeaustenr")

tidy_p_and_p <- tibble(text = prideprejudice) %>%
  unnest_tokens(word, text) 

## remove stop words
tidy_p_and_p %>%
  anti_join(get_stopwords())
#> Joining, by = "word"
#> # A tibble: 54,831 x 1
#>    word        
#>    <chr>       
#>  1 pride       
#>  2 prejudice   
#>  3 jane        
#>  4 austen      
#>  5 chapter     
#>  6 1           
#>  7 truth       
#>  8 universally 
#>  9 acknowledged
#> 10 single      
#> # … with 54,821 more rows

## which stop words are in Pride & Prejudice?
get_stopwords() %>%
  semi_join(tidy_p_and_p)
#> Joining, by = "word"
#> # A tibble: 134 x 2
#>    word      lexicon 
#>    <chr>     <chr>   
#>  1 i         snowball
#>  2 me        snowball
#>  3 my        snowball
#>  4 myself    snowball
#>  5 we        snowball
#>  6 our       snowball
#>  7 ours      snowball
#>  8 ourselves snowball
#>  9 you       snowball
#> 10 your      snowball
#> # … with 124 more rows

Created on 2020-07-24 by the reprex package (v0.3.0.9001)
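If you also want to see how often each stop word occurs, or amend the list for your own data, something along these lines might work. This is a rough sketch and not part of the reprex above: the sample text, the decision to keep "not", and the extra stop word "chapter" (with its made-up "custom" lexicon label) are all illustrative assumptions, so swap in whatever fits your data.

```r
library(dplyr)
library(tidytext)

# Tiny made-up sample so the sketch runs on its own; use your own tokens instead.
tidy_text <- tibble(text = c("Chapter 1",
                             "It is a truth universally acknowledged",
                             "I am not proud")) %>%
  unnest_tokens(word, text)

# How often does each stop word occur? These are the rows anti_join() would drop.
removed_counts <- tidy_text %>%
  inner_join(get_stopwords(), by = "word") %>%
  count(word, sort = TRUE)

# Amend the list (hypothetical choices): keep "not" in the data,
# and additionally treat "chapter" as a stop word.
custom_stopwords <- get_stopwords() %>%
  filter(word != "not") %>%
  bind_rows(tibble(word = "chapter", lexicon = "custom"))

# Remove stop words using the amended list.
kept <- tidy_text %>%
  anti_join(custom_stopwords, by = "word")
```

Passing `by = "word"` explicitly avoids the "Joining, by" message and makes it clear which column the stop words are matched on.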

GabriellaS-K commented 4 years ago

Wonderful, thank you so much!!!

dgrtwo / tidy-text-mining

Removing stop words-> which ones have I removed? #77