Closed GabriellaS-K closed 4 years ago
You can use a semi_join() to get that out:
library(tidyverse)
library(tidytext)
data("prideprejudice", package = "janeaustenr")
tidy_p_and_p <- tibble(text = prideprejudice) %>%
  unnest_tokens(word, text)
## remove stop words
tidy_p_and_p %>%
  anti_join(get_stopwords())
#> Joining, by = "word"
#> # A tibble: 54,831 x 1
#>    word
#>    <chr>
#>  1 pride
#>  2 prejudice
#>  3 jane
#>  4 austen
#>  5 chapter
#>  6 1
#>  7 truth
#>  8 universally
#>  9 acknowledged
#> 10 single
#> # … with 54,821 more rows
## which stop words are in Pride & Prejudice?
get_stopwords() %>%
  semi_join(tidy_p_and_p)
#> Joining, by = "word"
#> # A tibble: 134 x 2
#>    word      lexicon
#>    <chr>     <chr>
#>  1 i         snowball
#>  2 me        snowball
#>  3 my        snowball
#>  4 myself    snowball
#>  5 we        snowball
#>  6 our       snowball
#>  7 ours      snowball
#>  8 ourselves snowball
#>  9 you       snowball
#> 10 your      snowball
#> # … with 124 more rows
Created on 2020-07-24 by the reprex package (v0.3.0.9001)
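If it helps with the "amending for my data" part of the question, here is a minimal sketch of one way to count how often each stop word was removed and then adjust the list before removing anything. The names keep_words, extra_words, and custom_stopwords, and the particular words in them, are made up for illustration and are not part of tidytext; swap in whatever suits your data.

```r
library(dplyr)
library(tidytext)

data("prideprejudice", package = "janeaustenr")

tidy_p_and_p <- tibble(text = prideprejudice) %>%
  unnest_tokens(word, text)

## how often does each stop word occur in the text (i.e. how much is removed)?
removed_counts <- tidy_p_and_p %>%
  semi_join(get_stopwords(), by = "word") %>%
  count(word, sort = TRUE)

## hypothetical adjustments, for illustration only
keep_words  <- c("not", "no")   # stop words you want to leave in the text
extra_words <- c("mr", "mrs")   # extra words you want removed as well

## drop the words to keep, then append the extra words
custom_stopwords <- get_stopwords() %>%
  filter(!word %in% keep_words) %>%
  bind_rows(tibble(word = extra_words, lexicon = "custom"))

## remove the amended stop word list from the text
cleaned <- tidy_p_and_p %>%
  anti_join(custom_stopwords, by = "word")
```

Passing by = "word" explicitly just silences the "Joining, by" message; the result is the same as in the reprex above.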
Wonderful, thank you so much!!!
Hi,
Firstly, thank you for the Text Mining with R book. It has proved a fantastic resource to me; I am just starting with text analysis, so I have used it a lot.
This isn't an issue with anything; I was just wondering if anyone knows of a way to see which stop words have been removed from the data. I understand that the stop_words dataset in the tidytext package contains stop words from three lexicons, but I am interested in knowing (and amending for my data if needed) which of those words were actually found in my dataset and removed as a result.
Thanks a lot! Gabriella