dgrtwo / tidy-text-mining

Manuscript of the book "Tidy Text Mining with R" by Julia Silge and David Robinson
http://tidytextmining.com

tidy_tweets - invalid argument type #39

cpjfb closed this issue 7 years ago

cpjfb commented 7 years ago

Hi

First, congratulations, I'm loving this book! Great work.

Now, when I run the chunk that calls unnest_tokens with the regex on tweets to get tidy_tweets (in 07-tweet-archives.Rmd), I get "Error: invalid argument type".

Any idea why?

Thanks!

juliasilge commented 7 years ago

Hmmmm, I'm not able to reproduce this problem. This is the code you mean, right?

library(lubridate)
library(ggplot2)
library(dplyr)
library(readr)

tweets_julia <- read_csv("data/tweets_julia.csv")
#> Parsed with column specification:
#> cols(
#>   tweet_id = col_double(),
#>   in_reply_to_status_id = col_double(),
#>   in_reply_to_user_id = col_double(),
#>   timestamp = col_character(),
#>   source = col_character(),
#>   text = col_character(),
#>   retweeted_status_id = col_double(),
#>   retweeted_status_user_id = col_double(),
#>   retweeted_status_timestamp = col_character(),
#>   expanded_urls = col_character()
#> )
tweets_dave <- read_csv("data/tweets_dave.csv")
#> Parsed with column specification:
#> cols(
#>   tweet_id = col_double(),
#>   in_reply_to_status_id = col_double(),
#>   in_reply_to_user_id = col_double(),
#>   timestamp = col_character(),
#>   source = col_character(),
#>   text = col_character(),
#>   retweeted_status_id = col_double(),
#>   retweeted_status_user_id = col_double(),
#>   retweeted_status_timestamp = col_character(),
#>   expanded_urls = col_character()
#> )
tweets <- bind_rows(tweets_julia %>% 
                      mutate(person = "Julia"),
                    tweets_dave %>% 
                      mutate(person = "David")) %>%
  mutate(timestamp = ymd_hms(timestamp))

library(tidytext)
library(stringr)

replace_reg <- "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT|https"
unnest_reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"
tidy_tweets <- tweets %>% 
  filter(!str_detect(text, "^RT")) %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = unnest_reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))

tidy_tweets
#> # A tibble: 149,144 x 11
#>    tweet_… in_re… in_r… timestamp           source retw… retw… retw… expa…
#>      <dbl>  <dbl> <dbl> <dttm>              <chr>  <dbl> <dbl> <chr> <chr>
#>  1  6.78e⁸     NA    NA 2008-02-05 00:00:00 "<a h…    NA    NA <NA>  <NA> 
#>  2  6.78e⁸     NA    NA 2008-02-05 00:00:00 "<a h…    NA    NA <NA>  <NA> 
#>  3  6.78e⁸     NA    NA 2008-02-05 00:00:00 "<a h…    NA    NA <NA>  <NA> 
#>  4  6.78e⁸     NA    NA 2008-02-05 00:00:00 "<a h…    NA    NA <NA>  <NA> 
#>  5  6.78e⁸     NA    NA 2008-02-05 00:00:00 "<a h…    NA    NA <NA>  <NA> 
#>  6  6.78e⁸     NA    NA 2008-02-05 00:00:00 "<a h…    NA    NA <NA>  <NA> 
#>  7  6.78e⁸     NA    NA 2008-02-05 00:00:00 "<a h…    NA    NA <NA>  <NA> 
#>  8  6.78e⁸     NA    NA 2008-02-05 00:00:00 "<a h…    NA    NA <NA>  <NA> 
#>  9  6.78e⁸     NA    NA 2008-02-05 00:00:00 "<a h…    NA    NA <NA>  <NA> 
#> 10  6.78e⁸     NA    NA 2008-02-05 00:00:00 "<a h…    NA    NA <NA>  <NA> 
#> # ... with 149,134 more rows, and 2 more variables: person <chr>, word
#> #   <chr>
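
To see what the splitting pattern does on its own, here is a minimal sketch in base R. It only approximates the regex tokenizing step (lowercase, then split on the pattern) with `strsplit()` rather than calling `unnest_tokens()`, and the example tweet text is made up for illustration.

```r
# Same splitting pattern as in the code above
unnest_reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"

# Hypothetical example tweet, not taken from the tweet archives
example_text <- "Loving #rstats with @juliasilge, isn't it great?"

# Lowercase the text, then split on the pattern.
# perl = TRUE is needed because the pattern uses a negative lookahead.
pieces <- strsplit(tolower(example_text), unnest_reg, perl = TRUE)[[1]]

# Adjacent separators (", " for example) leave empty strings behind; drop them
tokens <- pieces[nzchar(pieces)]
print(tokens)
```

Hashtags, @-mentions, and contractions should stay in one piece, because `#`, `@`, and an apostrophe followed by a word character are all excluded from the split pattern.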

I would recommend starting by updating your package versions. I'm not sure what the problem might be.
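
A quick way to check which versions are installed before updating is sketched below. The package list is an assumption based on the `library()` calls in the code above, and the update line is left commented out so nothing gets installed by accident.

```r
# Packages loaded in the code above (assumed to be the relevant ones)
pkgs <- c("lubridate", "ggplot2", "dplyr", "readr", "tidytext", "stringr")

# Names of everything currently installed in the library paths
installed <- rownames(installed.packages())

# Installed version of each package, or NA if it is not installed
versions <- vapply(
  pkgs,
  function(p) {
    if (p %in% installed) as.character(packageVersion(p)) else NA_character_
  },
  character(1)
)
print(versions)

# To update (or install) them, uncomment the next line:
# install.packages(pkgs)
```

Running `sessionInfo()` after loading the packages and pasting its output into the issue also makes version mismatches easier to spot.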

cpjfb commented 7 years ago

Oops, yes, that was the issue. It works like a charm now that I've updated all my packages.

Thanks a lot!