cbail / textnets

R package to perform automated text analysis using network techniques
MIT License

Error with PrepText function using example sotu data #11

Open angelhsu05 opened 4 years ago

angelhsu05 commented 4 years ago

I’m getting a check_input error when I try to run PrepText using the sotu example, even though the text is of type character. I tried creating my own tf-idf data frame using tidytext so I could still use the visualization functions in this package, but I wasn’t sure what the outputs of PrepText and CreateTextnet look like, so I couldn’t troubleshoot. Thanks for your help!

Dat-Vuong07 commented 4 years ago

Hi,

I have the same problem. Here is what I got when I tried to run this code:

sotu_firsts_nouns <- PrepText(textdata = sotu_firsts, groupvar = "president", textvar = "sotu_text", node_type = "groups", tokenizer = "words", pos = "nouns", remove_stop_words = TRUE, compound_nouns = TRUE)

Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.4/master/inst/udpipe-ud-2.4-190531/english-ewt-ud-2.4-190531.udpipe to /Users/vuongdat/OneDrive - Aarhus universitet/Applied Data Science/Share Data Folder/2. Raw Data/Reddit Data/english-ewt-ud-2.4-190531.udpipe
Visit https://github.com/jwijffels/udpipe.models.ud.2.4 for model license details
trying URL 'https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.4/master/inst/udpipe-ud-2.4-190531/english-ewt-ud-2.4-190531.udpipe'
Content type 'application/octet-stream' length 16477964 bytes (15.7 MB)
==================================================
downloaded 15.7 MB
Error in check_input(x) : 
  Input must be a character vector of any length or a list of character
  vectors, each of which has a length of 1.
In addition: Warning message:
'unnest_tokens_' is deprecated.
Use 'unnest_tokens' instead.
See help("Deprecated") 

Here is my session information.

Thank you in advance,

─ Session info ──────────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.6.1 (2019-07-05)
 os       macOS Catalina 10.15.3      
 system   x86_64, darwin15.6.0        
 ui       RStudio                     
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       Europe/Copenhagen           
 date     2020-05-03   

─ Packages ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
 package      * version  date       lib source                         
 assertthat     0.2.1    2019-03-21 [1] CRAN (R 3.6.0)                 
 backports      1.1.6    2020-04-05 [1] CRAN (R 3.6.2)                 
 callr          3.4.3    2020-03-28 [1] CRAN (R 3.6.2)                 
 cli            2.0.2    2020-02-28 [1] CRAN (R 3.6.0)                 
 colorspace     1.4-1    2019-03-18 [1] CRAN (R 3.6.0)                 
 crayon         1.3.4    2017-09-16 [1] CRAN (R 3.6.0)                 
 curl           4.3      2019-12-02 [1] CRAN (R 3.6.0)                 
 data.table     1.12.8   2019-12-09 [1] CRAN (R 3.6.0)                 
 desc           1.2.0    2018-05-01 [1] CRAN (R 3.6.0)                 
 devtools     * 2.2.1    2019-09-24 [1] CRAN (R 3.6.0)                 
 digest         0.6.25   2020-02-23 [1] CRAN (R 3.6.0)                 
 dplyr        * 0.8.5    2020-03-07 [1] CRAN (R 3.6.0)                 
 ellipsis       0.3.0    2019-09-20 [1] CRAN (R 3.6.0)                 
 fansi          0.4.1    2020-01-08 [1] CRAN (R 3.6.0)                 
 farver         2.0.3    2020-01-16 [1] CRAN (R 3.6.0)                 
 fs             1.4.1    2020-04-04 [1] CRAN (R 3.6.2)                 
 generics       0.0.2    2018-11-29 [1] CRAN (R 3.6.0)                 
 ggforce        0.3.1    2019-08-20 [1] CRAN (R 3.6.0)                 
 ggplot2      * 3.3.0    2020-03-05 [1] CRAN (R 3.6.0)                 
 ggraph       * 2.0.2    2020-03-17 [1] CRAN (R 3.6.0)                 
 ggrepel        0.8.2    2020-03-08 [1] CRAN (R 3.6.0)                 
 glue           1.4.0    2020-04-03 [1] CRAN (R 3.6.2)                 
 graphlayouts   0.7.0    2020-04-25 [1] CRAN (R 3.6.2)                 
 gridExtra      2.3      2017-09-09 [1] CRAN (R 3.6.0)                 
 gtable         0.3.0    2019-03-25 [1] CRAN (R 3.6.0)                 
 htmltools      0.4.0    2019-10-04 [1] CRAN (R 3.6.0)                 
 htmlwidgets    1.5.1    2019-10-08 [1] CRAN (R 3.6.1)                 
 igraph         1.2.5    2020-03-19 [1] CRAN (R 3.6.0)                 
 janeaustenr    0.1.5    2017-06-10 [1] CRAN (R 3.6.0)                 
 lattice        0.20-38  2018-11-04 [1] CRAN (R 3.6.1)                 
 lifecycle      0.2.0    2020-03-06 [1] CRAN (R 3.6.0)                 
 magrittr       1.5      2014-11-22 [1] CRAN (R 3.6.0)                 
 MASS           7.3-51.4 2019-03-31 [1] CRAN (R 3.6.1)                 
 Matrix         1.2-17   2019-03-22 [1] CRAN (R 3.6.1)                 
 memoise        1.1.0    2017-04-21 [1] CRAN (R 3.6.0)                 
 munsell        0.5.0    2018-06-12 [1] CRAN (R 3.6.0)                 
 networkD3    * 0.4      2017-03-18 [1] CRAN (R 3.6.0)                 
 pillar         1.4.3    2019-12-20 [1] CRAN (R 3.6.1)                 
 pkgbuild       1.0.7    2020-04-25 [1] CRAN (R 3.6.2)                 
 pkgconfig      2.0.3    2019-09-22 [1] CRAN (R 3.6.1)                 
 pkgload        1.0.2    2018-10-29 [1] CRAN (R 3.6.0)                 
 plyr           1.8.6    2020-03-03 [1] CRAN (R 3.6.0)                 
 polyclip       1.10-0   2019-03-14 [1] CRAN (R 3.6.0)                 
 prettyunits    1.1.1    2020-01-24 [1] CRAN (R 3.6.1)                 
 processx       3.4.2    2020-02-09 [1] CRAN (R 3.6.0)                 
 ps             1.3.2    2020-02-13 [1] CRAN (R 3.6.0)                 
 purrr          0.3.4    2020-04-17 [1] CRAN (R 3.6.2)                 
 R6             2.4.1    2019-11-12 [1] CRAN (R 3.6.1)                 
 Rcpp           1.0.4    2020-03-17 [1] CRAN (R 3.6.0)                 
 remotes        2.1.0    2019-06-24 [1] CRAN (R 3.6.0)                 
 reshape2       1.4.4    2020-04-09 [1] CRAN (R 3.6.2)                 
 rlang          0.4.5    2020-03-01 [1] CRAN (R 3.6.0)                 
 rprojroot      1.3-2    2018-01-03 [1] CRAN (R 3.6.0)                 
 rstudioapi     0.11     2020-02-07 [1] CRAN (R 3.6.0)                 
 rversions      2.0.1    2019-12-03 [1] CRAN (R 3.6.1)                 
 scales         1.1.0    2019-11-18 [1] CRAN (R 3.6.1)                 
 sessioninfo    1.1.1    2018-11-05 [1] CRAN (R 3.6.0)                 
 SnowballC      0.7.0    2020-04-01 [1] CRAN (R 3.6.2)                 
 stringi        1.4.6    2020-02-17 [1] CRAN (R 3.6.0)                 
 stringr        1.4.0    2019-02-10 [1] CRAN (R 3.6.0)                 
 testthat       2.3.2    2020-03-02 [1] CRAN (R 3.6.0)                 
 textnets     * 0.1.1    2020-05-03 [1] Github (cbail/textnets@bc688a8)
 tibble         3.0.1    2020-04-20 [1] CRAN (R 3.6.2)                 
 tidygraph      1.1.2    2019-02-18 [1] CRAN (R 3.6.0)                 
 tidyr          1.0.2    2020-01-24 [1] CRAN (R 3.6.0)                 
 tidyselect     1.0.0    2020-01-27 [1] CRAN (R 3.6.0)                 
 tidytext     * 0.2.4    2020-04-17 [1] CRAN (R 3.6.2)                 
 tokenizers     0.2.1    2018-03-29 [1] CRAN (R 3.6.0)                 
 tweenr         1.0.1    2018-12-14 [1] CRAN (R 3.6.0)                 
 udpipe       * 0.8.3    2019-07-05 [1] CRAN (R 3.6.0)                 
 usethis      * 1.6.1    2020-04-29 [1] CRAN (R 3.6.2)                 
 vctrs          0.2.4    2020-03-10 [1] CRAN (R 3.6.0)                 
 viridis        0.5.1    2018-03-29 [1] CRAN (R 3.6.0)                 
 viridisLite    0.3.0    2018-02-01 [1] CRAN (R 3.6.0)                 
 withr          2.2.0    2020-04-20 [1] CRAN (R 3.6.2)                 
 xml2           1.2.2    2019-08-09 [1] CRAN (R 3.6.0)

angelhsu05 commented 4 years ago

Same here. Even though the data are characters or character lists, the same error persists no matter which dataset I use.

angelhsu05 commented 4 years ago

It seems to be an issue with unnest_tokens within the function. There are some help pages, but it's not clear how to resolve it. Still looking into it ...

davidycliao commented 4 years ago

@Dat-Vuong07 @angelhsu05 Did you use the dataset from the package? If you are using your own text data, first try loading it with read_csv from the tidyverse instead of read.csv from base R. read_csv detects column types automatically and keeps text as character. You may still need to double-check that the textvar column is of type character.

If that is not the problem, please turn compound_nouns off by setting compound_nouns = FALSE. I guess you are probably using a language other than English. The package parses the text with the udpipe library, which works from a pre-trained language model in CoNLL-U format and returns a dep_rel column that stores the dependency relations. In that case, there may be no compound relation in the data.
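For readers who want to run the type check suggested above, here is a minimal base R sketch. The data frame docs and its speech column are hypothetical stand-ins, not part of textnets. It relies on the fact that, before R 4.0, data.frame and read.csv defaulted to stringsAsFactors = TRUE, so a text column could silently arrive as a factor.

```r
# Hypothetical data frame (not from textnets). stringsAsFactors = TRUE
# reproduces the pre-R 4.0 default, which turns the text column into a factor.
docs <- data.frame(
  president = c("A", "B"),
  speech    = c("first text", "second text"),
  stringsAsFactors = TRUE
)

# Check the type of the column that would be passed as textvar.
is_char_before <- is.character(docs$speech)  # FALSE: the column is a factor

# Convert the text column to character before handing it to a tokenizer.
docs$speech <- as.character(docs$speech)
is_char_after <- is.character(docs$speech)   # TRUE
```

Note that this check would not explain a failure with the packaged sotu data, whose sotu_text column is already of type character.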

angelhsu05 commented 4 years ago

Yes, I was using the 'sotu' dataset from the package, so I am not reading the data in from a csv but loading it directly from the package. Setting compound_nouns = FALSE doesn't work either.

I can use the tidytext and udpipe libraries instead of the PrepText function to get the tidytext object needed by the other functions, but it would be great to get PrepText working, as I'm new to network analysis.

I am also getting a message that unnest_tokens_ is deprecated and that I should use unnest_tokens instead, so I updated PrepText accordingly. I'm still getting the same error about the data not being a character vector or a list of character vectors, which it clearly is in the sotu dataset.

davidycliao commented 4 years ago

Have you checked the data type of the sotu data being loaded? In particular, check the textvar that you are going to use.

Dat-Vuong07 commented 4 years ago

Hi @yl17124 ,

I also have the same problem.

Here is my code. I already set compound_nouns = FALSE.

library(textnets)
data("sotu")

str(sotu)

sotu_firsts <- sotu %>% group_by(president) %>% slice(1L)

sotu_firsts_nouns <- PrepText(sotu_firsts, groupvar = "president", 
                              textvar = "sotu_text", 
                              node_type = "groups", 
                              tokenizer = "words", 
                              pos = "nouns", 
                              remove_stop_words = TRUE, compound_nouns = FALSE)

And these are the results:

1. The sotu input looks fine, with the sotu_text variable as character

> str(sotu)
'data.frame':   236 obs. of  6 variables:
 $ sotu_text   : chr  "Fellow-Citizens of the Senate and House of Representatives: \n\nI embrace with great satisfaction the opportuni"| __truncated__ "\n\n Fellow-Citizens of the Senate and House of Representatives: \n\nIn meeting you again I feel much satisfact"| __truncated__ "\n\n Fellow-Citizens of the Senate and House of Representatives: \n\n \"In vain may we expect peace with the In"| __truncated__ "Fellow-Citizens of the Senate and House of Representatives: \n\nIt is some abatement of the satisfaction with w"| __truncated__ ...
 $ president   : chr  "George Washington" "George Washington" "George Washington" "George Washington" ...
 $ year        : int  1790 1790 1791 1792 1793 1794 1795 1796 1797 1798 ...
 $ years_active: chr  "1789-1793" "1789-1793" "1789-1793" "1789-1793" ...
 $ party       : chr  "Nonpartisan" "Nonpartisan" "Nonpartisan" "Nonpartisan" ...
 $ sotu_type   : chr  "speech" "speech" "speech" "speech" ...

2. However, it can't run the PrepText function

Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.4/master/inst/udpipe-ud-2.4-190531/english-ewt-ud-2.4-190531.udpipe to /Users/vuongdat/OneDrive - Aarhus universitet/Applied Data Science/Share Data Folder/2. Raw Data/Reddit Data/english-ewt-ud-2.4-190531.udpipe
Visit https://github.com/jwijffels/udpipe.models.ud.2.4 for model license details
trying URL 'https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.4/master/inst/udpipe-ud-2.4-190531/english-ewt-ud-2.4-190531.udpipe'
Content type 'application/octet-stream' length 16477964 bytes (15.7 MB)
==================================================
downloaded 15.7 MB

Error in check_input(x) : 
  Input must be a character vector of any length or a list of character
  vectors, each of which has a length of 1.
In addition: Warning message:
'unnest_tokens_' is deprecated.
Use 'unnest_tokens' instead.
See help("Deprecated") 

Below is my session Information

R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.3

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] textnets_0.1.1 networkD3_0.4  ggraph_2.0.2   ggplot2_3.3.0  udpipe_0.8.3   dplyr_0.8.5   

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4         plyr_1.8.6         pillar_1.4.3       compiler_3.6.1     tokenizers_0.2.1   viridis_0.5.1     
 [7] tools_3.6.1        digest_0.6.25      viridisLite_0.3.0  lifecycle_0.2.0    tibble_3.0.1       gtable_0.3.0      
[13] lattice_0.20-38    pkgconfig_2.0.3    rlang_0.4.5        Matrix_1.2-17      tidygraph_1.1.2    igraph_1.2.5      
[19] rstudioapi_0.11    ggrepel_0.8.2      gridExtra_2.3      stringr_1.4.0      janeaustenr_0.1.5  withr_2.2.0       
[25] generics_0.0.2     htmlwidgets_1.5.1  graphlayouts_0.7.0 vctrs_0.2.4        grid_3.6.1         tidyselect_1.0.0  
[31] glue_1.4.0         data.table_1.12.8  R6_2.4.1           polyclip_1.10-0    reshape2_1.4.4     purrr_0.3.4       
[37] tidyr_1.0.2        tweenr_1.0.1       farver_2.0.3       magrittr_1.5       SnowballC_0.7.0    htmltools_0.4.0   
[43] scales_1.1.0       ellipsis_0.3.0     MASS_7.3-51.4      tidytext_0.2.4     assertthat_0.2.1   ggforce_0.3.1     
[49] colorspace_1.4-1   stringi_1.4.6      munsell_0.5.0      crayon_1.3.4      

angelhsu05 commented 4 years ago

yes, the textvar for the sotu example is type character.

sotork commented 4 years ago

You need to change the "sotu_text" variable name to "textvar" to make it work.

I had the same issue, and it only made sense once I looked under the hood of the PrepText function and saw how the input was evaluated in the unnest_tokens_ function (which is deprecated, BTW).
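To illustrate why renaming the column could make a difference, here is a base R sketch of literal versus indirect column lookup. It is only a plausible reading of the symptom, not the actual textnets or tidytext source. The names speeches, get_text_literal, and get_text_indirect are hypothetical.

```r
# Hypothetical stand-ins, not the actual textnets source.
speeches <- data.frame(
  president = c("A", "B"),
  sotu_text = c("first speech", "second speech"),
  stringsAsFactors = FALSE
)

# Literal lookup: `$` ignores the value of the argument and looks for a
# column that is literally named "textvar".
get_text_literal <- function(data, textvar) data$textvar

# Indirect lookup: `[[` uses the string stored in the argument.
get_text_indirect <- function(data, textvar) data[[textvar]]

literal  <- get_text_literal(speeches, "sotu_text")   # NULL: no such column
indirect <- get_text_indirect(speeches, "sotu_text")  # the character column

# Renaming the column to "textvar" makes the literal lookup succeed,
# which mirrors the workaround described in this thread.
renamed <- speeches
names(renamed)[names(renamed) == "sotu_text"] <- "textvar"
literal_after_rename <- get_text_literal(renamed, "textvar")
```

If something similar happens inside PrepText, the tokenizer would receive NULL rather than a character vector, which would be consistent with the "Input must be a character vector" message even though sotu_text itself is character.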

cbail commented 4 years ago

Hi folks- thanks for pitching in to answer this question. I'm sorry I was not available; I am busy planning a major educational event this summer (the Summer Institutes in Computational Social Science: https://compsocialscience.github.io/summer-institute/). Ok to close this now @angelhsu05? Or is it still not working?

angelhsu05 commented 4 years ago

Yes, that was the trick, thanks @sotork! Perhaps worth updating the example code snippet as below:

names(sotu_first_speeches)[1] <- "textvar"
prepped_sotu <- PrepText(sotu_first_speeches, groupvar = "president", textvar = "textvar", node_type = "groups", tokenizer = "words", pos = "nouns", remove_stop_words = TRUE, compound_nouns = TRUE)

davidycliao commented 4 years ago

@angelhsu05 Perhaps avoid naming your text variable text, so that it does not clash with the variable name text used by unnest_tokens in tidytext. It's worth looking through this.

cbail commented 4 years ago

Hi all- I'm unable to reproduce this error. I want to make sure I am following the solution correctly-- did you wind up changing textvar="textvar" in order to make it work? Does that mean you renamed the variable in the sotu dataset?

sotork commented 4 years ago

Yes Chris...I renamed it. It was "sotu_text", it needs to be "textvar" to make the function work.

kelseygonzalez commented 4 years ago

I was having the exact same issue as everyone else on this thread. Renaming sotu_text to textvar solved it for some mysterious reason, as others have discovered. It sounds like the underlying code needs to be updated to use 'unnest_tokens' (new) instead of 'unnest_tokens_' (deprecated).

This fails:

library(textnets)
data("sotu")
sotu_first_speeches <- sotu %>% 
  group_by(president) %>% 
  slice(1L) 
prepped_sotu <- PrepText(sotu_first_speeches, 
                         groupvar = "president", 
                         textvar = "textvar", 
                         node_type = "groups", 
                         tokenizer = "words", 
                         pos = "nouns", 
                         remove_stop_words = TRUE,
                         compound_nouns = TRUE)

This works:

library(textnets)
data("sotu")
sotu_first_speeches <- sotu %>% 
  group_by(president) %>% 
  slice(1L) %>% 
  ungroup() %>% 
  rename(textvar = sotu_text)
prepped_sotu <- PrepText(sotu_first_speeches, 
                         groupvar = "president", 
                         textvar = "textvar", 
                         node_type = "groups", 
                         tokenizer = "words", 
                         pos = "nouns", 
                         remove_stop_words = TRUE,
                         compound_nouns = TRUE)

cbail commented 4 years ago

Hi all: I just pushed a fix for this (it turned out to be an issue with variable indirection created by R 4.0). Could one or more of you please try a fresh install (and try rerunning the example code) to verify that you no longer need to change the textvar column name as @kelseygonzalez did above? Thank you!

kelseygonzalez commented 4 years ago

It runs without the textvar column name now!