bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

Error in `[.data.table` when using special characters #99

Closed: etiennebacher closed this issue 2 years ago

etiennebacher commented 2 years ago

Hello,

I may have found a bug that was introduced in version 0.8.6 (the latest version on CRAN at the time of writing). Text containing special characters produces the following error:

library(udpipe)
library(tm)

# Text data
textData <- data.frame(
  doc_id = 1,
  text = "tradução"
)

# Download and load model
udModel <- udpipe_download_model(language  = "portuguese-gsd", 
                                 model_dir = getwd())

udModel <- udpipe_load_model('portuguese-gsd-ud-2.5-191206.udpipe')

# Make a corpus 
textCorp <- VCorpus(DataframeSource(textData))
text     <- lapply(textCorp, content)

text <- data.frame(doc_id = 1:nrow(textData), 
                   text   = unlist(text))

udpipe(text, object = udModel)
Error in `[.data.table`(out, , `:=`(term_id, 1L:.N), by = list(doc_id)) : 
  Supplied 2 items to be assigned to group 1 of size 0 in column 'term_id'. The RHS length must either be 1 (single values are ok) or match the LHS length exactly. If you wish to 'recycle' the RHS please use rep() explicitly to make this intent clear to readers of your code.
In addition: Warning message:
In strsplit(x$conllu, "\n", fixed = TRUE) : input string 1 is invalid UTF-8

The error is triggered by the letters "çã" in the text (removing them makes the error disappear). I think it originates from the following line in the source code: https://github.com/bnosac/udpipe/blob/fdcc4ccd0c1d1e8c37b32572ad04064ba6e1c694/R/udpipe_parse.R#L254

Removing fixed = TRUE in the line above removes the error. In case it helps, fixed = TRUE was introduced in c7557b6.

Session info
- Session info ---------------------------------------------------------
 setting  value                       
 version  R version 4.1.0 (2021-05-18)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  French_France.1252          
 ctype    French_France.1252          
 tz       Europe/Paris                
 date     2021-10-18                  

- Packages -------------------------------------------------------------
 package     * version date       lib source            
 cli           3.0.1   2021-07-17 [1] CRAN (R 4.1.1)    
 data.table    1.14.2  2021-09-27 [1] standard (@1.14.2)
 lattice       0.20-45 2021-09-22 [1] CRAN (R 4.1.1)    
 Matrix        1.3-4   2021-06-01 [1] CRAN (R 4.1.0)    
 NLP         * 0.2-1   2020-10-14 [1] standard (@0.2-1) 
 Rcpp          1.0.7   2021-07-07 [1] standard (@1.0.7) 
 rstudioapi    0.13    2020-11-12 [1] standard (@0.13)  
 sessioninfo   1.1.1   2018-11-05 [1] standard (@1.1.1) 
 slam          0.1-48  2020-12-03 [1] standard (@0.1-48)
 tm          * 0.7-8   2020-11-18 [1] standard (@0.7-8) 
 udpipe      * 0.8.6   2021-06-01 [1] standard (@0.8.6) 
 withr         2.4.2   2021-04-18 [1] CRAN (R 4.1.0)    
 xml2          1.3.2   2020-04-23 [1] CRAN (R 4.1.0)    

[1] C:/Users/etienne/Documents/R/R-4.1.0/library

Best,

jwijffels commented 2 years ago

What happens if you convert your text to UTF-8 encoding, as indicated in the help?
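For readers who want to check whether a string is actually UTF-8 before passing it on, here is a minimal base R sketch. It assumes the text reached R as Latin-1 / Windows-1252 bytes, which is plausible on the French_France.1252 locale shown in the session info but is not confirmed in this thread. The `iconv()` call only simulates that situation so the example behaves the same on any platform.

```r
# Simulate a string stored as Latin-1 bytes (assumption: this mimics what a
# Windows-1252 session may hand over). \u escapes keep the script itself
# independent of the file encoding.
x <- iconv("tradu\u00e7\u00e3o", from = "UTF-8", to = "latin1")

Encoding(x)    # declared encoding of the string
validUTF8(x)   # FALSE here: the Latin-1 bytes are not a valid UTF-8 sequence

# Convert to UTF-8 with base R
x_utf8 <- enc2utf8(x)

Encoding(x_utf8)
validUTF8(x_utf8)  # TRUE after conversion
```

`validUTF8()` is a quick way to spot input that would later cause an "invalid UTF-8" warning like the one above.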

etiennebacher commented 2 years ago

Indeed, using text = enc2utf8("tradução") works. Thanks, and sorry for the inconvenience.
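A sketch of the workaround applied to the whole data frame rather than to a single string is given below. `ensure_utf8()` is a hypothetical helper written for this illustration and is not part of udpipe. The model file name is the one from the original report and is assumed to be in the working directory. The annotation step is skipped if the package or the model file is missing.

```r
# Hypothetical helper (not a udpipe function): convert one column to UTF-8.
ensure_utf8 <- function(df, column = "text") {
  df[[column]] <- enc2utf8(as.character(df[[column]]))
  df
}

# Same data as in the report; the Latin-1 conversion simulates a
# Windows-1252 session (an assumption made for this example).
textData <- data.frame(
  doc_id = 1,
  text = iconv("tradu\u00e7\u00e3o", from = "UTF-8", to = "latin1"),
  stringsAsFactors = FALSE
)

textData <- ensure_utf8(textData)

# Model file name taken from the original report.
model_file <- "portuguese-gsd-ud-2.5-191206.udpipe"

# Annotate only if udpipe is installed and the model has been downloaded.
if (requireNamespace("udpipe", quietly = TRUE) && file.exists(model_file)) {
  udModel <- udpipe::udpipe_load_model(model_file)
  annotation <- udpipe::udpipe(textData, object = udModel)
  print(head(annotation))
}
```

Converting once, right after the data is read or built, keeps every later step (tm corpus, udpipe) working on UTF-8 input.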