gagolews / stringi

Fast and portable character string processing in R (with the Unicode ICU)
https://stringi.gagolewski.com/
Other
304 stars 44 forks source link

Force the copying of ALTREP data #354

Closed ltrgoddard closed 3 years ago

ltrgoddard commented 5 years ago

I'm getting embedded nul in string intermittently from stri_trans_tolower(), called via stringr::str_to_lower(). I can't share a reprex as it's private data, unfortunately, but I'm confident that there isn't actually any null character in the string. It occurs when lowercasing a column of email addresses in a large data frame (>500,000 rows) through dplyr::mutate(), almost always at row 180.

I think it's probably something to do with vroom's lazy-loading of character columns, as the issue goes away when I ingest the data withreadr::read_tsv() instead. Opening an issue here first as it's a stringi error, but let me know if it seems to be more on the vroom side.

Possibly related to #319?

gagolews commented 5 years ago

Apologies, I'm gonna need a fully reproducible example.

ltrgoddard commented 5 years ago

Fair enough. I'm fairly sure it's a memory issue, so it's difficult to reproduce consistently -- it will occur in a fresh R session, but then not when the code is run again in the same session, for example. See if the code below works for you, though, using this data as "test.csv".

# Unnecessary packages
library(tidyverse)
library(lubridate)
library(gganimate)

# Necessary packages
library(stringi)
library(vroom)

data <- vroom("test.csv", col_names = F)
stri_trans_tolower(data$X1)

For me, this consistently gives the following output:

Error in stri_trans_tolower(data$X1) : 
  embedded nul in string: 'pfnin2831643\0v'
Execution halted

Clues:

I'm using R 3.5.3 on macOS Mojave, with the latest GitHub version of stringi and the latest CRAN versions of all the other packages.

gagolews commented 5 years ago

Can you give me the details on your devtools::session_info() when running Rscript ? Just add print(devtools::session_info()) in the .R script right before the supposedly buggy call

ltrgoddard commented 5 years ago

Sure -- here you go.

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 3.5.3 (2019-03-11)
 os       macOS Mojave 10.14.5
 system   x86_64, darwin15.6.0
 ui       X11
 language (EN)
 collate  en_GB.UTF-8
 ctype    en_GB.UTF-8
 tz       Europe/London
 date     2019-06-14

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version    date       lib source
 assertthat    0.2.1      2019-03-21 [1] CRAN (R 3.5.2)
 backports     1.1.4      2019-04-10 [1] CRAN (R 3.5.2)
 broom         0.5.2      2019-04-07 [1] CRAN (R 3.5.2)
 callr         3.2.0      2019-03-15 [1] CRAN (R 3.5.2)
 cellranger    1.1.0      2016-07-27 [1] CRAN (R 3.5.0)
 cli           1.1.0      2019-03-19 [1] CRAN (R 3.5.2)
 colorspace    1.4-1      2019-03-18 [1] CRAN (R 3.5.2)
 crayon        1.3.4      2017-09-16 [1] CRAN (R 3.5.0)
 desc          1.2.0      2018-05-01 [1] CRAN (R 3.5.0)
 devtools      2.0.2      2019-04-08 [1] CRAN (R 3.5.2)
 digest        0.6.19     2019-05-20 [1] CRAN (R 3.5.3)
 dplyr       * 0.8.0.1    2019-02-15 [1] CRAN (R 3.5.2)
 farver        1.1.0      2018-11-20 [1] CRAN (R 3.5.0)
 forcats     * 0.4.0      2019-02-17 [1] CRAN (R 3.5.2)
 fs            1.2.7      2019-03-19 [1] CRAN (R 3.5.2)
 generics      0.0.2      2018-11-29 [1] CRAN (R 3.5.0)
 gganimate   * 1.0.3      2019-04-02 [1] CRAN (R 3.5.2)
 ggplot2     * 3.1.1      2019-04-07 [1] CRAN (R 3.5.2)
 glue          1.3.1      2019-03-12 [1] CRAN (R 3.5.2)
 gtable        0.3.0      2019-03-25 [1] CRAN (R 3.5.2)
 haven         2.1.0      2019-02-19 [1] CRAN (R 3.5.2)
 hms           0.4.2      2018-03-10 [1] CRAN (R 3.5.0)
 httr          1.4.0      2018-12-11 [1] CRAN (R 3.5.0)
 jsonlite      1.6        2018-12-07 [1] CRAN (R 3.5.0)
 lattice       0.20-38    2018-11-04 [1] CRAN (R 3.5.3)
 lazyeval      0.2.2      2019-03-15 [1] CRAN (R 3.5.2)
 lubridate   * 1.7.4      2018-04-11 [1] CRAN (R 3.5.0)
 magrittr      1.5        2014-11-22 [1] CRAN (R 3.5.0)
 memoise       1.1.0      2017-04-21 [1] CRAN (R 3.5.0)
 modelr        0.1.4      2019-02-18 [1] CRAN (R 3.5.2)
 munsell       0.5.0      2018-06-12 [1] CRAN (R 3.5.0)
 nlme          3.1-137    2018-04-07 [1] CRAN (R 3.5.3)
 pillar        1.4.0      2019-05-11 [1] CRAN (R 3.5.3)
 pkgbuild      1.0.3      2019-03-20 [1] CRAN (R 3.5.2)
 pkgconfig     2.0.2      2018-08-16 [1] CRAN (R 3.5.0)
 pkgload       1.0.2      2018-10-29 [1] CRAN (R 3.5.0)
 plyr          1.8.4      2016-06-08 [1] CRAN (R 3.5.0)
 prettyunits   1.0.2      2015-07-13 [1] CRAN (R 3.5.0)
 processx      3.3.0      2019-03-10 [1] CRAN (R 3.5.2)
 progress      1.2.2      2019-05-16 [1] CRAN (R 3.5.2)
 ps            1.3.0      2018-12-21 [1] CRAN (R 3.5.0)
 purrr       * 0.3.2      2019-03-15 [1] CRAN (R 3.5.2)
 R6            2.4.0      2019-02-14 [1] CRAN (R 3.5.2)
 Rcpp          1.0.1      2019-03-17 [1] CRAN (R 3.5.2)
 readr       * 1.3.1      2018-12-21 [1] CRAN (R 3.5.0)
 readxl        1.3.1.9000 2019-05-20 [1] Github (tidyverse/readxl@15fde89)
 remotes       2.0.4      2019-04-10 [1] CRAN (R 3.5.2)
 rlang         0.3.4.9003 2019-05-10 [1] Github (r-lib/rlang@6a232c0)
 rprojroot     1.3-2      2018-01-03 [1] CRAN (R 3.5.0)
 rstudioapi    0.10       2019-03-19 [1] CRAN (R 3.5.2)
 rvest         0.3.3      2019-04-11 [1] CRAN (R 3.5.2)
 scales        1.0.0      2018-08-09 [1] CRAN (R 3.5.0)
 sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 3.5.0)
 stringi     * 1.4.4      2019-06-12 [1] Github (gagolews/stringi@6e020c2)
 stringr     * 1.4.0      2019-02-10 [1] CRAN (R 3.5.2)
 testthat      2.1.1      2019-04-23 [1] CRAN (R 3.5.2)
 tibble      * 2.1.1      2019-03-16 [1] CRAN (R 3.5.2)
 tidyr       * 0.8.3.9000 2019-05-10 [1] Github (tidyverse/tidyr@8392cb1)
 tidyselect    0.2.5      2018-10-11 [1] CRAN (R 3.5.0)
 tidyverse   * 1.2.1      2017-11-14 [1] CRAN (R 3.5.0)
 tweenr        1.0.1      2018-12-14 [1] CRAN (R 3.5.0)
 usethis       1.5.0      2019-04-07 [1] CRAN (R 3.5.2)
 vctrs         0.1.0.9003 2019-05-20 [1] Github (r-lib/vctrs@b1e6b81)
 vroom       * 1.0.0      2019-05-04 [1] CRAN (R 3.5.2)
 withr         2.1.2      2018-03-15 [1] CRAN (R 3.5.0)
 xml2          1.2.0      2018-01-24 [1] CRAN (R 3.5.0)
 zeallot       0.1.0      2018-01-28 [1] CRAN (R 3.5.0)
gagolews commented 5 years ago

Tried hard but I couldn't reproduce the error.

Does the same happen when you use the alternative to tolower, i.e., stringi::stri_trans_general("TEST StRiNG", "lower")?

ltrgoddard commented 5 years ago

Looks like it's some crazy edge case then! Using stri_trans_general doesn't result in an error.

giocomai commented 3 years ago

I think I have stumbled upon the same issue, and since it happens with a dataset I can share, I post it here in case it may be of help, even if I'd be prone to blame vroom.

By "the same issue", I mean an error that happens only when using vroom and stringi::stri_trans_tolower(). I tried it in different sessions and on different computers, and I always get an error for one of the first lines, but not always the same line.

Error in stringi::stri_trans_tolower(.) : embedded nul in string:

To clarify: if I restart or change computer, then I'd still get the error, but with a different line, which makes me think the problem is not fully with the data itself.

Also, I don't get any error if I use readr::read_csv, or if I save the same dataset with just the first hundred lines.

(the dataset is a 65MB csv of press releases of the Kremlin with metadata, originally distributed under a CC-BY license)

vroom::vroom("https://testzone.giorgiocomai.eu/vroom_stringi_dataset/president_ru-en_corpus.csv") %>%
  dplyr::pull(text) %>%
  stringi::stri_trans_tolower() %>%
  head(100)

The csv is originally written with readr, the contents are parsed from the web via xml2 and rvest, then slightly processed via stringr, and passed through an iconv(to = "UTF-8") to be on the safe side.

gagolews commented 3 years ago

Just use classic read.csv() to read the CSV file in its entirety.

x <- read.csv("president_ru-en_corpus.csv")
y <- dplyr::pull(x, text)
z <- stringi::stri_trans_tolower(y)

Works like a charm.

It might be an issue with the vroom package, I'd recommending exploring this further and notifying the package maintainer.

jimhester commented 3 years ago

Can this issue be re-opened? I am pretty confident this is actually an issue with stringi.

stringi is making an unsafe assumption that const char* returned from CHAR() are valid as long as the STRSXP which contains them is protected. While this has been historically true for normal CHARSXP vectors is not true for ALTREP CHARSXP vectors. Normal vectors are implicitly protected by their STRSXP containers, so are stable as long as the STSXP is not garbage collected. But ALTREP CHARSXP returned from STRING_ELT() don't explicitly exist in a STRSXP, so R is free to collect them if they are not protected.

To illustrate this is the root of the problem the below diff fixes @giocomai's issue, copying the data if the vector is an ALTREP vector.

I imagine this assumption is found elsewhere in the stringi codebase, but @gagolews would know for sure.

diff --git a/src/stri_container_utf8.cpp b/src/stri_container_utf8.cpp
index 4f26ed3a..2405db8f 100644
--- a/src/stri_container_utf8.cpp
+++ b/src/stri_container_utf8.cpp
@@ -99,14 +99,15 @@ StriContainerUTF8::StriContainerUTF8(SEXP rstr, R_len_t _nrecycle, bool _shallow
         if (curs == NA_STRING) {
             continue; // keep NA
         }
+        bool should_copy = ALTREP(rstr) == 1;

         if (IS_ASCII(curs)) {
             // ASCII - ultra fast
-            this->str[i].initialize(CHAR(curs), LENGTH(curs), false/*!_shallowrecycle*/, false/*killbom*/, true/*isASCII*/);
+            this->str[i].initialize(CHAR(curs), LENGTH(curs), should_copy/*!_shallowrecycle*/, false/*killbom*/, true/*isASCII*/);
         }
         else if (IS_UTF8(curs)) {
             // UTF-8 - ultra fast
-            this->str[i].initialize(CHAR(curs), LENGTH(curs), false/*!_shallowrecycle*/, true/*killbom*/, false/*isASCII*/);
+            this->str[i].initialize(CHAR(curs), LENGTH(curs), should_copy/*!_shallowrecycle*/, true/*killbom*/, false/*isASCII*/);
             // the same is done for native encoding && ucnvNative_isUTF8
             // @TODO: use macro (here & ucnvNative_isUTF8 below)
         }
@@ -128,7 +129,7 @@ StriContainerUTF8::StriContainerUTF8(SEXP rstr, R_len_t _nrecycle, bool _shallow
                     // UTF-8 - ultra fast
                     // @TODO: use macro
                     this->str[i].initialize(CHAR(curs), LENGTH(curs),
-                                            false /*!_shallowrecycle*/, true/*killbom*/, false/*isASCII*/);
+                                            should_copy /*!_shallowrecycle*/, true/*killbom*/, false/*isASCII*/);
                     continue;
                 } 
jimhester commented 3 years ago

As some further evidence that this is not an issue with vroom itself you can exhibit the same behavior saving the data with the qs package, which also uses string ALTREP vectors e.g.

x <- read.csv("https://testzone.giorgiocomai.eu/vroom_stringi_dataset/president_ru-en_corpus.csv")
qs::qsave(x, "~/data/corpus.qs")

# Then start a new R session and run the following, it will likely crash or give you embedded null errors.
y <- qs::qread("~/data/corpus.qs", use_alt_rep = TRUE)
y <- stringi::stri_trans_tolower(y[["text"]])
gagolews commented 3 years ago

Yes, I should definitely fix this.

I see that ALTREP is very badly documented in https://cran.r-project.org/doc/manuals/r-devel/R-ints.html - why is that the case?

gagolews commented 3 years ago

Is https://svn.r-project.org/R/branches/ALTREP/ALTREP.html the ultimate info source on this?

jimhester commented 3 years ago

Yes, that vignette is probably the best source of information apart from the R sources themselves.

gagolews commented 3 years ago

[note to self] Before:

# cd /tmp; wget https://testzone.giorgiocomai.eu/vroom_stringi_dataset/president_ru-en_corpus.csv
Rscript -e 'qs::qsave(read.csv("president_ru-en_corpus.csv"), "/tmp/corpus.qs")'
Rscript -e 'stringi::stri_trans_tolower(qs::qread("/tmp/corpus.qs", use_alt_rep = TRUE)[["text"]])'
##Error in stringi::stri_trans_tolower(qs::qread("/tmp/corpus.qs", use_alt_rep = TRUE)[["text"]]) : 
##  embedded nul in string: 'vladimir putin arrived in turkmenistan\n\nmay 19, 2000, ashgabat \n\nbefore the official events began, 
## Execution halted
gagolews commented 3 years ago

[note to self] ALTREP is R >= 3.5.0, need dummy ALTREP macro otherwise

gagolews commented 3 years ago

This has now been fixed. Thanks @ltrgoddard for filing this issue and @jimhester for a workaround.

Rscript -e 'invisible(stringi::stri_trans_tolower(qs::qread("/tmp/corpus.qs", use_alt_rep = TRUE)[["text"]]))'
## clean.