Closed ltrgoddard closed 3 years ago
Apologies, I'm gonna need a fully reproducible example.
Fair enough. I'm fairly sure it's a memory issue, so it's difficult to reproduce consistently -- it will occur in a fresh R session, but then not when the code is run again in the same session, for example. See if the code below works for you, though, using this data as "test.csv".
# Unnecessary packages
library(tidyverse)
library(lubridate)
library(gganimate)
# Necessary packages
library(stringi)
library(vroom)
data <- vroom("test.csv", col_names = F)
stri_trans_tolower(data$X1)
For me, this consistently gives the following output:
Error in stri_trans_tolower(data$X1) :
embedded nul in string: 'pfnin2831643\0v'
Execution halted
Clues:
Rscript
, not from RStudio;library()
calls, the string given in the error message changes, hence the assumption that it's a memory issue (and perhaps one that's upstream of stringi);I'm using R 3.5.3 on macOS Mojave, with the latest GitHub version of stringi and the latest CRAN versions of all the other packages.
Can you give me the details on your devtools::session_info()
when running Rscript
? Just add print(devtools::session_info())
in the .R
script right before the supposedly buggy call
Sure -- here you go.
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 3.5.3 (2019-03-11)
os macOS Mojave 10.14.5
system x86_64, darwin15.6.0
ui X11
language (EN)
collate en_GB.UTF-8
ctype en_GB.UTF-8
tz Europe/London
date 2019-06-14
─ Packages ───────────────────────────────────────────────────────────────────
package * version date lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.5.2)
backports 1.1.4 2019-04-10 [1] CRAN (R 3.5.2)
broom 0.5.2 2019-04-07 [1] CRAN (R 3.5.2)
callr 3.2.0 2019-03-15 [1] CRAN (R 3.5.2)
cellranger 1.1.0 2016-07-27 [1] CRAN (R 3.5.0)
cli 1.1.0 2019-03-19 [1] CRAN (R 3.5.2)
colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.5.2)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.0)
desc 1.2.0 2018-05-01 [1] CRAN (R 3.5.0)
devtools 2.0.2 2019-04-08 [1] CRAN (R 3.5.2)
digest 0.6.19 2019-05-20 [1] CRAN (R 3.5.3)
dplyr * 0.8.0.1 2019-02-15 [1] CRAN (R 3.5.2)
farver 1.1.0 2018-11-20 [1] CRAN (R 3.5.0)
forcats * 0.4.0 2019-02-17 [1] CRAN (R 3.5.2)
fs 1.2.7 2019-03-19 [1] CRAN (R 3.5.2)
generics 0.0.2 2018-11-29 [1] CRAN (R 3.5.0)
gganimate * 1.0.3 2019-04-02 [1] CRAN (R 3.5.2)
ggplot2 * 3.1.1 2019-04-07 [1] CRAN (R 3.5.2)
glue 1.3.1 2019-03-12 [1] CRAN (R 3.5.2)
gtable 0.3.0 2019-03-25 [1] CRAN (R 3.5.2)
haven 2.1.0 2019-02-19 [1] CRAN (R 3.5.2)
hms 0.4.2 2018-03-10 [1] CRAN (R 3.5.0)
httr 1.4.0 2018-12-11 [1] CRAN (R 3.5.0)
jsonlite 1.6 2018-12-07 [1] CRAN (R 3.5.0)
lattice 0.20-38 2018-11-04 [1] CRAN (R 3.5.3)
lazyeval 0.2.2 2019-03-15 [1] CRAN (R 3.5.2)
lubridate * 1.7.4 2018-04-11 [1] CRAN (R 3.5.0)
magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.0)
memoise 1.1.0 2017-04-21 [1] CRAN (R 3.5.0)
modelr 0.1.4 2019-02-18 [1] CRAN (R 3.5.2)
munsell 0.5.0 2018-06-12 [1] CRAN (R 3.5.0)
nlme 3.1-137 2018-04-07 [1] CRAN (R 3.5.3)
pillar 1.4.0 2019-05-11 [1] CRAN (R 3.5.3)
pkgbuild 1.0.3 2019-03-20 [1] CRAN (R 3.5.2)
pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.5.0)
pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.5.0)
plyr 1.8.4 2016-06-08 [1] CRAN (R 3.5.0)
prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.5.0)
processx 3.3.0 2019-03-10 [1] CRAN (R 3.5.2)
progress 1.2.2 2019-05-16 [1] CRAN (R 3.5.2)
ps 1.3.0 2018-12-21 [1] CRAN (R 3.5.0)
purrr * 0.3.2 2019-03-15 [1] CRAN (R 3.5.2)
R6 2.4.0 2019-02-14 [1] CRAN (R 3.5.2)
Rcpp 1.0.1 2019-03-17 [1] CRAN (R 3.5.2)
readr * 1.3.1 2018-12-21 [1] CRAN (R 3.5.0)
readxl 1.3.1.9000 2019-05-20 [1] Github (tidyverse/readxl@15fde89)
remotes 2.0.4 2019-04-10 [1] CRAN (R 3.5.2)
rlang 0.3.4.9003 2019-05-10 [1] Github (r-lib/rlang@6a232c0)
rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.5.0)
rstudioapi 0.10 2019-03-19 [1] CRAN (R 3.5.2)
rvest 0.3.3 2019-04-11 [1] CRAN (R 3.5.2)
scales 1.0.0 2018-08-09 [1] CRAN (R 3.5.0)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.0)
stringi * 1.4.4 2019-06-12 [1] Github (gagolews/stringi@6e020c2)
stringr * 1.4.0 2019-02-10 [1] CRAN (R 3.5.2)
testthat 2.1.1 2019-04-23 [1] CRAN (R 3.5.2)
tibble * 2.1.1 2019-03-16 [1] CRAN (R 3.5.2)
tidyr * 0.8.3.9000 2019-05-10 [1] Github (tidyverse/tidyr@8392cb1)
tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.5.0)
tidyverse * 1.2.1 2017-11-14 [1] CRAN (R 3.5.0)
tweenr 1.0.1 2018-12-14 [1] CRAN (R 3.5.0)
usethis 1.5.0 2019-04-07 [1] CRAN (R 3.5.2)
vctrs 0.1.0.9003 2019-05-20 [1] Github (r-lib/vctrs@b1e6b81)
vroom * 1.0.0 2019-05-04 [1] CRAN (R 3.5.2)
withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.0)
xml2 1.2.0 2018-01-24 [1] CRAN (R 3.5.0)
zeallot 0.1.0 2018-01-28 [1] CRAN (R 3.5.0)
Tried hard but I couldn't reproduce the error.
Does the same happen when you use the alternative to tolower
, i.e., stringi::stri_trans_general("TEST StRiNG", "lower")
?
Looks like it's some crazy edge case then! Using stri_trans_general
doesn't result in an error.
I think I have stumbled upon the same issue, and since it happens with a dataset I can share, I post it here in case it may be of help, even if I'd be prone to blame vroom
.
By "the same issue", I mean an error that happens only when using vroom
and stringi::stri_trans_tolower()
. I tried it in different sessions and on different computers, and I always get an error for one of the first lines, but not always the same line.
Error in stringi::stri_trans_tolower(.) : embedded nul in string:
To clarify: if I restart or change computer, then I'd still get the error, but with a different line, which makes me think the problem is not fully with the data itself.
Also, I don't get any error if I use readr::read_csv
, or if I save the same dataset with just the first hundred lines.
(the dataset is a 65MB csv of press releases of the Kremlin with metadata, originally distributed under a CC-BY license)
vroom::vroom("https://testzone.giorgiocomai.eu/vroom_stringi_dataset/president_ru-en_corpus.csv") %>%
dplyr::pull(text) %>%
stringi::stri_trans_tolower() %>%
head(100)
The csv is originally written with readr
, the contents are parsed from the web via xml2
and rvest
, then slightly processed via stringr, and passed through an iconv(to = "UTF-8")
to be on the safe side.
Just use classic read.csv()
to read the CSV file in its entirety.
x <- read.csv("president_ru-en_corpus.csv")
y <- dplyr::pull(x, text)
z <- stringi::stri_trans_tolower(y)
Works like a charm.
It might be an issue with the vroom
package, I'd recommending exploring this further and notifying the package maintainer.
Can this issue be re-opened? I am pretty confident this is actually an issue with stringi.
stringi is making an unsafe assumption that const char*
returned from CHAR()
are valid as long as the STRSXP
which contains them is protected. While this has been historically true for normal CHARSXP
vectors is not true for ALTREP CHARSXP
vectors. Normal vectors are implicitly protected by their STRSXP
containers, so are stable as long as the STSXP
is not garbage collected. But ALTREP CHARSXP
returned from STRING_ELT()
don't explicitly exist in a STRSXP
, so R is free to collect them if they are not protected.
To illustrate this is the root of the problem the below diff fixes @giocomai's issue, copying the data if the vector is an ALTREP vector.
I imagine this assumption is found elsewhere in the stringi codebase, but @gagolews would know for sure.
diff --git a/src/stri_container_utf8.cpp b/src/stri_container_utf8.cpp
index 4f26ed3a..2405db8f 100644
--- a/src/stri_container_utf8.cpp
+++ b/src/stri_container_utf8.cpp
@@ -99,14 +99,15 @@ StriContainerUTF8::StriContainerUTF8(SEXP rstr, R_len_t _nrecycle, bool _shallow
if (curs == NA_STRING) {
continue; // keep NA
}
+ bool should_copy = ALTREP(rstr) == 1;
if (IS_ASCII(curs)) {
// ASCII - ultra fast
- this->str[i].initialize(CHAR(curs), LENGTH(curs), false/*!_shallowrecycle*/, false/*killbom*/, true/*isASCII*/);
+ this->str[i].initialize(CHAR(curs), LENGTH(curs), should_copy/*!_shallowrecycle*/, false/*killbom*/, true/*isASCII*/);
}
else if (IS_UTF8(curs)) {
// UTF-8 - ultra fast
- this->str[i].initialize(CHAR(curs), LENGTH(curs), false/*!_shallowrecycle*/, true/*killbom*/, false/*isASCII*/);
+ this->str[i].initialize(CHAR(curs), LENGTH(curs), should_copy/*!_shallowrecycle*/, true/*killbom*/, false/*isASCII*/);
// the same is done for native encoding && ucnvNative_isUTF8
// @TODO: use macro (here & ucnvNative_isUTF8 below)
}
@@ -128,7 +129,7 @@ StriContainerUTF8::StriContainerUTF8(SEXP rstr, R_len_t _nrecycle, bool _shallow
// UTF-8 - ultra fast
// @TODO: use macro
this->str[i].initialize(CHAR(curs), LENGTH(curs),
- false /*!_shallowrecycle*/, true/*killbom*/, false/*isASCII*/);
+ should_copy /*!_shallowrecycle*/, true/*killbom*/, false/*isASCII*/);
continue;
}
As some further evidence that this is not an issue with vroom itself you can exhibit the same behavior saving the data with the qs package, which also uses string ALTREP vectors e.g.
x <- read.csv("https://testzone.giorgiocomai.eu/vroom_stringi_dataset/president_ru-en_corpus.csv")
qs::qsave(x, "~/data/corpus.qs")
# Then start a new R session and run the following, it will likely crash or give you embedded null errors.
y <- qs::qread("~/data/corpus.qs", use_alt_rep = TRUE)
y <- stringi::stri_trans_tolower(y[["text"]])
Yes, I should definitely fix this.
I see that ALTREP is very badly documented in https://cran.r-project.org/doc/manuals/r-devel/R-ints.html - why is that the case?
Is https://svn.r-project.org/R/branches/ALTREP/ALTREP.html the ultimate info source on this?
Yes, that vignette is probably the best source of information apart from the R sources themselves.
[note to self] Before:
# cd /tmp; wget https://testzone.giorgiocomai.eu/vroom_stringi_dataset/president_ru-en_corpus.csv
Rscript -e 'qs::qsave(read.csv("president_ru-en_corpus.csv"), "/tmp/corpus.qs")'
Rscript -e 'stringi::stri_trans_tolower(qs::qread("/tmp/corpus.qs", use_alt_rep = TRUE)[["text"]])'
##Error in stringi::stri_trans_tolower(qs::qread("/tmp/corpus.qs", use_alt_rep = TRUE)[["text"]]) :
## embedded nul in string: 'vladimir putin arrived in turkmenistan\n\nmay 19, 2000, ashgabat \n\nbefore the official events began,
## Execution halted
[note to self] ALTREP is R >= 3.5.0, need dummy ALTREP
macro otherwise
This has now been fixed. Thanks @ltrgoddard for filing this issue and @jimhester for a workaround.
Rscript -e 'invisible(stringi::stri_trans_tolower(qs::qread("/tmp/corpus.qs", use_alt_rep = TRUE)[["text"]]))'
## clean.
I'm getting
embedded nul in string
intermittently fromstri_trans_tolower()
, called viastringr::str_to_lower()
. I can't share a reprex as it's private data, unfortunately, but I'm confident that there isn't actually any null character in the string. It occurs when lowercasing a column of email addresses in a large data frame (>500,000 rows) throughdplyr::mutate()
, almost always at row 180.I think it's probably something to do with vroom's lazy-loading of character columns, as the issue goes away when I ingest the data with
readr::read_tsv()
instead. Opening an issue here first as it's a stringi error, but let me know if it seems to be more on the vroom side.Possibly related to #319?