This probably is not a bug but expected behavior, but I was wondering how to read larger files from S3. This is the error message I get for a 3 GB file:

```
> csvcharobj <- rawToChar(obj)
Error in rawToChar(obj) : long vectors not supported yet: raw.c:68
```
What code did you use to call that? And can you give me the full output of sessionInfo()?
Thanks for the quick response. My code:
library("aws.s3")
Sys.setenv("AWS_ACCESS_KEY_ID" = " keyA",
"AWS_SECRET_ACCESS_KEY" = "keyB")
obj <-get_object("s3://myBucketName/aFolder/fileName.csv")
csvcharobj <- rawToChar(obj)
con <- textConnection(csvcharobj)
data <- read.csv(file = con)
```
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] aws.s3_0.3.3

loaded via a namespace (and not attached):
 [1] httr_1.3.1          compiler_3.4.1      R6_2.2.2
 [4] tools_3.4.1         base64enc_0.1-3     curl_2.8.1
 [7] Rcpp_0.12.12        aws.signature_0.3.5 xml2_1.1.1
[10] knitr_1.17          digest_0.6.12
```
You don't need to convert to character to do this. You could just pass the response from `get_object()` to `rawConnection()`. Alternatively, use `save_object()` to save it locally and then use an efficient importing function like `readr::read_csv()` or `data.table::fread()`.
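For reference, a minimal sketch of the `save_object()` route, reusing the bucket and key from the code above; the temporary-file handling is illustrative, not from the thread:

```r
library(aws.s3)

# Download the object to a temporary file on disk instead of holding the
# raw response in RAM
tmp <- tempfile(fileext = ".csv")
save_object("s3://myBucketName/aFolder/fileName.csv", file = tmp)

# Import with a fast reader, then discard the local copy
data <- data.table::fread(tmp)
unlink(tmp)
```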
Thanks. I want to keep the data in RAM rather than creating a local copy on disk. I am struggling to work with rawConnection() even after reading the help several times, unfortunately, and I apologize for asking for your help here.
This command:

```r
obj <- get_object("s3://myBucketName/aFolder/fileName.csv")
con <- rawConnection(obj, "r")
data2 <- read.table(con, sep = ",", header = TRUE)
```

has this result:

```
Error in pushBack(c(lines, lines), file, encoding = pbEncoding) :
  can only push back on text-mode connections
```
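The error arises because `rawConnection()` opens a binary-mode connection, while `read.table()` and `read.csv()` internally call `pushBack()`, which only works on text-mode connections. A minimal text-mode sketch, reusing the object from above; note that it would still hit the long-vector limit for multi-GB files, as seen earlier in the thread:

```r
obj <- get_object("s3://myBucketName/aFolder/fileName.csv")

# read.csv()/read.table() accept literal text via the `text` argument,
# avoiding the connection entirely; rawToChar() still fails with
# "long vectors not supported" once the object exceeds R's vector limit
data <- read.csv(text = rawToChar(obj))
```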
I managed to create a large character vector by using

```r
data <- readLines(con)
```

but I haven't managed to get this into a data frame without R capitulating; in fact, it is even faster to download the file from S3 manually and load it into R. I guess that's not the way it should be :-)
I will be looking at how to use Apache Spark for this in the future, but I was thinking that I should be able to do this without it on a machine with lots of RAM, given that the file is not too big.
Your help would be much appreciated!
That code looks like it should work. Perhaps try:
```r
obj <- get_object("s3://myBucketName/aFolder/fileName.csv")
data2 <- read.csv(rawConnection(obj))
```
Another option that might be easier is the `s3read_using()` function:

```r
s3read_using(FUN = read.csv, object = "s3://myBucketName/aFolder/fileName.csv")
```
```
> data2 <- read.csv(rawConnection(obj))
Error in pushBack(c(lines, lines), file, encoding = pbEncoding) :
  can only push back on text-mode connections

> s3read_using(FUN = read.csv, object = "s3://myBucketName/aFolder/fileName.csv")
Error in writeBin(httr::content(r, as = "raw"), con = file) :
  long vectors not supported yet: connections.c:4123
In addition: Warning message:
In inherits(x, "factor") : closing unused connection 3 (obj)
```
It works great with smaller files but not with files of that size, unfortunately :-/ I'll check whether I can solve this by reading it directly into a Spark cluster. Thanks for your help though!
Seems it might be too large to work with R's in-memory connections infrastructure. Sorry.
I get a similar issue when I try to write a data frame to S3:
```
Error in writeBin(httr::content(r, as = "raw"), con = file) :
  long vectors not supported yet: ../../../../R-3.4.3/src/main/connections.c:4147
Called from: writeBin(httr::content(r, as = "raw"), con = file)
```
I think the consensus from the httr side of things is to write to disk. There probably isn't a way around it without low-level changes to R. See https://github.com/r-lib/httr/issues/44
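A sketch of the write-to-disk workaround for uploads, assuming a data frame `df` (hypothetical) and the bucket and key from earlier in the thread:

```r
library(aws.s3)

# Serialize the data frame to a temporary file first, then upload the
# file, sidestepping the in-memory connection and its long-vector limit
tmp <- tempfile(fileext = ".csv")
write.csv(df, tmp, row.names = FALSE)  # `df` stands in for your data frame
put_object(file = tmp, object = "aFolder/fileName.csv", bucket = "myBucketName")
unlink(tmp)
```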
@daswesen did you ever figure it out?