cloudyr / aws.s3

Amazon Simple Storage Service (S3) API Client
https://cloud.r-project.org/package=aws.s3

Reading larger files from S3 #170

Closed: daswesen closed this issue 6 years ago

daswesen commented 7 years ago

This probably is not a bug but expected behavior, but I was wondering how to read larger files from S3. This is the error message I get for a 3 GB file:

> csvcharobj <- rawToChar(obj)
Error in rawToChar(obj) : long vectors not supported yet: raw.c:68

leeper commented 7 years ago

What code did you use to call that? And can you give me the full output of sessionInfo()?

daswesen commented 7 years ago

Thanks for the quick response. My code:

library("aws.s3") Sys.setenv("AWS_ACCESS_KEY_ID" = " keyA", "AWS_SECRET_ACCESS_KEY" = "keyB") obj <-get_object("s3://myBucketName/aFolder/fileName.csv")
csvcharobj <- rawToChar(obj)
con <- textConnection(csvcharobj)
data <- read.csv(file = con)

sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] aws.s3_0.3.3

loaded via a namespace (and not attached):
 [1] httr_1.3.1          compiler_3.4.1      R6_2.2.2
 [4] tools_3.4.1         base64enc_0.1-3     curl_2.8.1
 [7] Rcpp_0.12.12        aws.signature_0.3.5 xml2_1.1.1
[10] knitr_1.17          digest_0.6.12

leeper commented 7 years ago

You don't need to convert to character to do this. You could just pass the response from get_object() to rawConnection(). Alternatively, use save_object() to save it locally and then use an efficient importing function like readr::read_csv() or data.table::fread().
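
For the on-disk route, a minimal sketch (bucket and key names are placeholders; the in-memory rawConnection() route is worked through in the comments below):

library("aws.s3")

# Download the object to a temporary file, then parse it with a fast reader.
tmp <- tempfile(fileext = ".csv")
save_object("s3://myBucketName/aFolder/fileName.csv", file = tmp)
data <- data.table::fread(tmp)   # or readr::read_csv(tmp)
unlink(tmp)                      # remove the local copy once parsed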

daswesen commented 7 years ago

Thanks. I want to keep the data in RAM rather than creating a local copy on disk. Unfortunately, I'm still struggling to work with rawConnection() after reading the help several times, so I apologize for asking for your help here.

This command:

obj <- get_object("s3://myBucketName/aFolder/fileName.csv")
con <- rawConnection(obj, "r")
data2 <- read.table(con, sep = ",", header = TRUE)

has this result:

Error in pushBack(c(lines, lines), file, encoding = pbEncoding) :
  can only push back on text-mode connections

I managed to create a large character vector by using

data <- readLines(con)

but I haven't managed to get this into a data frame without R capitulating; in fact, it is even faster to download the file from S3 manually and then load it into R. I guess that's not the way it should be :-)

I will be looking at how to use Apache Spark for this in the future, but I was thinking that I should be able to do this on a machine with lots of RAM, given that the file is not too big.

Your help would be much appreciated!

leeper commented 7 years ago

That code looks like it should work. Perhaps try:

obj <- get_object("s3://myBucketName/aFolder/fileName.csv")
data2 <- read.csv(rawConnection(obj))

Another option that might be easier is the s3read_using() function:

s3read_using(FUN = read.csv, object = "s3://myBucketName/aFolder/fileName.csv")

daswesen commented 7 years ago

data2 <- read.csv(rawConnection(obj))
Error in pushBack(c(lines, lines), file, encoding = pbEncoding) :
  can only push back on text-mode connections

s3read_using(FUN = read.csv, object = "s3://myBucketName/aFolder/fileName.csv")
Error in writeBin(httr::content(r, as = "raw"), con = file) :
  long vectors not supported yet: connections.c:4123
In addition: Warning message:
In inherits(x, "factor") : closing unused connection 3 (obj)

It works great with smaller files but not with files of that size, unfortunately :-/ I'll check whether I can solve this by reading it directly into a Spark cluster. Thanks for your help though!

leeper commented 7 years ago

Seems it might be too large to work with R's in-memory connections infrastructure. Sorry.
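
(A note for later readers: the ceiling here is R's 2^31 - 1 byte limit on a single character string and on what one writeBin() call can handle. Below is an untested sketch that stays under it by pulling the object in S3 byte-range requests through the headers argument of get_object(); the chunk size, bucket, and key are placeholders, and the loop assumes the file size is not an exact multiple of the chunk size.)

out <- file(tmp <- tempfile(fileext = ".csv"), open = "wb")
chunk  <- 500 * 1024^2   # 500 MB per request, safely under 2^31 - 1 bytes
offset <- 0
repeat {
  rng  <- sprintf("bytes=%.0f-%.0f", offset, offset + chunk - 1)
  part <- get_object("s3://myBucketName/aFolder/fileName.csv",
                     headers = list(Range = rng))
  writeBin(part, out)              # append this chunk to the local file
  offset <- offset + length(part)
  if (length(part) < chunk) break  # short read means end of file reached
}
close(out)
data <- data.table::fread(tmp)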

amarchin commented 6 years ago

I get a similar issue when I try to write a data frame to S3:

Error in writeBin(httr::content(r, as = "raw"), con = file) :
  long vectors not supported yet: ../../../../R-3.4.3/src/main/connections.c:4147
Called from: writeBin(httr::content(r, as = "raw"), con = file)

leeper commented 6 years ago

I think the consensus from the httr side of things is to write to disk. There probably isn't a way around it without low-level changes to R. See https://github.com/r-lib/httr/issues/44
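
(For the write direction, the same disk-staging pattern applies; a sketch, where big_df is a placeholder data frame:)

tmp <- tempfile(fileext = ".csv")
data.table::fwrite(big_df, tmp)    # serialize locally first
put_object(file = tmp, object = "aFolder/fileName.csv",
           bucket = "myBucketName")
# For objects above the 5 GB single-PUT limit, newer aws.s3 releases
# also offer put_object(..., multipart = TRUE).
unlink(tmp)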

Christian-Onyango commented 1 year ago

@daswesen did you ever figure it out?