cloudyr / googleCloudStorageR

Google Cloud Storage API to R
https://code.markedmondson.me/googleCloudStorageR

Load .RDS files directly into environment `gcs_get_object`? #146

Closed samuel-marsh closed 9 months ago

samuel-marsh commented 3 years ago

Hi,

This might be a naive question and I might be missing something, but I'm wondering if there is a way to load a file saved as .RDS from a GCP bucket directly into the local R environment without saving it to disk first?

I have been trying this with objects created by the single-cell analysis package Seurat, which creates S4 class objects (for more on the Seurat object format see https://github.com/mojaveazure/seurat-object and https://github.com/satijalab/seurat/wiki).

When I run:

obj <- gcs_get_object(object_name = "gs://bucket_name/obj.RDS")

It loads into the environment as a "raw" object that is then unreadable by Seurat. If I add saveToDisk = "obj.RDS" and then read the file into R with readRDS (or the wrapper read_rds), it works fine and is readable by Seurat.

I'm wondering whether there is an additional parameter I'm missing that would allow this or, if not, whether this is a feature that could be added?

Thanks! Sam

MarkEdmondson1234 commented 3 years ago

Yes you can supply a custom parse function to load the object directly into R. You would want something like readRDS().

All the downloads write to disk at least temporarily so it's not more efficient, but a lot more convenient:)

samuel-marsh commented 3 years ago

Hi Mark,

Thanks for the quick response. This must be what I'm not quite understanding, because when I run:

obj <- gcs_get_object(object_name = "gs://bucket_name/obj.RDS", parseFunction = readRDS())

I get an error that the parsing failed.
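
One likely contributor to this particular failure, independent of the RDS format: `readRDS()` with parentheses is a call rather than a function, so R evaluates it with no `file` argument as soon as the parser is needed. `parseFunction` expects the function object itself. A minimal offline sketch of the difference, where `apply_parser()` is a hypothetical stand-in for code that invokes a supplied parser (it is not part of googleCloudStorageR):

```r
# Hypothetical stand-in for code that invokes a user-supplied parser.
apply_parser <- function(x, parseFunction) {
  parseFunction(x)
}

# Write a small object to a temporary RDS file so there is something to parse.
path <- tempfile(fileext = ".rds")
saveRDS(list(answer = 42), path)

# Passing the function object works.
ok <- apply_parser(path, readRDS)

# Passing the result of calling readRDS() fails: the call is evaluated
# with no `file` argument when the parser is first used.
bad <- tryCatch(apply_parser(path, readRDS()), error = function(e) e)

unlink(path)
```

Passing `readRDS` on its own is still not enough in this thread's case, because the parser receives an HTTP response rather than a file path, which is what the later replies address.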

Thanks! Sam

MarkEdmondson1234 commented 3 years ago

Sorry I thought this would be simpler but actually the raw RDS response is harder to deal with than I thought. The best I can come up with is a wrapper to saveToDisk then load it which will do what I thought it should do:

my_parse <- function(obj){
     tmp <- tempfile(fileext = ".rds")
     on.exit(unlink(tmp))
     suppressMessages(gcs_get_object(obj, saveToDisk = tmp))
     readRDS(tmp)
 }
obj <- my_parse("gs://bucket_name/obj.RDS")

I will look into whether this can be improved :)

MarkEdmondson1234 commented 3 years ago

Rich Fergie found the right functions for parsing RDS without needing to save to disk for you: https://twitter.com/RichardFergie/status/1385531335423447040

f <- function(obj) {
  readRDS(gzcon(rawConnection(httr::content(obj))))
}
gcs_get_object("obj.rds", parseFunction = f)
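
The decoding chain in `f` can be tried offline. The sketch below swaps the HTTP response for raw bytes read from a local file (an assumption made so the example runs without a bucket); in `f` those bytes would come from `httr::content(obj)`. The name `parse_rds_bytes()` is hypothetical.

```r
# Decode an RDS payload held in memory as a raw vector.
parse_rds_bytes <- function(bytes) {
  con <- gzcon(rawConnection(bytes))  # wrap the bytes and decompress on read
  on.exit(close(con))                 # closes the wrapped raw connection too
  readRDS(con)
}

# Stand-in for a downloaded object: save locally, then read the file as raw bytes.
original <- list(a = 1:3, b = "x")
tmp <- tempfile(fileext = ".rds")
saveRDS(original, tmp)
bytes <- readBin(tmp, what = "raw", n = file.size(tmp))
unlink(tmp)

restored <- parse_rds_bytes(bytes)
```

After the temporary file is deleted, the object is rebuilt from the raw vector alone, which is the step that avoids a second trip through the disk.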

MarkEdmondson1234 commented 3 years ago

I added the function as a helper as it looked useful, so for the GitHub version you can use:

gcs_get_object("obj.rds", parseFunction = gcs_parse_rds)

See ?gcs_parse_rds

samuel-marsh commented 3 years ago

Hey Mark,

Really appreciate your help on this! Unfortunately I'm still getting errors when I try it myself, although the errors differ depending on whether I use the GitHub branch or the CRAN version.

Using the GitHub master branch and running the code below results in the following error:

test <- gcs_get_object(object_name = "gs://bucket_name/exp17.RDS", parseFunction = gcs_parse_rds)
i Downloading exp17_micro.RDSError: Problem parsing the object with supplied parseFunction.
x Downloading exp17_micro.RDS ... failed

If I revert to the CRAN version and use the custom parse function from the global environment, I get the following error messages:

f <- function(obj) {
  readRDS(gzcon(rawConnection(httr::content(obj))))
}

test <- gcs_get_object(object_name = "gs://bucket_name/exp17.RDS", parseFunction = gcs_parse_rds)
Downloaded exp17_micro.RDS
Error in readRDS(gzcon(rawConnection(httr::content(obj)))) : 
  too large a block specified
Error in gcs_get_object(object_name = "gs://stevens_data_marsh/exp17_micro.RDS",  : 
  Problem parsing the object with supplied parseFunction.

For reference the RDS object that I'm testing this with is 2.4GB.

Also including sessionInfo below for reference in case it's helpful!

Thanks again so much for all your help on this and quick response!! Sam

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.3

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] beepr_1.3                 Seurat_3.2.3             
 [3] forcats_0.5.0             stringr_1.4.0            
 [5] dplyr_1.0.5               purrr_0.3.4              
 [7] readr_1.3.1               tidyr_1.1.0              
 [9] tibble_3.0.1              ggplot2_3.3.0            
[11] tidyverse_1.3.0           googleCloudStorageR_0.6.0

loaded via a namespace (and not attached):
  [1] Rtsne_0.15            colorspace_1.4-1      deldir_0.1-28        
  [4] ellipsis_0.3.1        ggridges_0.5.2        fs_1.4.1             
  [7] spatstat.data_1.4-3   rstudioapi_0.11       leiden_0.3.3         
 [10] listenv_0.8.0         remotes_2.1.1         audio_0.1-7          
 [13] ggrepel_0.8.2         lubridate_1.7.8       xml2_1.3.2           
 [16] codetools_0.2-16      splines_3.6.1         polyclip_1.10-0      
 [19] jsonlite_1.6.1        packrat_0.5.0         broom_0.5.6          
 [22] ica_1.0-2             cluster_2.1.0         dbplyr_1.4.3         
 [25] png_0.1-7             uwot_0.1.10           sctransform_0.3.1    
 [28] shiny_1.4.0.2         compiler_3.6.1        httr_1.4.1           
 [31] backports_1.1.7       lazyeval_0.2.2        assertthat_0.2.1     
 [34] Matrix_1.2-18         fastmap_1.0.1         gargle_1.1.0         
 [37] cli_2.4.0             later_1.0.0           htmltools_0.5.1.1    
 [40] tools_3.6.1           rsvd_1.0.3            igraph_1.2.5         
 [43] gtable_0.3.0          glue_1.4.1            reshape2_1.4.4       
 [46] RANN_2.6.1            rappdirs_0.3.1        spatstat_1.64-1      
 [49] Rcpp_1.0.6            scattermore_0.7       cellranger_1.1.0     
 [52] vctrs_0.3.6           nlme_3.1-148          lmtest_0.9-37        
 [55] globals_0.14.0        rvest_0.3.5           mime_0.9             
 [58] miniUI_0.1.1.1        lifecycle_1.0.0       irlba_2.3.3          
 [61] goftest_1.2-2         future_1.21.0         googleAuthR_1.3.1    
 [64] MASS_7.3-51.6         zoo_1.8-8             scales_1.1.1         
 [67] spatstat.utils_1.17-0 hms_0.5.3             promises_1.1.0       
 [70] parallel_3.6.1        RColorBrewer_1.1-2    yaml_2.2.1           
 [73] curl_4.3              gridExtra_2.3         memoise_1.1.0        
 [76] reticulate_1.15       pbapply_1.4-2         rpart_4.1-15         
 [79] stringi_1.4.6         zip_2.0.4             rlang_0.4.10         
 [82] pkgconfig_2.0.3       matrixStats_0.56.0    lattice_0.20-41      
 [85] tensor_1.5            ROCR_1.0-11           patchwork_1.0.0      
 [88] htmlwidgets_1.5.1     cowplot_1.0.0         tidyselect_1.1.0     
 [91] parallelly_1.21.0     RcppAnnoy_0.0.18      plyr_1.8.6           
 [94] magrittr_1.5          R6_2.4.1              generics_0.0.2       
 [97] DBI_1.1.0             mgcv_1.8-31           pillar_1.4.4         
[100] haven_2.3.0           withr_2.2.0           fitdistrplus_1.1-1   
[103] abind_1.4-5           survival_3.1-12       future.apply_1.5.0   
[106] modelr_0.1.8          crayon_1.3.4          KernSmooth_2.23-17   
[109] plotly_4.9.2.1        grid_3.6.1            readxl_1.3.1         
[112] data.table_1.12.8     reprex_0.3.0          digest_0.6.25        
[115] xtable_1.8-4          httpuv_1.5.2          openssl_1.4.1        
[118] munsell_0.5.0         viridisLite_0.3.0     askpass_1.1  

MarkEdmondson1234 commented 3 years ago

Ok cool, seems your RDS is a special case compared to mine ;) May I ask whether the RDS files you are using are "old", in that they were written before R 3.5? A new serialization format was introduced in that release, so I'm just trying to eliminate it as a cause.
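
One way to check this without loading the object, assuming R 3.5.0 or later, is base R's `infoRDS()`, which reports the serialization version of a file already on disk. A small sketch, with temporary files standing in for the real object:

```r
# Write the same object in the older (2) and newer (3) serialization formats.
old_path <- tempfile(fileext = ".rds")
new_path <- tempfile(fileext = ".rds")
saveRDS(1:3, old_path, version = 2)
saveRDS(1:3, new_path, version = 3)

# infoRDS() reads only the header, so it is cheap even for large files.
old_info <- infoRDS(old_path)
new_info <- infoRDS(new_path)
cat("versions:", old_info$version, new_info$version, "\n")

unlink(c(old_path, new_path))
```

The returned list also carries `writer_version` and `min_reader_version`, which show which R release wrote the file.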

MarkEdmondson1234 commented 3 years ago

Could you also issue traceback() after your error to see which function is triggering it?
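
`traceback()` is most convenient at an interactive prompt. When the failure happens inside a script or a wrapper that catches errors, the stack can be recorded at the moment of the error instead. A rough sketch, where `failing_parser()` and its garbage input are hypothetical stand-ins for the real parser and payload:

```r
# A parser that will fail because the bytes are not a valid RDS stream.
failing_parser <- function(bytes) {
  con <- gzcon(rawConnection(bytes))
  on.exit(close(con))
  readRDS(con)
}

captured <- NULL
result <- tryCatch(
  withCallingHandlers(
    failing_parser(charToRaw("this is not an RDS stream")),
    # Runs while the failing frames are still on the stack.
    error = function(e) captured <<- sys.calls()
  ),
  error = function(e) conditionMessage(e)
)

# Print the recorded calls, outermost first.
for (cl in captured) cat(deparse(cl)[1], "\n")
```

The printed calls show which function in the chain raised the error, which is the same information `traceback()` would give interactively.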

MarkEdmondson1234 commented 3 years ago

And I guess writing to disk should work ok?

my_parse <- function(obj){
     tmp <- tempfile(fileext = ".rds")
     on.exit(unlink(tmp))
     suppressMessages(gcs_get_object(obj, saveToDisk = tmp))
     readRDS(tmp)
 }
obj <- my_parse("gs://bucket_name/obj.RDS")

It may be that 2.4GB is just too big for R to decompress.

LukasWallrich commented 3 years ago

FYI: for me, this works with a 10.2GB .RDS file that is saved without compression (with readr::write_rds). So the file size per se, at least, is not the issue. Thanks for implementing this very convenient parser function!
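
For context on the compression difference: base `saveRDS()` gzip-compresses by default, while `readr::write_rds()` writes uncompressed unless asked otherwise. `gzcon()` accepts both because its `allowNonCompressed` argument defaults to `TRUE`. A minimal sketch, using `saveRDS(compress = FALSE)` as a stand-in for the readr call and hypothetical helper names:

```r
# Decode RDS bytes held in memory, compressed or not.
read_rds_bytes <- function(bytes) {
  con <- gzcon(rawConnection(bytes))  # passes uncompressed input through unchanged
  on.exit(close(con))
  readRDS(con)
}

# Read a whole file into a raw vector.
file_bytes <- function(path) readBin(path, what = "raw", n = file.size(path))

x <- data.frame(id = 1:5, value = letters[1:5])
gz_path <- tempfile(fileext = ".rds")
plain_path <- tempfile(fileext = ".rds")
saveRDS(x, gz_path)                       # gzip-compressed (default)
saveRDS(x, plain_path, compress = FALSE)  # uncompressed

from_gz <- read_rds_bytes(file_bytes(gz_path))
from_plain <- read_rds_bytes(file_bytes(plain_path))
unlink(c(gz_path, plain_path))
```

Both payloads decode to the same object, so a difference in behaviour between two files is not explained by the parser rejecting one of the two encodings.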

MarkEdmondson1234 commented 3 years ago

Thanks @LukasWallrich, good to know. I think then @samuel-marsh's rds file must have something unique about it. If it is downloaded locally, debugging where readRDS(gzcon(rawConnection(httr::content(obj)))) goes wrong would be a start.

aldomann commented 2 years ago

Sorry I thought this would be simpler but actually the raw RDS response is harder to deal with than I thought. The best I can come up with is a wrapper to saveToDisk then load it which will do what I thought it should do:

my_parse <- function(obj){
     tmp <- tempfile(fileext = ".rds")
     on.exit(unlink(tmp))
     suppressMessages(gcs_get_object(obj, saveToDisk = tmp))
     readRDS(tmp)
 }
obj <- my_parse("gs://bucket_name/obj.RDS")

I will look at if this can be improved :)

Somewhat unrelated: the save-to-disk wrapper quoted above also works for parsing UTF-16LE CSV files, which I haven't managed to do by just using read.csv(x, fileEncoding = "UTF-16LE") as the parseFunction.
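
The same pattern generalizes to any reader that needs a file path or a specific encoding. A rough offline sketch, where `read_via_tempfile()` is a hypothetical helper and the raw bytes stand in for a downloaded object:

```r
# Write raw bytes to a temporary file, hand the path to `reader`, then clean up.
read_via_tempfile <- function(bytes, reader, ...) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeBin(bytes, tmp)
  reader(tmp, ...)
}

# Build UTF-16LE CSV bytes in memory to play the role of the download.
csv_text <- "id,name\n1,Ada\n2,Linus\n"
utf16_bytes <- iconv(csv_text, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]

df <- read_via_tempfile(
  utf16_bytes, read.csv,
  fileEncoding = "UTF-16LE", stringsAsFactors = FALSE
)
```

Because the reader is passed in as an argument, the same helper covers `readRDS`, `read.csv`, or any other function that takes a path as its first argument.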

MarkEdmondson1234 commented 2 years ago

I forgot to put here that gce_parse_rds() is now in the dev version via this commit https://github.com/cloudyr/googleCloudStorageR/commit/d912d0cf1998ac34d0e7351ac183f66eb4625708

If there are other useful parsing functions I'd be glad to put them in.

lifedeathandtech commented 9 months ago

@MarkEdmondson1234 - I think you might have meant to type gcs_parse_rds().

Thank you so much for your contributions! googleCloudStorageR and googleCloudRunner are incredibly useful tools.

MarkEdmondson1234 commented 9 months ago

Ah yes, that is it: gcs vs gce. It gets confusing sometimes working on the packages at the same time ;) Glad they are helpful!