Cellular-Longevity / cmapR

Tools for manipulating annotated data matrices
BSD 3-Clause "New" or "Revised" License

Enable data subsetting directly from S3 #2

Open bsiranosian opened 2 years ago

bsiranosian commented 2 years ago

Data subsetting when reading directly from S3 does not currently work when implemented like this:

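library(aws.s3)
library(cmapR)  # provides parse_gctx()
object.loc <- "s3://bioinformatics-loyal/processed_methylation_data/HEALTHSPAN/GH40_RRBS/matrices_processed/methylation_filtered.gctx"
# s3read_using() saves the object to a temp file and passes that local path to FUN
mgct <- s3read_using(FUN = function(x) parse_gctx(x, rid = 1), object = object.loc)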

Instead, the whole file is downloaded to a temp directory, and a portion of it is read from there.

This should be possible as rhdf5 supports read-only access to files in S3: https://www.bioconductor.org/packages/devel/bioc/vignettes/rhdf5/inst/doc/rhdf5_cloud_reading.html

However, I'm currently hit with the error described here, and haven't gone any further: https://support.bioconductor.org/p/9134972/
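
For reference, here is a rough sketch of what the rhdf5 route could look like, based on the vignette linked above. It is untested against our bucket. The helper name `read_gctx_subset` is hypothetical, it assumes the standard GCTX layout with the matrix at `0/DATA/0/matrix`, and it only takes integer indices (no ID lookup, no row/column metadata). rhdf5 expects an HTTPS-style URL rather than an `s3://` URI, and the region in the commented usage below is a placeholder.

```r
library(rhdf5)

# Hypothetical helper: read a slice of the GCTX data matrix without parsing
# the rest of the file. Assumes the standard GCTX layout (0/DATA/0/matrix).
# rid / cid are integer index vectors; NULL means "all" along that dimension.
read_gctx_subset <- function(file, rid = NULL, cid = NULL,
                             s3 = FALSE, s3credentials = NULL) {
  rhdf5::h5read(
    file,
    name = "0/DATA/0/matrix",
    index = list(rid, cid),
    s3 = s3,                       # TRUE asks rhdf5 to read over S3
    s3credentials = s3credentials  # needed for private buckets, see vignette
  )
}

# Usage sketch (not run). The URL form and <region> are assumptions;
# credentials are a list of region, access key ID, secret access key.
# url <- "https://bioinformatics-loyal.s3.<region>.amazonaws.com/processed_methylation_data/HEALTHSPAN/GH40_RRBS/matrices_processed/methylation_filtered.gctx"
# creds <- list("<region>", Sys.getenv("AWS_ACCESS_KEY_ID"),
#               Sys.getenv("AWS_SECRET_ACCESS_KEY"))
# m <- read_gctx_subset(url, rid = 1, s3 = TRUE, s3credentials = creds)
```

The same function works on a local file with `s3 = FALSE`, which makes it easy to check the indexing before pointing it at S3.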

DavidTingley commented 2 years ago

Note that I'll be looking into this for Python as well!

DavidTingley commented 2 years ago

An alternative we could look into:

https://pytorch.org/data/0.3.0/generated/torchdata.datapipes.iter.OnlineReader.html#torchdata.datapipes.iter.OnlineReader

It doesn't currently look like they have S3 support, though?

MurphyMarkW commented 2 years ago

Can help out with this, but it's been almost a decade since I've written anything in R.

I have a couple of tasks I need to work out by EOW, but I'll take a shot at implementing S3 subsetting (preferably in Python, but I can figure it out for R) if it's not implemented in our preferred solution. There are a number of libraries I've encountered like this where it's just not really there. :\

bsiranosian commented 2 years ago

No worries if you don't have a good immediate solution - just tagged you for visibility since I thought you might have a good idea. I believe we're facing the same issue with the Python implementation as well.