fstpackage / fst

Lightning Fast Serialization of Data Frames for R
http://www.fstpackage.org/fst/
GNU Affero General Public License v3.0

reading from (public bucket S3) #162

Open womta opened 5 years ago

womta commented 5 years ago

I have a file in a public AWS S3 bucket, https://s3-eu-west-1.amazonaws.com/stratominerdata/developer.stratominer.com/processedData/testingS3/default/testingS3_raw_plate_10_rep_1.fst, which I can download from anywhere.

But when I try to load it into memory directly from the link, it fails. Why? read.fst('https://s3-eu-west-1.amazonaws.com/stratominerdata/developer.stratominer.com/processedData/testingS3/default/testingS3_raw_plate_10_rep_1.fst')

ERROR: Error opening fst file for reading, please check access rights and file availability

I have an equivalent file in CSV format, and that one can be read in directly with data.table::fread('https://s3-eu-west-1.amazonaws.com/stratominerdata/developer.stratominer.com/rawData/testingS3/default/file01.txt')

MarcusKlik commented 5 years ago

Hi @womta, thanks for reporting the issue!

Yes, fst cannot easily access an online file directly, mostly because it requires random access during reading. When an https URL is specified, the C++ backend fails as soon as it tries to open the file.

The data.table package is able to open the file because it first downloads the file to a temp location and then reads it from there (see this code). The curl::curl_download function is used for the download. You could simulate this with your fst file:

# download the remote file to a temporary location (binary mode)
tmp_file <- tempfile()
fst_url <- "https://s3-eu-west-1.amazonaws.com/stratominerdata/developer.stratominer.com/processedData/testingS3/default/testingS3_raw_plate_10_rep_1.fst"
curl::curl_download(fst_url, tmp_file, mode = "wb")
# read the local copy
ft <- fst::read_fst(tmp_file)

Obviously, you won't have random access to the remote file (the whole file is downloaded first), but you can still get the data.

I think a better error message would be useful when the user specifies an https URL as the path. Alternatively, for complete loads, fst could use the same method as data.table and do a full download with the curl package first. What do you think?
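As a rough illustration of the second option, here is a minimal sketch of a wrapper that downloads first and then reads. The names is_remote_path() and read_fst_remote() are hypothetical and not part of fst; only curl::curl_download() and fst::read_fst() are existing functions.

```r
# Hypothetical helper (not part of fst): TRUE when `path` looks like an http(s) URL.
is_remote_path <- function(path) {
  grepl("^https?://", path, ignore.case = TRUE)
}

# Hypothetical wrapper (not part of fst): download a remote file to a temporary
# location first, then read the local copy. Extra arguments are passed on to
# fst::read_fst().
read_fst_remote <- function(path, ...) {
  if (!is_remote_path(path)) {
    # local paths are read directly, with normal random access
    return(fst::read_fst(path, ...))
  }
  tmp_file <- tempfile(fileext = ".fst")
  on.exit(unlink(tmp_file), add = TRUE)  # remove the temporary copy afterwards
  curl::curl_download(path, tmp_file, mode = "wb")
  fst::read_fst(tmp_file, ...)
}
```

Calling read_fst_remote(fst_url) with the URL above should then behave like the four-line snippet, at the cost of always transferring the whole file, even when only a few columns are requested.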

womta commented 5 years ago

Hi @MarcusKlik

Thank you for the fast reply. Is random access still required when the arguments 'columns', 'from' and 'to' are not specified in read_fst(), i.e. when it reads the full file anyway? Or does it still use random access in that case?

Downloading the file first, instead of trying to open the URL and failing, seems like a nice improvement to the package...

Right now I use the aws.s3 package to store the file locally with save_object().
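For reference, that workaround might look roughly like the sketch below. split_s3_url() and read_fst_s3() are hypothetical helpers, the URL parsing only handles path-style URLs like the one in this thread, and passing `region` through save_object() to the underlying request is an assumption about aws.s3.

```r
# Hypothetical helper: split a path-style S3 URL of the form
# https://<host>/<bucket>/<key> into bucket and key.
# Other S3 URL styles are not handled.
split_s3_url <- function(url) {
  path <- sub("^https?://[^/]+/", "", url)          # drop scheme and host
  parts <- strsplit(path, "/", fixed = TRUE)[[1]]
  list(bucket = parts[1], key = paste(parts[-1], collapse = "/"))
}

# Hypothetical wrapper: store the object locally with aws.s3::save_object(),
# then read the local copy with fst::read_fst().
read_fst_s3 <- function(url, region, ...) {
  loc <- split_s3_url(url)
  tmp_file <- tempfile(fileext = ".fst")
  on.exit(unlink(tmp_file), add = TRUE)
  # assumption: `region` is forwarded by save_object() to the S3 request
  aws.s3::save_object(object = loc$key, bucket = loc$bucket,
                      file = tmp_file, region = region)
  fst::read_fst(tmp_file, ...)
}
```

For the file above, the bucket would be "stratominerdata" and the region presumably "eu-west-1", judging by the host name.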

I noticed that s3:// links do work with fst, but with larger files they sometimes give the exact same error...

MarcusKlik commented 5 years ago

Hi @womta, you're right, when all columns and rows are loaded, the fst file is read completely sequentially, so it could be downloaded in the same manner as fread() does. That would be a nice improvement to the package.
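To make the distinction concrete, here is a small local example (the data frame and file are made up for illustration). The first read loads everything, the second uses 'columns', 'from' and 'to' and therefore depends on random access:

```r
library(fst)

# made-up example data, written to a temporary fst file
df <- data.frame(id = 1:10, value = (1:10) / 2, label = letters[1:10])
path <- tempfile(fileext = ".fst")
write_fst(df, path)

# full read: all columns and all rows
full <- read_fst(path)

# partial read: two columns, rows 3 to 6 only
part <- read_fst(path, columns = c("id", "label"), from = 3, to = 6)
```

Only the first kind of read could be served by a plain download-then-read approach without wasting bandwidth; for the second, most of the downloaded bytes would go unused.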

It's interesting to see that objects in S3 buckets can be accessed randomly using the Range specifier (see this code). That means that a cloud-based fst file could in principle be read with random row and column access by leveraging the aws.s3 package. Perhaps that's something to look at for a future enhancement, thanks!
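A ranged read can be sketched with a plain HTTP Range header. Note that this sketch uses the curl package rather than aws.s3, which only works because the bucket is public, and that byte_range() and fetch_bytes() are hypothetical helpers. It only fetches raw bytes; mapping rows and columns to byte offsets in an fst file is not shown.

```r
# Hypothetical helper: build an HTTP Range header value for `length` bytes
# starting at the zero-based `offset` (both ends of the range are inclusive).
byte_range <- function(offset, length) {
  stopifnot(offset >= 0, length > 0)
  sprintf("bytes=%.0f-%.0f", offset, offset + length - 1)
}

# Hypothetical helper: fetch part of a remote file as a raw vector.
fetch_bytes <- function(url, offset, length) {
  h <- curl::new_handle()
  curl::handle_setheaders(h, Range = byte_range(offset, length))
  res <- curl::curl_fetch_memory(url, handle = h)
  # 206 (Partial Content) means the server honoured the Range request
  if (res$status_code != 206) {
    stop("server did not honour the Range request (HTTP status ",
         res$status_code, ")")
  }
  res$content
}
```

For example, fetch_bytes(fst_url, 0, 100) would request the first 100 bytes of the file.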

HugoGit39 commented 2 months ago

fread

Is this still the only way to read a .fst file from a URL?