Open womta opened 5 years ago
Hi @womta, thanks for reporting the issue!
Yes, fst cannot easily access an online file directly, mostly because it requires random access during reading, and it will fail on trying to open the file (in the C++ backend) when an https URL is specified.
The data.table package is able to open the file because it first downloads the file to a temp location and then reads it from there (see this code). The curl::curl_download method is used to download the file. You could simulate this with your fst file:
tmp_file <- tempfile()
fst_url <- "https://s3-eu-west-1.amazonaws.com/stratominerdata/developer.stratominer.com/processedData/testingS3/default/testingS3_raw_plate_10_rep_1.fst"
curl::curl_download(fst_url, tmp_file, mode="wb")
ft <- fst::read_fst(tmp_file)
Obviously, you won't have random access, but you can still get the data.
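To be precise, random access is only lost with respect to the remote file. Once a copy is on local disk, read_fst() can still load selected columns and row ranges. A minimal sketch, using a small made-up data frame in place of the downloaded file (the column names id and value are assumptions for illustration):

```r
library(fst)

# Stand-in for the downloaded file: write a small example table to a temp file.
tmp_file <- tempfile(fileext = ".fst")
write_fst(data.frame(id = 1:100, value = (1:100) / 10), tmp_file)

# Partial read from the local copy: one column, rows 11 to 20 only.
part <- read_fst(tmp_file, columns = "value", from = 11, to = 20)
```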
I think a better error message would be useful when the user specifies an https URL as the path. Or, for complete loads, the same method could be used as with data.table, so leverage a full download with the curl package. What do you think?
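As a rough sketch of that idea (not something fst provides), a small wrapper could detect a URL, download it to a temp file and then read it locally. The names is_remote_path and read_fst_url are hypothetical and chosen only for this illustration:

```r
# Hypothetical helper: TRUE when the path looks like an http(s) or ftp URL.
is_remote_path <- function(path) {
  grepl("^(https?|ftp)://", path, ignore.case = TRUE)
}

# Hypothetical wrapper, not part of fst: download remote files first,
# then hand the local copy to fst::read_fst().
read_fst_url <- function(path, ...) {
  if (!is_remote_path(path)) {
    return(fst::read_fst(path, ...))
  }
  tmp_file <- tempfile(fileext = ".fst")
  on.exit(unlink(tmp_file), add = TRUE)  # remove the temp copy afterwards
  curl::curl_download(path, tmp_file, mode = "wb")
  fst::read_fst(tmp_file, ...)
}

# Usage (requires network access):
# ft <- read_fst_url(fst_url)
```

Because the whole file is downloaded either way, arguments such as columns, from and to only reduce what is loaded into memory, not what is transferred.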
Hi @MarcusKlik
Thank you for the fast reply. Is random access still required when the arguments 'columns', 'from' and 'to' are not specified in read_fst(), i.e. when it reads the full file anyway? Or does it still use random access?
Retrieving the file first, instead of trying to open the URL and failing, seems like a nice improvement to the package...
Right now I use the aws.s3 package to store the file locally with save_object().
I noticed that s3:// links do work with fst but somehow with larger files it sometimes gives the exact same error...
Hi @womta, you're right, when all columns and rows are loaded, the fst file is read completely sequentially, so it could be downloaded in the same manner as fread() does. That would be a nice improvement to the package.
It's interesting to see that objects in s3 buckets can be accessed randomly using the Range specifier (see this code). That means that a cloud based fst file could in principle be read with random row- and column access by leveraging the aws.s3 package. Perhaps that's something to look at for a future enhancement, thanks!
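For illustration only, this is roughly what a ranged request looks like using the curl package instead of aws.s3. It fetches a byte range into memory and does not parse the fst format; fst cannot currently consume such chunks. The helpers byte_range and fetch_bytes are hypothetical names, and the sketch assumes the server honours HTTP Range requests:

```r
# Hypothetical helper: build a libcurl range string such as "0-1023"
# from a zero-based byte offset and a length in bytes.
byte_range <- function(offset, length) {
  sprintf("%.0f-%.0f", offset, offset + length - 1)
}

# Hypothetical helper: fetch `length` bytes starting at `offset` from a URL.
fetch_bytes <- function(url, offset, length) {
  h <- curl::new_handle()
  curl::handle_setopt(h, range = byte_range(offset, length))
  res <- curl::curl_fetch_memory(url, handle = h)
  # 206 (Partial Content) means the server returned only the requested range.
  if (res$status_code != 206) {
    stop("Server did not return a partial response, status: ", res$status_code)
  }
  res$content  # raw vector with the requested bytes
}

# Usage (requires network access):
# first_kb <- fetch_bytes(fst_url, offset = 0, length = 1024)
```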
Is this still the only method to read a .fst file from a URL?
I have a link in a public repository in AWS S3, https://s3-eu-west-1.amazonaws.com/stratominerdata/developer.stratominer.com/processedData/testingS3/default/testingS3_raw_plate_10_rep_1.fst, which I can download anywhere,
but when I try to load it into memory directly from the link it fails... why?
read.fst('https://s3-eu-west-1.amazonaws.com/stratominerdata/developer.stratominer.com/processedData/testingS3/default/testingS3_raw_plate_10_rep_1.fst')
ERROR: Error opening fst file for reading, please check access rights and file availability
I have an equivalent file in csv which can be read in directly by data.table::fread('https://s3-eu-west-1.amazonaws.com/stratominerdata/developer.stratominer.com/rawData/testingS3/default/file01.txt')