decompress zip to temp file for fread

xhdong-umd commented 7 years ago

This is still about #2

Last time I already have some code to decompress zip to temp folder then read with fread, just I didn't hanlde all the formats. Upon reading source code of R.utils I found it's not too hard to use R builtin connections to decompress zip into files.

So temp_unzip here will take zip, bzip2, xz, gz and decompress to temp folder, run some function then delete the temp file.

I put more details about this function in this gist. Compare to reading uncompressed 160M csv file, using temp_unzip only take very slightly longer time(2.4s vs 2.1s).

The changes to as.temeletry is simple, though I found there are some warnings about some characters were not valid for my system locale when I tested with leroy.csv.gz in move. It looks like when fread read binary file and failed, the warning message have some characters from the binary file which was not suppressed fully with silent = TRUE. Interestingly this warning didn't appear with our first version with fread on Feb.23, but only appear with current build.

I used tryCatch to suppress that warning since it's not really useful. If we ever had some problem and need to check the error message, we can change error = function(e) "error" into error = function(e) print(e), or even just use try().

chfleming commented 7 years ago

A couple of comments:

The two calls to tempdir() are redundant, right?
dest_file <- file.path(tempdir(), "temp_unzipped.csv") should be a call to tempfile() to be safe, I would think.
I'm not very familiar with this, but is there any particular reason for the roughly 10MB buffer set by BFR.SIZE <- 1e7 ? Should this be dependent on system parameters or file.size()?

dracodoc commented 7 years ago

Yes, you are right. The code using connections is copied from R. utils,and I missed these two points.

As for the buffer size, there don't seem to be a strong reason for this number. Though I think it should be limited by memory so you cannot calculate from file size. The buffer may be the chunk size of every write to disk, and the code will read and write by chunks. Consider the normal hard drive speed, I think this value is OK.

xhdong-umd commented 7 years ago

I changed the 2nd tem_dir call to tempfile. I also tested with some different buffer size values, 1e7 looks good.

Unit: seconds
                                       expr      min       lq     mean   median       uq      max neval
 temp_unzip(eg_bz, BFR.SIZE = 1e+07, fread) 5.758906 5.758906 5.758906 5.758906 5.758906 5.758906     1
 temp_unzip(eg_bz, BFR.SIZE = 2e+07, fread) 5.873590 5.873590 5.873590 5.873590 5.873590 5.873590     1
 temp_unzip(eg_bz, BFR.SIZE = 5e+07, fread) 5.925692 5.925692 5.925692 5.925692 5.925692 5.925692     1
 temp_unzip(eg_bz, BFR.SIZE = 1e+08, fread) 5.719511 5.719511 5.719511 5.719511 5.719511 5.719511     1

xhdong-umd commented 7 years ago

One additional benefit is that read.csv doesn't recognize zip format. The R connections only support gzip, bzip2, xz. Now we can read zip too, which is common in windows platform.

chfleming commented 7 years ago

I merged and added a small change to stop() when a zip files has multiple files inside it, instead of silently importing the first file only.

xhdong-umd commented 7 years ago

@chfleming Originally I intentionally only extract first file in zip because I think it will to be too complex to support multiple files in zip. Though your design is better behavior compared to just extract the first file.

Then I found it stopped with the single file zip I created in Mac. Turned out the default GUI tool for creating zip in Mac os will add a hidden __MACOSX folder in zip. I expect there will be quite some Mac users use this method to create zip, and they probably didn't realize there is this hidden folder and would be puzzled about the error.

To separate this case with other multiple files zip would be quite complex, I think maybe the simplest method to support single file zip created in Mac os is still process the first file, but give warning about multiple files, also report the name of first file we used. This should cover the intended "single file zip", and probably will have errors on multiple files zip but given enough information about the error, which should be acceptable.

Added in commit 30590f1cf209cf71babdf8c518e3b639f76c50f3

chfleming commented 7 years ago

I think it would be better to filter out hidden files from consideration and then proceed as before (but maybe with a file listing in the stop message), because users will almost never want to only analyze the first file.

Anything beginning with a "." probably also needs to be included in the filter. Windows also hides files ending with "$".

xhdong-umd commented 7 years ago

I agree users will almost never want to only analyze the first file.

For zip with multiple files, there are several cases:

one data file, some other misc files visible to user, like readme, etc.
multiple data files.
one data file and some hidden system files, like folder created by Mac os.

For case 2, we cannot predict which one to be used, or process one of them, or process all of them as a list. The behavior cannot be determined from our side. In this case I think we simple should ask user only to zip one data file, which is a reasonable request.

For case 1, it'll be ideal if we can deal with it automatically. However there are too many possibilities. We could use some heuristics like just process the largest file, but again I feel this is user's responsibility to get a single file zip.

For case 3, the Mac os case is common enough and can be dealt with the simple method of process first file. If we want to deal with more cases, it could soon became too complex. For example some windows utility may create a zip file with folder inside it even there is only one file. Then the file path will have folder name which could make your filter fail because the folder name could have . inside it.

summary

I think multiple data files zip should not be supported. Just ask user to prepare single file zip. We have enough warning and information about this already.
Mac os user may followed our suggestion and created single file zip, we need to deal with this and it can be done simply with process first file.
The other cases with really multiple files would be too complex and unpredictable. Trying to process them with some rules defined by us may actually surprise user.

chfleming commented 7 years ago

I agree with 1 & 3.

For the issue of hidden files, I think we should take the file name list (Name?) and filter out everything beginning with "__MACOSX", everything beginning with "." (UNIX hidden file), and everything ending with "$" (Windows hidden file). Then if there is one Name left, proceed, else, stop.

ctmm-initiative / ctmm

decompress zip to temp file for fread #7

summary