djvanderlaan / LaF

An R package for fast access to large ASCII files

Read Gzipped file/connection #3

Open nemartins opened 10 years ago

nemartins commented 10 years ago

Hi.

Is it possible to read (b)gzipped files?

Nelson

djvanderlaan commented 10 years ago

No, that's unfortunately not possible and it is also difficult to implement this in LaF. With LaF it is possible to read in random lines, e.g. first read one of the columns and then read in rows based on values in that first column. This means random file access and that is difficult to do with b/gzipped files.

You can of course first uncompress the file from R and then call LaF, but I assume that is not what you want.
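
For reference, a minimal sketch of that workaround using only base R for the decompression step. The helper `gunzip_to_temp`, the demo data, and the column names are hypothetical and only serve as illustration; the `LaF::laf_open_csv` call and the row indexing are guarded and assume the package is installed and that the file has no header line.

```r
# Hypothetical helper (not part of LaF): stream-decompress a gzip file to a
# plain temporary file in fixed-size chunks, so the whole file never has to
# fit in memory.
gunzip_to_temp <- function(gz_path, chunk_bytes = 1024 * 1024) {
  out_path <- tempfile(fileext = ".csv")
  src <- gzfile(gz_path, open = "rb")
  on.exit(close(src), add = TRUE)
  dst <- file(out_path, open = "wb")
  on.exit(close(dst), add = TRUE)
  repeat {
    chunk <- readBin(src, what = "raw", n = chunk_bytes)
    if (length(chunk) == 0) break  # end of the compressed stream
    writeBin(chunk, dst)
  }
  out_path
}

# Demo data (assumption): a small gzipped CSV without a header line.
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, open = "wt")
writeLines(c("1,a", "2,b", "3,c"), con)
close(con)

# Decompress once; the plain file can then be accessed randomly.
csv_path <- gunzip_to_temp(gz_path)

# Only runs when LaF is installed; column types and names are assumptions
# that match the demo data above.
if (requireNamespace("LaF", quietly = TRUE)) {
  dat <- LaF::laf_open_csv(csv_path,
                           column_types = c("integer", "string"),
                           column_names = c("id", "label"))
  print(dat[c(1, 3), ])  # read rows 1 and 3 without loading the whole file
}
```

The cost is the disk space and time for one full decompression pass, which is the trade-off mentioned above.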

nemartins commented 10 years ago

Hi, thank you for the prompt reply. I imagined that random access would be a problem. There's the tabix index for random access to bgzipped files, but that would be for a very specific use case (namely VCF files for NGS data).

Thanks anyway

Nelson

On Thu, Dec 5, 2013 at 9:11 PM, Jan van der Laan notifications@github.com wrote:

> No, that's unfortunately not possible and it is also difficult to implement this in LaF. With LaF it is possible to read in random lines, e.g. first read one of the columns and then read in rows based on values in that first column. This means random file access and that is difficult to do with b/gzipped files.
>
> You can of course first uncompress the file from R and then call LaF, but I assume that is not what you want.


djvanderlaan commented 10 years ago

I had a look through my mail archives, because I remembered that you weren't the first to ask this. My answer then was roughly the same as the one I gave you. I also found a link to a discussion on Stack Overflow about random access in gzipped files: http://stackoverflow.com/questions/2526930/random-access-gzip-stream. They also mention the BGZF format there, which I assume is the same format you mention. In theory it should be possible to add this to LaF, but it would be quite a lot of work. Unless there is very large demand for something like this, or a very clear (performance) advantage, I don't think I'll find time to work on it.