dsnet / compress

Collection of compression related Go packages.

Question: Is there an example of buffered read/seek in chunks? #73

marcellmars opened this issue 1 year ago

marcellmars commented 1 year ago

I was playing with bgzf archives, and it was fairly easy to use bgzf.Reader inside a bufio.Reader so that the archive could be read in chunks. In one pass I would build a useful index of offsets, so that later on I could use the very large archive as if it were a memory-mapped file on disk.

I tried to find out whether xflate can be used in a similar way. All of the examples I could find read the whole compressed archive into memory.

So, my question is: is there an example of buffered read/seek in chunks of a compressed "xflated" archive?

I found that a custom implementation of what I've tried to describe here via io.ReadSeeker is not trivial. So if there's already an example, I would appreciate it immensely :)

dsnet commented 1 year ago

I don't quite understand how your use of bgzf works to begin with, so I'm unable to suggest an equivalent use with xflate. Do you have an example?

dsnet commented 1 year ago

Also, keep in mind that XFLATE operates differently than BGZF. BGZF is effectively a linked-list of independently compressed segments, so you need to read through the whole file to determine the boundaries of each segment. In contrast, XFLATE contains an index that reports the location of each segment in O(1). Thus, you can seek to the middle of an XFLATE file without needing to ever read all the content before that point.
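
For illustration, a rough sketch of what that enables (the file name is made up and error handling is abbreviated; the Reader seeks in terms of offsets in the uncompressed stream):

```go
package main

import (
	"fmt"
	"io"
	"os"

	"github.com/dsnet/compress/xflate"
)

func main() {
	f, err := os.Open("archive.xfl") // hypothetical XFLATE file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// NewReader locates the index near the end of the stream;
	// it does not scan the compressed content itself.
	xr, err := xflate.NewReader(f, nil)
	if err != nil {
		panic(err)
	}
	defer xr.Close()

	// Seek using an offset into the *uncompressed* stream. The index
	// resolves this to the containing chunk in O(1), and only that
	// chunk needs to be decompressed.
	if _, err := xr.Seek(1<<30, io.SeekStart); err != nil {
		panic(err)
	}
	buf := make([]byte, 4096)
	n, _ := io.ReadFull(xr, buf)
	fmt.Printf("read %d bytes from the middle of the archive\n", n)
}
```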

marcellmars commented 1 year ago

OK, here's the use case.

I have a very large gzipped JSON-lines file (one JSON record per line). With bgzf I do two passes.

In the first pass I wrap gzip.NewReader in bufio.NewReader and call .ReadBytes('\n') to walk the file line by line. Then I pass each line to a bgzf.Writer, writing it and flushing every million lines. That's how I end up with a BGZF gzipped archive.
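
Roughly, the first pass looks like this (file names and the flush interval are just placeholders, and most error handling is elided):

```go
package main

import (
	"bufio"
	"compress/gzip"
	"os"

	"github.com/biogo/hts/bgzf"
)

func main() {
	in, _ := os.Open("records.jsonl.gz")
	defer in.Close()
	out, _ := os.Create("records.jsonl.bgzf")
	defer out.Close()

	gz, _ := gzip.NewReader(in)
	defer gz.Close()
	br := bufio.NewReader(gz)

	w := bgzf.NewWriter(out, 1) // one compression goroutine
	defer w.Close()

	n := 0
	for {
		line, err := br.ReadBytes('\n')
		if len(line) > 0 {
			w.Write(line)
		}
		if err != nil { // io.EOF terminates the loop
			break
		}
		if n++; n%1000000 == 0 {
			w.Flush() // finish the current BGZF block every million lines
		}
	}
}
```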

In the second pass I use bgzf.Reader, to which I pass the *os.File, and do the same as in the first pass (bufio.NewReader / .ReadBytes('\n') line by line), building an index where the id from each JSON record is the key and its bgzf.Chunk is the value.

That works fine: very little RAM, and moving through the compressed file to find a particular JSON record via its id is fast enough.

Meanwhile I played with rac and took a somewhat similar approach. There I didn't have to do two passes over a compressed archive, because rac accepts an index of offset/length values made against the uncompressed file, so I built that index against the uncompressed .jsonl file. For rac I pass the *os.File to rac.Reader, and its .SeekRange prepares the rac.Reader to give back the relevant content via io.ReadAll, given the offset and length (measured against the uncompressed file) that I provided to .SeekRange.

So both of these experiments gave me a way to query a very large compressed archive of many JSON records for the chunk in which a particular record will be found. Any particular query uses very little RAM and is fairly fast.

I am sure XFLATE could be used for this use case. I just couldn't figure out how to take a reference to the compressed archive (e.g. an *os.File) and provide the offset/length so that XFLATE gives me the desired JSON record. I already have an index made against the uncompressed file, so I wonder if I can reuse it for XFLATE.
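
What I imagine is something like this, where the entry comes from my existing index against the uncompressed file (a hypothetical sketch: Entry and readRecord are my own names, and I'm assuming xflate.Reader.Seek works in uncompressed offsets):

```go
package main

import (
	"io"
	"os"

	"github.com/dsnet/compress/xflate"
)

// Entry is one record of my index, built against the uncompressed
// .jsonl file: the byte offset of a line and its length.
type Entry struct {
	Offset int64
	Length int64
}

// readRecord pulls one JSON line out of the compressed archive.
func readRecord(f *os.File, e Entry) ([]byte, error) {
	xr, err := xflate.NewReader(f, nil)
	if err != nil {
		return nil, err
	}
	defer xr.Close()

	// Seek to the uncompressed offset recorded in the index...
	if _, err := xr.Seek(e.Offset, io.SeekStart); err != nil {
		return nil, err
	}
	// ...and read exactly the line's length.
	buf := make([]byte, e.Length)
	if _, err := io.ReadFull(xr, buf); err != nil {
		return nil, err
	}
	return buf, nil
}

func main() {
	f, _ := os.Open("records.xfl") // hypothetical XFLATE archive
	defer f.Close()
	line, _ := readRecord(f, Entry{Offset: 12345, Length: 678})
	os.Stdout.Write(line)
}
```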

I hope this explains it better.

Just to mention: I managed to use xflate.NewWriter to write an XFLATE archive in chunks, and I can read from it with gzip.NewReader.
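
For reference, this is roughly how I write it (file names and the chunk interval are placeholders; I'm assuming Flush(xflate.FlushFull) is what ends a chunk so a later seek can land on that boundary):

```go
package main

import (
	"bufio"
	"os"

	"github.com/dsnet/compress/xflate"
)

func main() {
	in, _ := os.Open("records.jsonl")
	defer in.Close()
	out, _ := os.Create("records.xfl")
	defer out.Close()

	// nil config: default compression level and chunk size.
	xw, _ := xflate.NewWriter(out, nil)
	defer xw.Close()

	br := bufio.NewReader(in)
	n := 0
	for {
		line, err := br.ReadBytes('\n')
		if len(line) > 0 {
			xw.Write(line)
		}
		if err != nil { // io.EOF terminates the loop
			break
		}
		if n++; n%1000000 == 0 {
			// End the current chunk so that it becomes an
			// independent seek point.
			xw.Flush(xflate.FlushFull)
		}
	}
}
```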