marcellmars opened this issue 1 year ago
I don't quite understand how your use of `bgzf` works to begin with, so I'm unable to suggest an equivalent use with `xflate`. Do you have an example?
Also, keep in mind that XFLATE operates differently than BGZF. BGZF is effectively a linked-list of independently compressed segments, so you need to read through the whole file to determine the boundaries of each segment. In contrast, XFLATE contains an index that reports the location of each segment in O(1). Thus, you can seek to the middle of an XFLATE file without needing to ever read all the content before that point.
OK, here's the use case.

I have a very large gzipped JSON-lines file. With `bgzf` I do two passes.

In the first pass I use `gzip.NewReader`, wrap it in `bufio.NewReader`, and call `.ReadBytes('\n')` to read line by line. I then pass each line to a `bgzf.Writer`, write it, and flush every million lines. That's how I end up with a BGZF-compressed archive.
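In code, pass 1 looks roughly like this (only a trimmed sketch: file names are placeholders, error handling is minimal, and `flushEvery` stands in for the million-line interval):

```go
// Pass 1: re-compress a gzipped JSON-lines file into BGZF, flushing every
// flushEvery lines so the flushed blocks can later be addressed by bgzf.Chunk.
// File names and flushEvery are placeholders.
package main

import (
	"bufio"
	"compress/gzip"
	"io"
	"log"
	"os"

	"github.com/biogo/hts/bgzf"
)

func main() {
	const flushEvery = 1_000_000

	in, err := os.Open("records.jsonl.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer in.Close()

	gz, err := gzip.NewReader(in)
	if err != nil {
		log.Fatal(err)
	}
	defer gz.Close()

	out, err := os.Create("records.jsonl.bgzf")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	bw := bgzf.NewWriter(out, 1) // 1 compression goroutine
	defer bw.Close()

	br := bufio.NewReader(gz)
	for n := 0; ; n++ {
		line, err := br.ReadBytes('\n')
		if len(line) > 0 {
			if _, werr := bw.Write(line); werr != nil {
				log.Fatal(werr)
			}
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		if n > 0 && n%flushEvery == 0 {
			if ferr := bw.Flush(); ferr != nil {
				log.Fatal(ferr)
			}
		}
	}
}
```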
In the second pass I use `bgzf.Reader`, to which I pass the `*os.File`, and do the same as in the first pass: `bufio.NewReader` / `.ReadBytes('\n')` line by line, building an index where the `id` from each JSON record is the key and a `bgzf.Chunk` is the value.

That works fine: very little RAM, and it is fast enough at moving through the compressed file to find a particular JSON record via its `id`.
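Pass 2, roughly, looks like this (again just a sketch: the `id` field, the file name, and the block-granular bookkeeping around `LastChunk` are my approximation of what I described above):

```go
// Pass 2: walk the BGZF archive line by line and build an index from the
// "id" field of each JSON record to the bgzf.Chunk it was read from.
// Because bufio reads ahead, the recorded chunk is block-granular: to
// retrieve a record later, Seek to chunk.Begin and scan forward for the id.
package main

import (
	"bufio"
	"encoding/json"
	"io"
	"log"
	"os"

	"github.com/biogo/hts/bgzf"
)

type record struct {
	ID string `json:"id"`
}

func main() {
	f, err := os.Open("records.jsonl.bgzf")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	rdr, err := bgzf.NewReader(f, 1) // 1 decompression goroutine
	if err != nil {
		log.Fatal(err)
	}
	defer rdr.Close()

	index := make(map[string]bgzf.Chunk)

	br := bufio.NewReader(rdr)
	for {
		chunk := rdr.LastChunk() // block(s) backing what bufio has pulled so far
		line, err := br.ReadBytes('\n')
		if len(line) > 0 {
			var rec record
			if jerr := json.Unmarshal(line, &rec); jerr == nil {
				index[rec.ID] = chunk
			}
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
	}

	// Later: rdr.Seek(index[id].Begin) and scan forward for the record.
}
```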
Meanwhile I played with `rac` and took a somewhat similar approach. There I didn't have to do two passes over a compressed archive, because `rac` accepts an index with offset/length values made against the uncompressed file, so I built that index against the uncompressed `.jsonl` file. For `rac` I pass the `*os.File` to `rac.Reader`, and its `.SeekRange` prepares the `rac.Reader` to hand over all of its content via `io.ReadAll`, given that I provide the offset and length (measured against the uncompressed file) to `.SeekRange`.

So both of these experiments gave me a way to query a very large compressed archive of many JSON records for the chunk in which a particular record will be found. Any particular query uses very little RAM and is fairly fast.
I am sure XFLATE could be used for this use case. I just couldn't figure out how to take a reference to the compressed archive (e.g. an `*os.File`) and provide an offset/length so that XFLATE gives me the desired JSON record. I already have an index made against the uncompressed file, so I wonder if I can reuse it for XFLATE.

I hope this explains it better.
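In other words, what I am hoping for is something along these lines. This is only a sketch of the desired call pattern: it assumes `xflate.Reader` can `Seek` by offsets in the uncompressed stream, and the file name and offset/length values are made up:

```go
// Desired usage: open the XFLATE archive, seek to an uncompressed offset
// taken from my pre-built index, and read `length` bytes. Assumes
// xflate.Reader seeks by offsets in the uncompressed stream; names and
// the example offset/length are hypothetical.
package main

import (
	"fmt"
	"io"
	"log"
	"os"

	"github.com/dsnet/compress/xflate"
)

func main() {
	f, err := os.Open("records.xflate") // placeholder name for the archive
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	xr, err := xflate.NewReader(f, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer xr.Close()

	// offset/length come from the index built against the uncompressed file.
	var offset, length int64 = 123456, 789 // hypothetical values

	if _, err := xr.Seek(offset, io.SeekStart); err != nil {
		log.Fatal(err)
	}
	chunk, err := io.ReadAll(io.LimitReader(xr, length))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s", chunk)
}
```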
Just to mention: I managed to use `xflate.NewWriter` to write an XFLATE archive in chunks, and I can read from it with `gzip.NewReader`.
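The writer side of that looked roughly like this. It is a sketch of my understanding rather than a definitive recipe: I assume `Flush(xflate.FlushFull)` is what ends the current chunk, the gzip wrapping I mentioned is left out, and the file names and `chunkLines` interval are placeholders:

```go
// Write a raw XFLATE stream line by line, ending a chunk every chunkLines
// lines so each chunk can later be located independently. FlushFull as the
// chunk boundary is my assumption; names and chunkLines are placeholders.
package main

import (
	"bufio"
	"io"
	"log"
	"os"

	"github.com/dsnet/compress/xflate"
)

func main() {
	const chunkLines = 1_000_000

	in, err := os.Open("records.jsonl")
	if err != nil {
		log.Fatal(err)
	}
	defer in.Close()

	out, err := os.Create("records.xflate")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	xw, err := xflate.NewWriter(out, nil)
	if err != nil {
		log.Fatal(err)
	}

	br := bufio.NewReader(in)
	for n := 0; ; n++ {
		line, err := br.ReadBytes('\n')
		if len(line) > 0 {
			if _, werr := xw.Write(line); werr != nil {
				log.Fatal(werr)
			}
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		if n > 0 && n%chunkLines == 0 {
			if ferr := xw.Flush(xflate.FlushFull); ferr != nil {
				log.Fatal(ferr)
			}
		}
	}
	if err := xw.Close(); err != nil { // finish the stream (index/footer)
		log.Fatal(err)
	}
}
```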
I was playing with BGZF archives and it was fairly easy to use `bgzf.Reader` inside a `bufio.Reader` so that the archive could be read in chunks. In one pass I would build a useful index of offsets, so later on I could use the very large archive as if it were a memory-mapped file on disk. I tried to find an example of using xflate in a similar way, but all of the examples I could find read the whole compressed archive into memory.

So, my question is: is there an example of buffered read/seek in chunks of a compressed "xflated" archive?

I found rolling my own implementation of what I tried to describe here, via `io.ReadSeeker`, to be non-trivial. So if there's already an example, I would appreciate it immensely :)
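For what it's worth, the shape I imagine such an example would take is roughly the following: seek by an uncompressed offset from the index, wrap the reader in `bufio`, and scan forward for the wanted `id`. This assumes `xflate.Reader` implements `io.ReadSeeker` over the uncompressed stream; the file name, offset, and id are hypothetical:

```go
// Buffered read-after-seek: position the xflate.Reader at an uncompressed
// offset from the index, wrap it in bufio, and scan lines for the wanted id.
// Assumes xflate.Reader implements io.ReadSeeker over the uncompressed
// stream; names and values are hypothetical.
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"log"
	"os"

	"github.com/dsnet/compress/xflate"
)

func main() {
	f, err := os.Open("records.xflate")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	xr, err := xflate.NewReader(f, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer xr.Close()

	var offset int64 = 123456         // hypothetical uncompressed offset from the index
	wantID := []byte(`"id":"abc123"`) // hypothetical id to look for

	if _, err := xr.Seek(offset, io.SeekStart); err != nil {
		log.Fatal(err)
	}

	br := bufio.NewReader(xr)
	for {
		line, err := br.ReadBytes('\n')
		if bytes.Contains(line, wantID) {
			fmt.Printf("%s", line)
			break
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
	}
}
```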