bebop / poly

A Go package for engineering organisms.
https://pkg.go.dev/github.com/bebop/poly
MIT License

Add svb compression to slow5 #341

Closed. Koeng101 closed this 11 months ago.

Koeng101 commented 11 months ago

Replaces #328, this time merging into ioToBio to take advantage of the new parser format.

Koeng101 commented 11 months ago

Moving #328 here to merge with ioToBio.

https://github.com/TimothyStiles/poly/pull/328#discussion_r1310457760 is relevant - I think this is just looking ahead to blow5 compatibility.

To quote the importance of just the svb compression:

I've seen a reduction in total database size (storing fastq and slow5 reads) from 12 GB to 7.1 GB using just this one weird trick (database engineers hate him)

This is without implementing the other tricks from blow5.

I also added examples of how to use the compression/decompression.
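
For anyone reading along, a minimal sketch of what that round trip can look like. The import path assumes the post-ioToBio layout (bio/slow5), and the function names and signatures (SvbCompressRawSignal, SvbDecompressRawSignal) are assumptions for illustration; they may not match exactly what lands in this PR, so check the package's example tests for the real usage.

```go
package main

import (
	"fmt"

	"github.com/bebop/poly/bio/slow5" // assumed import path
)

func main() {
	// A raw nanopore signal is a slice of int16 samples.
	rawSignal := []int16{430, 472, 463, 467, 454, 465, 463, 450}

	// Hypothetical compression call: StreamVByte produces two byte
	// slices, a mask (control bytes) and the packed data bytes.
	mask, data := slow5.SvbCompressRawSignal(rawSignal)

	// Decompression needs the original sample count plus both slices.
	decompressed := slow5.SvbDecompressRawSignal(len(rawSignal), mask, data)

	fmt.Println(len(mask), len(data))
	fmt.Println(decompressed)
}
```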

carreter commented 11 months ago

Continuing the thread from over there (https://github.com/TimothyStiles/poly/pull/328#discussion_r1321785254):

I'm using these functions for taking the rawSignal in and out of an SQL database, so having it be a method doesn't work as well as a raw function.

CompressedRead I think would just be a blow5 read rather than slow5 read, which we SHOULD implement at some point. (there is a whole binary format specifically for doing those compressed reads well, with some real performance improvements)

If this is the case, I think we should entirely hold off on merging this PR in favor of doing a blow5 PR instead. As it stands, I don't really see a use case for other clients of this package. If someone wants to perform this compression, they can just import your svb package themselves.

Koeng101 commented 11 months ago

Continuing the thread from over there (#328 (comment)):

I'm using these functions for taking the rawSignal in and out of an SQL database, so having it be a method doesn't work as well as a raw function.

CompressedRead I think would just be a blow5 read rather than slow5 read, which we SHOULD implement at some point. (there is a whole binary format specifically for doing those compressed reads well, with some real performance improvements)

If this is the case, I think we should entirely hold off on merging this PR in favor of doing a blow5 PR instead. As it stands, I don't really see a use case for other clients of this package. If someone wants to perform this compression, they can just import your svb package themselves.

Breaking it down into two parts: (1) blow5 vs. svb, and (2) svb triviality.

I strongly disagree that blow5 fulfills the same use case:

  1. blow5 uses svb, but the intended use case is completely different. These slow5 package functions exist so that you can use slow5 with different data stores (like an SQL database). blow5 cannot fulfill that use case: if you want to query on the read data, it can't live inside a binary file format. However, the one column that takes up the majority of the space and is never queried on directly (the raw signal) can be compressed with svb to roughly halve the needed storage, which is the limiting factor for these kinds of data files (see the sketch after this list). blow5 is for binary storage, probably putting Nanopore runs in an object store, while slow5+svb is much better suited to a document-style database.
  2. I am currently storing ~14 actual sequencing runs in a database with this format, with more coming in every week. I have real-world experience that this works well in practice and that blow5 does not fill this niche.
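
To ground the SQL use case, here is a hedged sketch of what storing reads this way can look like, using database/sql with the mattn/go-sqlite3 driver. The schema, column names, and the Svb* helper names are illustrative assumptions, not the exact code from this PR.

```go
package main

import (
	"database/sql"
	"log"

	"github.com/bebop/poly/bio/slow5" // assumed import path and helper names
	_ "github.com/mattn/go-sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "reads.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Queryable metadata stays in normal columns; only the bulky raw
	// signal, which is never filtered on directly, is stored compressed.
	_, err = db.Exec(`CREATE TABLE IF NOT EXISTS reads (
		read_id        TEXT PRIMARY KEY,
		read_group     INTEGER,
		len_raw_signal INTEGER,
		signal_mask    BLOB,
		signal_data    BLOB
	)`)
	if err != nil {
		log.Fatal(err)
	}

	rawSignal := []int16{430, 472, 463, 467, 454, 465}
	mask, data := slow5.SvbCompressRawSignal(rawSignal)

	_, err = db.Exec(
		`INSERT INTO reads (read_id, read_group, len_raw_signal, signal_mask, signal_data)
		 VALUES (?, ?, ?, ?, ?)`,
		"read-0001", 0, len(rawSignal), mask, data,
	)
	if err != nil {
		log.Fatal(err)
	}

	// Reading back: query on the metadata, then decompress the blob columns.
	var length int
	var m, d []byte
	err = db.QueryRow(
		`SELECT len_raw_signal, signal_mask, signal_data FROM reads WHERE read_id = ?`,
		"read-0001",
	).Scan(&length, &m, &d)
	if err != nil {
		log.Fatal(err)
	}
	log.Println(slow5.SvbDecompressRawSignal(length, m, d))
}
```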

I also disagree that svb is trivial:

  1. The svb implementation is non-trivial (before my fork, you couldn't actually do it with pre-existing libraries), and here we have it tested in the single context where it is useful. Unlike gzip or zstd, it isn't just wrapping an io.Reader/io.Writer. Trivial would be piping an io.Reader, not two make([]xint) calls plus a type conversion loop. What is a mask array? Do you need it? Which inputs need to be saved for decompression? Having the answers captured in a function makes an unfamiliar compression/decompression scheme far more readable (a minimal sketch of the idea follows this list).
  2. There is also a messaging aspect to using svb this way and shipping built-in functions for it. The performance increase is pretty darn great compared to other compression methods, and by including it we are saying "THIS is how you should compress raw reads".
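
To make the "mask array" point concrete, here is a small, self-contained sketch of the StreamVByte idea. It is not the implementation in this PR and is not vectorized; it just shows where the mask (control) bytes come from, why the raw int16 signal has to be widened first, and which pieces must be saved for decompression.

```go
package main

import "fmt"

// streamVByteEncode sketches the idea behind svb: each uint32 is stored in
// 1-4 data bytes, and a 2-bit length code per value is packed into the
// control ("mask") bytes, four values per control byte.
func streamVByteEncode(values []uint32) (control, data []byte) {
	control = make([]byte, (len(values)+3)/4)
	for i, v := range values {
		// Number of bytes needed for v, minus one, as a 2-bit code.
		var code byte
		switch {
		case v < 1<<8:
			code = 0
		case v < 1<<16:
			code = 1
		case v < 1<<24:
			code = 2
		default:
			code = 3
		}
		control[i/4] |= code << uint((i%4)*2)
		for b := byte(0); b <= code; b++ {
			data = append(data, byte(v>>(8*b)))
		}
	}
	return control, data
}

// streamVByteDecode needs three things: the control bytes, the data bytes,
// and the original value count -- exactly the inputs that have to be saved
// alongside the compressed raw signal.
func streamVByteDecode(count int, control, data []byte) []uint32 {
	values := make([]uint32, count)
	pos := 0
	for i := 0; i < count; i++ {
		code := (control[i/4] >> uint((i%4)*2)) & 0x3
		var v uint32
		for b := byte(0); b <= code; b++ {
			v |= uint32(data[pos]) << (8 * b)
			pos++
		}
		values[i] = v
	}
	return values
}

func main() {
	// Raw signals are int16, so they first get widened to uint32:
	// the "type conversion loop" mentioned above. (A real implementation
	// may handle negative samples differently, e.g. via zig-zag encoding.)
	rawSignal := []int16{430, 472, 463, 12000, 3, 467}
	widened := make([]uint32, len(rawSignal))
	for i, s := range rawSignal {
		widened[i] = uint32(uint16(s))
	}

	control, data := streamVByteEncode(widened)
	fmt.Printf("control (mask) bytes: %d, data bytes: %d\n", len(control), len(data))

	// Round-trip back to int16 to check the signal survives.
	decoded := streamVByteDecode(len(widened), control, data)
	roundTrip := make([]int16, len(decoded))
	for i, v := range decoded {
		roundTrip[i] = int16(uint16(v))
	}
	fmt.Println(roundTrip)
}
```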

As it stands, I don't really see a use case for other clients of this package

I specifically talk about the use case for other clients in the context comment.

carreter commented 11 months ago

Thank you for the clarifications re: slow5+svb vs blow5 and svb triviality. I completely missed the context comment, so sorry!

I'm on board with adding this functionality - reviewing now.