bebop / poly

A Go package for engineering organisms.
https://pkg.go.dev/github.com/bebop/poly
MIT License

Implement efficient sequence search #15

Closed. Koeng101 closed this issue 4 years ago.

Koeng101 commented 4 years ago

An extremely useful function for poly to have is efficient sequence search. It can be used for lots of different operations. Historically, BLAST has been the most popular technology, but it has a few major disadvantages: BLAST is a relatively old tool, and its time complexity is at best O(n) and at worst O(n^2), where n is the length of your sequence.

How should we implement efficient sequence search in Poly?

Use modern algorithms like suffix trees, suffix arrays, and FM indexes

When it comes to efficient algorithms for searching sequences, there are three primary data structures of interest: suffix trees, suffix arrays, and FM indexes. All three can reduce sequence search down to roughly logarithmic time. Suffix trees take a large quantity of memory, but are extremely fast for certain domains of problems (like the longest common substring problem). Suffix arrays are much simpler and smaller, but can't do as many interesting things as suffix trees can. However, they do have the unique advantage in Go of being implemented in the standard library (index/suffixarray).
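
For reference, searching with the standard library looks roughly like this (the sequence and query below are made up for illustration):

```go
package main

import (
	"fmt"
	"index/suffixarray"
)

func main() {
	// Hypothetical concatenation of stored sequences; real data would be far
	// larger and would likely be deduplicated first.
	genome := []byte("ATGCGTACGTTAGCATGCGTACGTTAGC")

	// Build the suffix array index from the standard library.
	index := suffixarray.New(genome)

	// Find every offset of a query subsequence; -1 means "return all matches".
	offsets := index.Lookup([]byte("CGTACG"), -1)
	fmt.Println(offsets)
}
```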

FM indexes are more complex than both suffix arrays and suffix trees (if you want to learn more, watch this video on suffix tries/trees, the Burrows-Wheeler transform, and finally the FM index). While they are more computationally intensive and much more complicated than suffix trees or arrays, they take up far less space.

Given that popular tools like Bowtie2 and Centrifuge (https://ccb.jhu.edu/software/centrifuge/) use the FM index, this is probably what poly should do, right?

FM indexes aren't that great

In 2016, Centrifuge was built for the purpose of very fast sequence search. Basically, its authors took similar genomes and deduplicated common genetic elements (grouping by species and using k-mers of length 53), then created an FM index of those sequences. The deduplication seems to yield a sequence that is roughly 15% the size of the original entry. In 2016, the NCBI sequence databases, with duplicate entries removed, totaled about 109 billion base pairs.

Stored as a suffix array, this would take 147 gigabytes of memory with the Go suffix array implementation (each base stored as an int8, so index + storage takes 9 bytes per base).

Stored as an FM index, this took 69 gigabytes of memory.

Centrifuge's naive deduplication algorithm achieves about a 6.6x reduction in memory usage overall, while the FM index only gets a further 2.1x reduction over the suffix array.
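
For what it's worth, here is the rough arithmetic behind those numbers, assuming the 147 GB figure is a suffix array over the deduplicated (~15%) sequence at 9 bytes per indexed base; the 69 GB FM index size is just the reported figure:

```go
package main

import "fmt"

func main() {
	const (
		totalBases    = 109e9 // ~109 billion bp in the 2016 NCBI set
		dedupFraction = 0.15  // Centrifuge-style dedup keeps ~15%
		bytesPerBase  = 9.0   // 1 byte of sequence + 8 bytes of suffix array index
		fmIndexGB     = 69.0  // reported Centrifuge FM index size
	)

	naiveGB := totalBases * bytesPerBase / 1e9                  // suffix array, no dedup: ~981 GB
	dedupGB := totalBases * dedupFraction * bytesPerBase / 1e9  // suffix array after dedup: ~147 GB

	fmt.Printf("no dedup:  %.0f GB\n", naiveGB)
	fmt.Printf("dedup SA:  %.0f GB (%.1fx smaller)\n", dedupGB, naiveGB/dedupGB)
	fmt.Printf("FM index:  %.0f GB (%.1fx smaller than the dedup SA)\n", fmIndexGB, dedupGB/fmIndexGB)
}
```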

Deduplication should be the primary focus

In essence, it seems like what saves a real quantity of memory isn't necessarily the FM index and all the complexity it brings along, but the simple act of deduplicating data. Thus, to do effective sequence search, I think Poly should use the basic Go standard-library suffix arrays and invest energy into figuring out deduplication of genetic data. If Poly is meant for forward engineering, we can expect that DNA parts will be reused quite often, which is an opportunity for a very efficient deduplication scheme. Centrifuge's deduplication scheme doesn't have great support for circular sequences, which is something Poly should have.

Deduplication is also a very nice feature for commercial operations that need to save lots of sequence data, such as data coming through sequencers. I'm not sure how to implement this yet, but we should think about it. Please post in this thread if you have any ideas.

Koeng101 commented 4 years ago

Here is a potential solution to non-naive deduplication.

The general idea is that to do deduplication you find the longest common substring (L) that occurs a certain number of times (K) in a concatenated string of all known DNA (S). If you graph L on the Y axis against K on the X axis, you're going to find something like a power-law distribution: for small K you have a very large L, and as you increase K, L gets shorter.

We have a function that we're optimizing for, however, which is to save space. The number of bytes saved (BS) is the value of L*K, which, graphed against K, will be a wavy-looking function with peaks and dips. To achieve good compression, you yoink the regions that maximize L*K (equivalently, minimize the resulting stored size), remove the occurrences of those L regions from S, and keep them separately in string X, noting which portions you've taken out.

X becomes the new actual index, and then you just iterate on S, appending things to X, until it's not worth doing anymore (the index storage becomes more costly than just redundantly storing a few bytes), and then slap the remaining S onto X.
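
A minimal sketch of that scoring step, assuming we only score fixed-length candidates with a sliding-window count (the function and variable names are made up, and overlaps and circular sequences are ignored):

```go
package main

import "fmt"

// bestRegion returns the substring of length l with the most occurrences in s
// (counted with a sliding window, so overlapping hits are included), along
// with that count K. Brute force, for illustration only.
func bestRegion(s string, l int) (best string, k int) {
	counts := map[string]int{}
	for i := 0; i+l <= len(s); i++ {
		counts[s[i:i+l]]++
	}
	for sub, c := range counts {
		if c > k {
			best, k = sub, c
		}
	}
	return best, k
}

func main() {
	// Hypothetical concatenated sequence S.
	s := "ATGCGTACGTTAGCATGCGTACGTTAGCATGCGTACG"

	// Scan a few candidate lengths L and keep the repeat with the best
	// L*K score, as described above.
	var bestSub string
	var bestScore int
	for l := 4; l <= 12; l++ {
		if sub, k := bestRegion(s, l); k >= 2 && l*k > bestScore {
			bestSub, bestScore = sub, l*k
		}
	}
	fmt.Printf("best repeat %q with L*K = %d\n", bestSub, bestScore)
}
```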

In the end, you will have a map of parts -> []indexes and the X string. Take the suffix array (SA) of X, and you are left with 3 data structures: the map, X, and SA, which together should represent approximately the minimal space requirements for any given set of DNA strings. Then, for figuring out matches, use an algorithm like Centrifuge's on the suffix array, hopping whenever you encounter an index edge.
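
As a sketch of what those three structures could look like in Go (all names here are hypothetical, and the boundary "hops" are left out):

```go
package main

import (
	"fmt"
	"index/suffixarray"
)

// dedupStore is a hypothetical shape for the structures described above:
// the extracted-parts map, the residual string X, and a suffix array over X.
type dedupStore struct {
	parts map[string][]int   // part sequence -> positions where it was removed from S
	x     string             // X: deduplicated residual string
	sa    *suffixarray.Index // SA: suffix array built over X
}

func newDedupStore(x string, parts map[string][]int) *dedupStore {
	return &dedupStore{parts: parts, x: x, sa: suffixarray.New([]byte(x))}
}

// find returns match offsets within X only; a full implementation would also
// follow the recorded part boundaries ("hops") to report matches that span
// removed regions.
func (d *dedupStore) find(query string) []int {
	return d.sa.Lookup([]byte(query), -1)
}

func main() {
	store := newDedupStore("ATGCGTACGTTAGC", map[string][]int{"GGTCTC": {3, 40}})
	fmt.Println(store.find("CGTACG"))
}
```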

Since you end up with a suffix array over a plain string, you can use native Go machinery for fast search. This is a pretty similar idea to the algorithm introduced in "A new algorithm for 'the LCS problem' with application in compressing genome resequencing data". Basically, they take the suffix tree and then take the shortest-path graphs. It's kind of complicated, while my solution above is basically to take the suffix array (native Go) plus the LCP array and then plug-and-chug with concurrency to brute-force the L*K regions that save the most space.
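
For the suffix array + LCP part, here is a minimal sketch. Go's index/suffixarray doesn't expose the raw array, so this builds a naive one and computes the LCP array with Kasai's algorithm; fine for illustration, far too slow for real genomes:

```go
package main

import (
	"fmt"
	"sort"
)

// naiveSuffixArray sorts all suffixes directly; illustration only.
func naiveSuffixArray(s string) []int {
	sa := make([]int, len(s))
	for i := range sa {
		sa[i] = i
	}
	sort.Slice(sa, func(a, b int) bool { return s[sa[a]:] < s[sa[b]:] })
	return sa
}

// lcpArray computes, via Kasai's algorithm, the longest common prefix between
// each suffix and the suffix just before it in sorted order.
func lcpArray(s string, sa []int) []int {
	n := len(s)
	rank := make([]int, n)
	for i, p := range sa {
		rank[p] = i
	}
	lcp := make([]int, n)
	h := 0
	for i := 0; i < n; i++ {
		if rank[i] > 0 {
			j := sa[rank[i]-1]
			for i+h < n && j+h < n && s[i+h] == s[j+h] {
				h++
			}
			lcp[rank[i]] = h
			if h > 0 {
				h--
			}
		} else {
			h = 0
		}
	}
	return lcp
}

func main() {
	// Hypothetical concatenated sequence S; peaks in the LCP array point at
	// long repeats, i.e. candidate L regions to pull out.
	s := "ATGCGTACGTTAGCATGCGTACGTTAGC"
	sa := naiveSuffixArray(s)
	lcp := lcpArray(s, sa)

	bestLen, bestPos := 0, 0
	for i, l := range lcp {
		if l > bestLen {
			bestLen, bestPos = l, sa[i]
		}
	}
	fmt.Printf("longest repeated substring: %q\n", s[bestPos:bestPos+bestLen])
}
```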

Thoughts @TimothyStiles ?

TimothyStiles commented 4 years ago

@Koeng101 this looks great and fits into plans for what I'm tentatively calling "poly store".

It looks like insertion time for the search cache to work is roughly K*N, where K is the number of cached sequences and N is the number of new input sequences. For sufficiently large stores there's probably going to need to be a daemon that manages making initial input searchable while applying updates to the indexes and the X string in a way that doesn't disrupt queries that are processing at the same time. It's more of a minor detail, but have you given thought to that?
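
One possible shape for that (purely a sketch, nothing that exists in poly): serve queries from an immutable snapshot of the index and have a background daemon rebuild and atomically swap in a new one:

```go
package main

import (
	"fmt"
	"index/suffixarray"
	"sync/atomic"
)

// store serves lookups from an immutable snapshot while a background rebuild
// prepares the next snapshot; readers are never blocked by the rebuild.
type store struct {
	snapshot atomic.Pointer[suffixarray.Index]
}

func (s *store) lookup(q []byte) []int {
	return s.snapshot.Load().Lookup(q, -1)
}

// rebuild reindexes the (hypothetical) deduplicated X string off to the side
// and swaps it in once it is ready.
func (s *store) rebuild(x []byte) {
	s.snapshot.Store(suffixarray.New(x))
}

func main() {
	var st store
	st.rebuild([]byte("ATGCGTACGTTAGC"))
	fmt.Println(st.lookup([]byte("CGTACG")))

	// Later, a daemon could call rebuild with an updated X while queries
	// keep hitting the old snapshot until the swap completes.
	st.rebuild([]byte("ATGCGTACGTTAGCGGTCTC"))
	fmt.Println(st.lookup([]byte("GGTCTC")))
}
```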

Koeng101 commented 4 years ago

#42 Moving the discussion there for poly store.