Implement Burrows-Wheeler transform

carreter commented 11 months ago

@TimothyStiles can you elaborate on what exactly it is we need here and triage the issue here + in the roadmap?

TwFlem commented 9 months ago

I'd love to take a whack at this!

@TimothyStiles might have some other intentions in mind, but Burrows Wheeler Transform (BWT) can be used as a data structure to query whether or not a sub sequence within a sequence exists.

This can be useful for determining whether or not a sub sequence of nucleotides exist within a sequence of nucleotides.

It is very memory efficient and exact matches on reads are fast.

There is also some room for backtracking on seeming mismatches for some fuzzy sub sequence searching and maybe some other opportunities to tune the BWT like reporting the N number of locations where this sub sequence exists.

Maybe the exact case is a good first step? Do we want to report the location of the match as well- seems important?

Koeng101 commented 9 months ago

There is also some room for backtracking on seeming mismatches for some fuzzy sub sequence searching and maybe some other opportunities to tune the BWT like reporting the N number of locations where this sub sequence exists.

On the alignment side, I think this basically becomes bwa https://bio-bwa.sourceforge.net . What I think could be neat is if you could somehow make it get 98% matching working, so it could be used for #396 in auto-annotating features. Right now that is done in plannotate with BLAST, but I'm pretty sure it is done that way because BLAST is an easy tool to just import and use. This could be a good application!

For actual large-scale sequence alignment, minimap2 is probably a better route - tried and true with nanopore-type alignment, which IMO is the best upcoming DNA sequencing method. (Also it's what I use right now)

TimothyStiles commented 9 months ago

@TimothyStiles might have some other intentions in mind, but Burrows Wheeler Transform (BWT) can be used as a data structure to query whether or not a sub sequence within a sequence exists.

This can be useful for determining whether or not a sub sequence of nucleotides exist within a sequence of nucleotides.

It is very memory efficient and exact matches on reads are fast.

There is also some room for backtracking on seeming mismatches for some fuzzy sub sequence searching and maybe some other opportunities to tune the BWT like reporting the N number of locations where this sub sequence exists.

Maybe the exact case is a good first step? Do we want to report the location of the match as well- seems important?

@TwFlem I'd love to see you take a crack at this. At a conference on my phone rn.

What we're ultimately looking for is a fast BWT implementation that essentially compresses our sequence to be used for search, alignment, etc.

BWT is common enough that LLMs can create a decent naive implementation that can be easily unit tested (I've actually done this).

What I'd suggest is that we get a naive implementation going that we can merge early with high test coverage and documentation then optimize that implementation with @matiasinsaurralde's or @soypat's, or @/josharian's (on discord) input. That way we don't get stuck on optimization early and different people can focus on optimization, and downstream applications at the same time without blocking each other.

github-actions[bot] commented 6 months ago

This issue has had no activity in the past 2 months. Marking as stale.

bebop / poly

Implement Burrows-Wheeler transform #360