COMBINE-lab / kmers

A bit-packed k-mer representation (and relevant utilities) for rust
BSD 3-Clause "New" or "Revised" License

Roadmap / TODO #6

Open rob-p opened 3 years ago

rob-p commented 3 years ago

This issue will provide a roadmap for the library, along with specific tasks (TODOs). Ideally we should break these into short-term and long-term tasks and, as the library matures, tie individual tasks to specific release candidates.

d-cameron commented 2 years ago

It would be great if we defined the scope of the library. Specifically:

If it's a genomics library, what's in scope:

I've been implementing a Rust OLC assembler and I've found that there are a lot of 2-bit sequence functions I need that aren't in other Rust libraries (such as the 10X Genomics debruijn library). They're not k-mer-based functions per se, but they are generally decomposable into k-mer-based ones (e.g. Hamming distance between sequences).
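To make the decomposition point concrete, here is a minimal sketch of Hamming distance over 2-bit packed sequences, computed one 32-base word at a time. The function names (`pack`, `hamming_word`, `hamming_distance`), the encoding (A=0, C=1, G=2, T=3) and the layout (base `i` in bits `2*(i % 32)` of word `i / 32`) are assumptions for illustration, not part of this crate or of rust-debruijn.

```rust
/// Pack an ACGT byte string into 2-bit codes, 32 bases per u64.
/// Hypothetical helper; the encoding A=0, C=1, G=2, T=3 is an assumption.
fn pack(seq: &[u8]) -> Vec<u64> {
    let mut words = vec![0u64; (seq.len() + 31) / 32];
    for (i, &b) in seq.iter().enumerate() {
        let code: u64 = match b {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => panic!("non-ACGT base"),
        };
        words[i / 32] |= code << (2 * (i % 32));
    }
    words
}

/// Number of differing bases between two packed words (up to 32 bases each).
fn hamming_word(a: u64, b: u64) -> u32 {
    let x = a ^ b;
    // A base differs if either of its two bits differs, so fold the high
    // bit of every pair onto the low bit and count the low bits that are set.
    ((x | (x >> 1)) & 0x5555_5555_5555_5555).count_ones()
}

/// Hamming distance between two packed sequences of equal length.
/// Unused high bits of the last word are zero in both inputs, so they
/// never contribute to the count.
fn hamming_distance(a: &[u64], b: &[u64]) -> u32 {
    assert_eq!(a.len(), b.len(), "sequences must have the same packed length");
    a.iter().zip(b).map(|(&x, &y)| hamming_word(x, y)).sum()
}
```

For example, `hamming_distance(&pack(b"ACGT"), &pack(b"ACCT"))` compares the sequences word by word and counts the single G/C mismatch.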

rob-p commented 2 years ago

Hi @d-cameron,

Thanks for bringing these up. I think it's a great point. I am certainly not envisioning this as a general string k-mer library. However, I would like to get input from others on whether we should support something in addition to the standard DNA alphabet. Specifically, I think there could be legitimate uses for a code path that supports, e.g., a protein alphabet.

The use cases I am most interested in, however, are in the standard 4-nucleotide alphabet. Regarding the encoding scheme, @Daniel-Liu-c0deb0t raised the issue in #3, and there was a bit of discussion of the relative merits of different schemes. I'd certainly be interested in any input you have on this.

Finally, while I intend for the focus of this library to be efficient k-mer creation, storage, manipulation and processing, I am absolutely open to having relevant functionality incorporated as either part of this library or as part of a sister crate.

--Rob

d-cameron commented 2 years ago

I intend for the focus of this library to be efficient k-mer creation

rust-debruijn appears to have a similar scope with specialised structs for small(ish) kmers. One consequence of this is that they have a 2-bit encoding for genomic sequences to enable efficient kmer extraction (e.g. sequence.kmer(offset)). Unfortunately, since it's a de Bruijn graph targeted crate, there's not a lot of support for doing stuff on these sequences other than extracting kmers.
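As a rough illustration of why a 2-bit sequence encoding makes k-mer extraction cheap, the sketch below pulls a k-mer out of a packed sequence with a couple of shifts and a mask. `PackedSeq`, `from_ascii` and `kmer(offset, k)` are hypothetical names and are not the rust-debruijn API; the encoding (A=0, C=1, G=2, T=3) and the layout (first base of the k-mer in the lowest bits) are likewise assumptions.

```rust
/// Hypothetical 2-bit packed DNA sequence: 32 bases per u64,
/// base `i` stored in bits `2*(i % 32)` of word `i / 32`.
struct PackedSeq {
    words: Vec<u64>,
    len: usize, // number of bases
}

impl PackedSeq {
    /// Build from ASCII; returns None on any byte other than A, C, G, T.
    fn from_ascii(seq: &[u8]) -> Option<PackedSeq> {
        let mut words = vec![0u64; (seq.len() + 31) / 32];
        for (i, &b) in seq.iter().enumerate() {
            let code: u64 = match b {
                b'A' => 0,
                b'C' => 1,
                b'G' => 2,
                b'T' => 3,
                _ => return None,
            };
            words[i / 32] |= code << (2 * (i % 32));
        }
        Some(PackedSeq { words, len: seq.len() })
    }

    /// Extract the k-mer (1 <= k <= 32) starting at `offset` as a u64,
    /// with the first base in the lowest two bits.
    fn kmer(&self, offset: usize, k: usize) -> Option<u64> {
        if k == 0 || k > 32 || offset.checked_add(k)? > self.len {
            return None;
        }
        let word = offset / 32;
        let shift = 2 * (offset % 32);
        let mut bits = self.words[word] >> shift;
        // The k-mer may straddle two words; pull the rest from the next one.
        if shift > 0 && word + 1 < self.words.len() {
            bits |= self.words[word + 1] << (64 - shift);
        }
        let mask = if k == 32 { u64::MAX } else { (1u64 << (2 * k)) - 1 };
        Some(bits & mask)
    }
}
```

Because extraction touches at most two words regardless of `k`, the cost is constant per k-mer, which is the property a k-mer-focused crate typically wants from its sequence type.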

If this library wants to take a similar approach, that's fine, but if it does, it would be great if it supported/integrated with sequences encoded by crates that have more comprehensive feature sets. I believe support for the various encodings of [u8], [u64] slices should be sufficient.
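One way such integration could look is a small trait that hides the storage format, so the same k-mer extraction works on an ASCII `[u8]` slice and on a 2-bit packed `[u64]` slice. Everything here (`Bases`, `Ascii`, `Packed`, `kmer_at`) is a hypothetical sketch under an assumed A=0, C=1, G=2, T=3 encoding, not an existing or planned API of this crate.

```rust
/// Hypothetical abstraction over sequence storage formats.
trait Bases {
    /// Number of bases in the sequence.
    fn len(&self) -> usize;
    /// 2-bit code of base `i` (A=0, C=1, G=2, T=3), or None if `i` is
    /// out of range or the base is not one of A, C, G, T.
    fn code(&self, i: usize) -> Option<u8>;
}

/// One ASCII byte per base.
struct Ascii<'a>(&'a [u8]);

/// 32 bases per u64, base `i` in bits `2*(i % 32)` of word `i / 32`.
struct Packed<'a> {
    words: &'a [u64],
    len: usize,
}

impl Bases for Ascii<'_> {
    fn len(&self) -> usize {
        self.0.len()
    }
    fn code(&self, i: usize) -> Option<u8> {
        match *self.0.get(i)? {
            b'A' | b'a' => Some(0),
            b'C' | b'c' => Some(1),
            b'G' | b'g' => Some(2),
            b'T' | b't' => Some(3),
            _ => None,
        }
    }
}

impl Bases for Packed<'_> {
    fn len(&self) -> usize {
        self.len
    }
    fn code(&self, i: usize) -> Option<u8> {
        if i >= self.len {
            return None;
        }
        let word = *self.words.get(i / 32)?;
        Some(((word >> (2 * (i % 32))) & 3) as u8)
    }
}

/// Extract the k-mer (1 <= k <= 32) at `offset` from any `Bases` source,
/// with the first base in the lowest two bits of the result.
fn kmer_at<B: Bases>(seq: &B, offset: usize, k: usize) -> Option<u64> {
    if k == 0 || k > 32 || offset.checked_add(k)? > seq.len() {
        return None;
    }
    let mut kmer = 0u64;
    for j in 0..k {
        kmer |= (seq.code(offset + j)? as u64) << (2 * j);
    }
    Some(kmer)
}
```

The per-base loop is deliberately simple; a real implementation would likely specialise the packed case to use word-level shifts, at the cost of a less uniform interface.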