AntonOresten / VectorizedKmers.jl

Fast K-mer counting in Julia
https://AntonOresten.github.io/VectorizedKmers.jl/
MIT License

Kmers.jl #25

Open AntonOresten opened 1 year ago

AntonOresten commented 1 year ago

So I just found out about this neat in-development Kmers.jl package. It has a Kmer type, a very intricate type ecosystem, and fancy functions.

I ran a microbenchmark on a k-mer counting function built on top of it (below), and it was orders of magnitude slower than my own implementation, which reads the underlying data of the sequence directly.

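```julia
using BioSequences, Kmers, VectorizedKmers  # assumed sources of LongDNA, EveryKmer, KmerCountVector

function count_kmers2(seq::LongDNA{2}, k::Integer)
    kcv = KmerCountVector{4, k}()
    counts = kcv.counts
    # Walk over every overlapping k-mer using the Kmers.jl iterator.
    for (i, kmer) in EveryKmer(seq, Val{k}())
        # Use the k-mer's packed bits as a 0-based index, shifted to 1-based.
        counts[kmer.data[1] + one(UInt)] += 1
    end
    kcv
end
```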

image
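For contrast with the iterator-based function above, here is a minimal sketch of what a sliding-window count could look like. It is not the actual VectorizedKmers.jl code: the name `count_kmers_sliding`, the plain `Vector{UInt8}` input with one 2-bit code per base (A=0, C=1, G=2, T=3), and the cap on `k` are all assumptions made for illustration.

```julia
# Hypothetical sketch, not the VectorizedKmers.jl implementation.
# `codes` holds one base per element as a 2-bit value (A=0, C=1, G=2, T=3).
function count_kmers_sliding(codes::AbstractVector{UInt8}, k::Integer)
    # Arbitrary cap for this sketch: the count vector has 4^k entries.
    1 <= k <= 12 || throw(ArgumentError("k must be between 1 and 12 in this sketch"))
    counts = zeros(Int, 4^k)
    mask = UInt(4)^k - one(UInt)   # keeps only the lowest 2k bits
    kmer = zero(UInt)
    for (i, c) in enumerate(codes)
        # Shift the previous window left by one base and append the new base.
        kmer = ((kmer << 2) | UInt(c)) & mask
        # The window only holds a full k-mer once k bases have been read.
        if i >= k
            counts[kmer + 1] += 1
        end
    end
    return counts
end

counts = count_kmers_sliding(UInt8[0, 1, 2, 3, 0, 1], 2)  # "ACGTAC"
```

Each step reuses the previous window instead of rebuilding a k-mer object, so the inner loop is a shift, an OR, a mask, and one increment. In this sketch the first base of a k-mer ends up in the most significant bits of the index.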

Moreover, vectorized k-mer counting can only easily be done directly with Kmers.jl on sequences with 2 bits per base, because of how the data is stored when ambiguous nucleotides are allowed (4 bits per base). In my own implementation of vectorized k-mer counting on LongDNA{4}, I solve this by calling trailing_zeros on every 4 bits in the data vector to get the 2-bit representation of each base (this also disambiguates an ambiguous base to the first possible base in the order A, C, G, T). If this were done on DNAKmers with 4 bits per base, it'd just be a mess, really. A sliding window is the way to go. These Kmer types can probably be used for a generalized function that applies to all BioSequences, though. I'm still not sure how to handle ambiguous sequences...
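To make the trailing_zeros step concrete, here is a small sketch on a single 64-bit word. The helper name `nibbles_to_2bit` is made up for this example, and the layout is an assumption: one base per 4-bit nibble, lowest nibble first, with one bit per nucleotide (A = 0b0001, C = 0b0010, G = 0b0100, T = 0b1000) so that an ambiguous base is the OR of its candidates.

```julia
# Hypothetical helper, not part of VectorizedKmers.jl or Kmers.jl.
# Assumes one base per 4-bit nibble, lowest nibble first, encoded as
# A = 0b0001, C = 0b0010, G = 0b0100, T = 0b1000 (ambiguity = OR of bits).
function nibbles_to_2bit(word::UInt64, n::Integer)
    0 <= n <= 16 || throw(ArgumentError("a UInt64 holds at most 16 nibbles"))
    codes = Vector{UInt8}(undef, n)
    for i in 1:n
        nibble = (word >> (4 * (i - 1))) & 0xf
        # trailing_zeros finds the lowest set bit: A -> 0, C -> 1, G -> 2, T -> 3.
        # For an ambiguous base this picks the first candidate in A, C, G, T order.
        # An all-zero nibble (a gap) has no set bit and is not handled here.
        codes[i] = UInt8(trailing_zeros(nibble))
    end
    return codes
end

# Nibbles from lowest to highest: A, M (A or C), G, T, C
codes = nibbles_to_2bit(UInt64(0x28431), 5)
```

Here the ambiguous `M` (0b0011) collapses to A, matching the "first possible base" behaviour described above. A real implementation would also need to decide what to do with gaps.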

Sorry about the poor structure of this issue; I just wanted to get my thoughts out there. I may also have messed up the grammar, spelling, and capitalization a bit. Heh.

camilogarciabotero commented 1 year ago

Hey Anton,

Thanks for working on this package! There's been some discussion about Kmers.jl on Discourse and Slack (it would be nice if you could join and help improve the bioinformatics community). Anyway, the summary is that the package is still unfinished, but work is now in progress to finish it. My guess is that any discussion will hopefully be fruitful!