BioJulia / Bio.jl

[DEPRECATED] Bioinformatics and Computational Biology Infrastructure for Julia
http://biojulia.dev
MIT License
261 stars 65 forks source link

How to get kmer counts of a particular kmer? #456

Closed CorySimon closed 7 years ago

CorySimon commented 7 years ago

How can I search for the given occurrence of a given kmer, say dna"ACGT"?

search(dna"ACGTTTCGAACGT", dna"AGCT") only yields the first match.

From the docs here, this is a solution:

comp = composition(each(DNAKmer{4}, dna"ACGTTTCGAACGT"))
comp[DNAKmer("ACGT")] # gives 2 , w00t

but it's very inefficient if I am only interested in one kmer. Can you please tell me how to do for only DNAKmer("ACGT")? I can update the docs with this example you provide.

By the way, at the bottom of the docs here, it looks like there are errors being thrown in the second box from the bottom.

TransGirlCodes commented 7 years ago

You can use the part of your code above with an ad-hoc for loop:

count = 0
query = DNAKmer("ACGT")
for kmer in each(DNAKmer{4}, dna"ACGTTTCGAACGT")
    count += (kmer == query)
end

Thanks for pointing out the docs issue. Bio.jl is being decomposed into a more modular and maintainable set of software over the coming months. Kmer tools are going to be moved and further developed in BioJulia/Kmers.jl in the future.

CorySimon commented 7 years ago

@Ward9250 thanks! The above code yielded count=0. Calling the second element works though:

count = 0
query = DNAKmer("ACGT")
for kmer in each(DNAKmer{4}, dna"ACGTTTCGAACGT")
    count += (kmer[2] == query)
end
TransGirlCodes commented 7 years ago

Ah yes that might be because kmer isn't just the kmer sequence value, other things might be returned by the iterator too.

kdm9 commented 7 years ago

@Ward9250 yep, eachkmer yields (postion, kmer)

TransGirlCodes commented 7 years ago

Here's a (bit lengthy) one-liner that collects the kmer instances of interest into an array, so you know how many there are, and where they are.

map(y -> y[1], filter((x) -> x[2] == query, each(DNAKmer{4}, dna"ACGTTTCGAACGT"))))

The eachkmer iterator is wrapped using the "filter" method, meaning only iterations where the kmer matches the query are let through, this filtered iterator is then used in the map function which collects the positions of the matching kmers.

CorySimon commented 7 years ago

Thanks everyone, we can close this.

Relatedly, I think we can use comp = composition(each(DNAKmer{4}, dna"AATTAATTAATTAATTAATT")) to build a tandem repeat classifier! https://tandem.bu.edu/trf/trfdesc.html

TransGirlCodes commented 7 years ago

Thanks for asking @CorySimon and for your interest in our project!