Should gaps be permitted in Kmers?

TransGirlCodes commented 2 years ago

Originally they were not, and for my purposes - e.g. genome assembly from raw reads, we don't expect to see them. But I could see applications where they are wanted, e.g. if we allow 4-bit DNA kmers, variation analysis etc.

So my idea is this:

[x] Building a 2 bit kmer from a 2 bit long seq = straight forward, basically just taking the bits, putting them in the kmer.
[x] Building a 2 bit kmer from a 4 bit seq = take nucs from 4 bit seq, encode, and put into kmer - skipping over any non 2-bit permitted chars until full.
[x] Building a 4 bit kmer from a 4 bit seq = take bits from seq and put in kmer - so allow all characters.
[x] Building a 4 bit seq from a 2 bit seq = just take nucs, recode as 4 bits, put into kmer.

How does this sound?

kescobo commented 2 years ago

I don't have a strong intuition here - my naive understanding was that kmers were most computationally useful since you don't have to worry about gaps - if it doesn't match, it doesn't match. But if there are use-cases for it, would be awesome to have the flexibility.

Perhaps just make the gapped versions really hard to make by accident (eg have everything default to 2 bit or error, unless the gaps are explicitly asked for)?

TransGirlCodes commented 2 years ago

So if you use the shorthand aliases for Kmers, lets say DNAKmer{31}(seq), where seq is a 4 bit sequence created with dna"", or read from a fasta file. That would create a kmer using the 2-bit alphabet using the rules I set above, as the aliases for DNA and RNA kmers use the two-bit alphabets as default. To force a 4 bit alphabet you'd have to explicitly ask for it by fully specifiying the kmer type parameterization.

jakobnissen commented 2 years ago

I think yes, it does make sense. Most applications won't use them, and that's OK. That's why we have 2-bit alphabets! But It'd be nice if Kmers were simply stack-allocated generic BioSequences with statically-known lengths. That would also imply AminoAcidAlphabet kmers.

TransGirlCodes commented 2 years ago

Ok, so I'm starting to think that to vastly simplify the codebase and make the logic of kmer iteration clear.

We simply reduce things to iterating over Kmers{A} in LongSequence{A}.

This should make looping over kmer iterators a bit more optimizable - known lengths etc.

If you want Kmers with no ambiguous symbols - don't iterate over sequence types where ambiguous symbols are a possibility. Rather than mucking around with - oh but if it's 4 bit DNA if you hit ambiguity scan until you fill the kmer - the kmer which does not - strictly speaking, exist in the data, and mess with the number of iterations and kmers yielded etc.

jakobnissen commented 2 years ago

While that is a nicer approach principally, most applications I know of takes the approach of skipping ambiguous kmers. That is to say, people who use the library probably want functionality where non-ambiguous kmers are skipped.

That being said, it makes perfect sense to make a simpler iterator which just reads directly from the biosequence, and then another iterator which skips ambiguous symbols.

TransGirlCodes commented 2 years ago

I'm already going to be doing separate "canonical" iterators. So the case where the user just wants the kmers on the sequence going forward is much simpler. So I think two basic iterators (overlapping and spaced), and then "special behaviour" iterators are ok.

As you say, perhaps we can do some kind of additional iterator or filter. Simple cases, produce the ambiguous kmers, and filter the vector afterwards, or in the case of kmer counters splitting kmers into bins or whatever, we can have the container or data-store type just not include any ambiguous kmers they recieve.

TransGirlCodes commented 2 years ago

Either way, I think let's do simple first, get v0.1 and then go onwards.

TransGirlCodes commented 2 years ago

Ok, so for the iterators, there is now:

EveryKmer{K,S} where {K<:Kmer, S<:BioSequence} EveryCanonicalKmer{K,S} where {K<:Kmer, S<:BioSequence}

The semantics defined for these iterators is they iterate over every valid kmer in a sequence.

So for nucleotides, that means the following behaviour: If is Sequence{A} and Kmer{A} - the most common scenario - simple, predictable length iterator, no encoding or decoding, just bit shifting packing into kmer.

If Sequence{NucleicAcidAlphabet{4}} and Kmer{NucleicAcidAlphabet{2}}, the iterator has Base.SizeUnknown(), and dispatches to an interaction method that skips over ambiguous sites in the 4-bit sequence.

This is because the behaviours of these iterators are they return every valid Kmer{NucleicAcidAlphabet{2}} in the input sequence, just as for Sequence{A} and Kmer{A}, it's just in order to fulfil that brief, they do skipping But semantically, the behaviour is consistent, no need to flip a flag for skipping behaviour or anything like that.

Same will go for the Stepped Kmers.

We'll make sure to illustrate this in the manual.

TransGirlCodes commented 2 years ago

Ok So now we have the iterators generally enforcing alphabet consistency between longseq and kmer type, but with exceptions to permit skipping for when kmers are DNA2 and the longseq is DNA4.

BioJulia / Kmers.jl

Should gaps be permitted in Kmers? #11