Closed TransGirlCodes closed 2 years ago
I don't have a strong intuition here - my naive understanding was that kmers were most computationally useful since you don't have to worry about gaps - if it doesn't match, it doesn't match. But if there are use-cases for it, would be awesome to have the flexibility.
Perhaps just make the gapped versions really hard to make by accident (eg have everything default to 2 bit or error, unless the gaps are explicitly asked for)?
So if you use the shorthand aliases for Kmers, lets say DNAKmer{31}(seq)
, where seq is a 4 bit sequence created with dna"", or read from a fasta file. That would create a kmer using the 2-bit alphabet using the rules I set above, as the aliases for DNA and RNA kmers use the two-bit alphabets as default. To force a 4 bit alphabet you'd have to explicitly ask for it by fully specifiying the kmer type parameterization.
I think yes, it does make sense. Most applications won't use them, and that's OK. That's why we have 2-bit alphabets! But It'd be nice if Kmers were simply stack-allocated generic BioSequences
with statically-known lengths. That would also imply AminoAcidAlphabet
kmers.
Ok, so I'm starting to think that to vastly simplify the codebase and make the logic of kmer iteration clear.
We simply reduce things to iterating over Kmers{A} in LongSequence{A}.
This should make looping over kmer iterators a bit more optimizable - known lengths etc.
If you want Kmers with no ambiguous symbols - don't iterate over sequence types where ambiguous symbols are a possibility. Rather than mucking around with - oh but if it's 4 bit DNA if you hit ambiguity scan until you fill the kmer - the kmer which does not - strictly speaking, exist in the data, and mess with the number of iterations and kmers yielded etc.
While that is a nicer approach principally, most applications I know of takes the approach of skipping ambiguous kmers. That is to say, people who use the library probably want functionality where non-ambiguous kmers are skipped.
That being said, it makes perfect sense to make a simpler iterator which just reads directly from the biosequence, and then another iterator which skips ambiguous symbols.
I'm already going to be doing separate "canonical" iterators. So the case where the user just wants the kmers on the sequence going forward is much simpler. So I think two basic iterators (overlapping and spaced), and then "special behaviour" iterators are ok.
As you say, perhaps we can do some kind of additional iterator or filter. Simple cases, produce the ambiguous kmers, and filter the vector afterwards, or in the case of kmer counters splitting kmers into bins or whatever, we can have the container or data-store type just not include any ambiguous kmers they recieve.
Either way, I think let's do simple first, get v0.1 and then go onwards.
Ok, so for the iterators, there is now:
EveryKmer{K,S} where {K<:Kmer, S<:BioSequence}
EveryCanonicalKmer{K,S} where {K<:Kmer, S<:BioSequence}
The semantics defined for these iterators is they iterate over every valid kmer in a sequence.
So for nucleotides, that means the following behaviour:
If is Sequence{A}
and Kmer{A}
- the most common scenario - simple, predictable length iterator, no encoding or decoding, just bit shifting packing into kmer.
If Sequence{NucleicAcidAlphabet{4}}
and Kmer{NucleicAcidAlphabet{2}}
, the iterator has Base.SizeUnknown()
, and dispatches to an interaction method that skips over ambiguous sites in the 4-bit sequence.
This is because the behaviours of these iterators are they return every valid Kmer{NucleicAcidAlphabet{2}}
in the input sequence, just as for Sequence{A}
and Kmer{A}
, it's just in order to fulfil that brief, they do skipping
But semantically, the behaviour is consistent, no need to flip a flag for skipping behaviour or anything like that.
Same will go for the Stepped Kmers.
We'll make sure to illustrate this in the manual.
Ok So now we have the iterators generally enforcing alphabet consistency between longseq and kmer type, but with exceptions to permit skipping for when kmers are DNA2 and the longseq is DNA4.
Originally they were not, and for my purposes - e.g. genome assembly from raw reads, we don't expect to see them. But I could see applications where they are wanted, e.g. if we allow 4-bit DNA kmers, variation analysis etc.
So my idea is this:
How does this sound?