Open sdwfrost opened 4 years ago
Hi Simon,
I can't recall why I chose this encoding and I'm open to changing it. I think I looked at the squeakr (https://github.com/splatlab/squeakr/blob/master/include/kmer.h) encoding when I was building this. Does the lh3 encoding allow reverse-complement by just flipping all the bits? I do want to keep 2-bit encoding. If we have a separate bit for N, then 2-bit won't work.
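For reference, a minimal sketch of the bit-flipping question, assuming the A=0, C=1, G=2, T=3 encoding that minimap2 uses. The helper names (`encode`, `decode`, `complement`, `reverse_complement`) are hypothetical and not taken from either library. Under that assumption, flipping all bits gives the complement, while the reverse-complement additionally needs the order of the 2-bit groups reversed.

```python
# Assumed 2-bit encoding (the one minimap2 uses): A=0, C=1, G=2, T=3.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODE = "ACGT"


def encode(kmer: str) -> int:
    """Pack a k-mer into an integer, first base in the highest 2 bits."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value


def decode(value: int, k: int) -> str:
    """Unpack a 2-bit encoded k-mer of length k back into a string."""
    return "".join(DECODE[(value >> (2 * (k - 1 - i))) & 3] for i in range(k))


def complement(value: int, k: int) -> int:
    """Flip all 2k bits; each 2-bit code c becomes 3 - c (A<->T, C<->G)."""
    mask = (1 << (2 * k)) - 1
    return ~value & mask


def reverse_complement(value: int, k: int) -> int:
    """Complement each base and reverse the order of the 2-bit groups."""
    rc = 0
    for _ in range(k):
        rc = (rc << 2) | (3 - (value & 3))
        value >>= 2
    return rc


if __name__ == "__main__":
    code = encode("AACG")
    print(decode(complement(code, 4), 4))
    print(decode(reverse_complement(code, 4), 4))
```

The complement is a single mask operation here only because each base and its complement sum to 3 in this encoding; other 2-bit assignments would need a lookup instead.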
This sounds great to me. Adding @zeeev and @cdunn2001, as they have a similar library that also does minimizer stuff (though I can't find it now).
Dear Brent,
The coding doesn't matter greatly, but I'm trying to retain compatibility with minimap2. I deal a lot with viruses, which have lots of ambiguities, so I'm leaning towards a 4-bit representation along the lines of [naf](https://github.com/KirillKryukov/naf). In minimap2, minimizers that contain Ns are skipped. I'll have to hunt down what happens with query sequences.
Thanks for the pointer (I'll add @cdunn2001 esp as he's local to me). I have to take a break for a few days, but I can add and put in a PR if there isn't anything else out there.
Sorry for slow response. (I am also @pb-cdunn )
We had some stuff here. But it's not really ready to be used as a sub-module yet.
I agree with 2-bit encoding, and ours uses 2-bit also. I'd prefer a whole new library for dealing with Ns. The efficiency really is that important.
Thanks @cdunn2001! Perhaps as a compromise for now we can encode ambiguities by resolving each code to the lexicographically smallest base it represents (Y/y = C, etc.)? For future discussion, perhaps we can have generics for 2-bit vs. 4-bit representations?
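A minimal sketch of that compromise, assuming the standard IUPAC nucleotide codes. The table name `SMALLEST_BASE` and the function `resolve_ambiguities` are hypothetical. Each ambiguity code maps to the alphabetically first base in the set it stands for, after which the sequence can go through an ordinary 2-bit encoder.

```python
# IUPAC nucleotide codes mapped to the lexicographically smallest base
# among those each code can represent (e.g. Y = C or T, so Y -> C).
SMALLEST_BASE = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "A",  # A or G
    "Y": "C",  # C or T
    "S": "C",  # C or G
    "W": "A",  # A or T
    "K": "G",  # G or T
    "M": "A",  # A or C
    "B": "C",  # C, G or T
    "D": "A",  # A, G or T
    "H": "A",  # A, C or T
    "V": "A",  # A, C or G
    "N": "A",  # any base
}


def resolve_ambiguities(seq: str) -> str:
    """Replace every IUPAC ambiguity code with its smallest concrete base.

    Input is case-insensitive and the result is upper case. Characters
    outside the table (gaps, for instance) raise KeyError.
    """
    return "".join(SMALLEST_BASE[base] for base in seq.upper())


if __name__ == "__main__":
    print(resolve_ambiguities("acgtYRn"))
```

One consequence worth noting is that every N becomes A, so runs of Ns turn into poly-A k-mers rather than being skipped.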
Dear @brentp
Nice Nim work, as usual. I just had a few queries/suggestions:
It's then straightforward to make minimizers. Is this something you'd be happy to add?