brentp / nim-kmer

DNA kmer operations for nim
MIT License
14 stars 2 forks source link

Tweaks? #1

Open sdwfrost opened 4 years ago

sdwfrost commented 4 years ago

Dear @brentp

Nice Nim work, as usual. I just had a few queries/suggestions:

  1. Can I ask why the encoding is the way it is? It seems strange to code all non-ACGT as A rather than N - is there a downstream reason for this? The indexing is different from the ACGTN of 0-4 used by Heng Li, and I wondered why you diverged from this.
  2. It would be useful to add functionality for minimizers - following the minimap2 algorithm, one uses an invertible hash function e.g.
proc hash*(k: uint64): uint64 =
  var ky = k
  ky = (not ky) + (ky shl 21)
  ky = ky xor (ky shr 24)
  ky = (ky + (ky shl 3)) + (ky shl 8)
  ky = ky xor (ky shr 14)
  ky = (ky + (ky shl 2)) + (ky shl 4)
  ky = ky xor (ky shr 28)
  ky = ky + (ky shl 31)
  return ky

proc inverse_hash*(k: uint64): uint64 =
  var tmp: uint64 = 0
  var ky = k
  tmp = ky - (ky shl 31)
  ky = ky - (tmp shl 31)
  tmp = ky xor ky shr 28
  ky = ky xor tmp shr 28
  ky = ky * 14933078535860113213'u64
  tmp = ky xor ky shr 14
  tmp = ky xor tmp shr 14
  tmp = ky xor tmp shr 14
  ky = ky xor tmp shr 14
  ky = ky * 15244667743933553977'u64
  tmp = ky xor ky shr 24
  ky = ky xor tmp shr 24
  tmp = not ky
  tmp = not (ky - (tmp shl 21))
  tmp = not (ky - (tmp shl 21))
  ky = not (ky - (tmp shl 21))
  return ky

It's then straightforward to make minimizers. Is this something you'd be happy to add?

brentp commented 4 years ago

Hi Simon,

  1. I can't recall why I chose this encoding and I'm open to changing it. I think I looked at the squeakr (https://github.com/splatlab/squeakr/blob/master/include/kmer.h) encoding when I was building this. Does the lh3 encoding allow reverse-complement by just flipping all the bits? I do want to keep 2-bit encoding. If we have a separate bit for N, then 2-bit won't work.

  2. this sounds great to me. adding @zeeev and cdunn2001 as they have a similar library that also does minimizer stuff (though I can't find it now)

sdwfrost commented 4 years ago

Dear Brent,

  1. The coding doesn't matter greatly, but I'm trying to retain compatibility with minimap2. I deal a lot with viruses, that have lots of ambiguities, so I'm leaning towards 4-bit representation along the lines of [naf](https://github.com/KirillKryukov/naf]. In minimap2, minimizers that contain Ns are skipped. I'll have to hunt down what happens with query sequences.

  2. Thanks for the pointer (I'll add @cdunn2001 esp as he's local to me). I have to take a break for a few days, but I can add and put in a PR if there isn't anything else out there.

cdunn2001 commented 4 years ago

Sorry for slow response. (I am also @pb-cdunn )

We had some stuff here. But it's not really ready to be used as a sub-module yet.

I agree with 2-bit encoding, and ours uses 2-bit also. I'd prefer a whole new library for dealing with Ns. The efficiency really is that important.

sdwfrost commented 4 years ago

Thanks @cdunn2001! Perhaps as a compromise for now we can encode ambiguities by resolving e.g. to the smallest lexicographic character (Y/y = C etc.)? For future discussion, perhaps we can have generics for 2bit vs. 4bit representations?