alexpreynolds / kmer-counter

Count kmers with a more efficient (faster) hash table
MIT License
24 stars 5 forks source link

Incorrect Palindrome counts #5

Closed mmcguffi closed 5 years ago

mmcguffi commented 5 years ago

I have found this very useful thank you! However, your palindrome counts are incorrect.

eg, running the script on this sequence "CTAGA" for length 4, returns: {'CTAG': 1, 'TAGA': 1, 'TCTA': 1} while the correct answer would be: {'CTAG': 2, 'TAGA': 1, 'TCTA': 1}

the simple python fix for this is: d={k if Seq(k)==Seq(k).reverse_complement() else k:v*2 if Seq(k)==Seq(k).reverse_complement() else v for k, v in d.items()}

Though I cannot comment on the c++ implementation

alexpreynolds commented 5 years ago

Thanks for the feedback. This code decides that a palindrome would not be counted twice: https://github.com/alexpreynolds/kmer-counter/blob/master/kmer-counter.cpp#L212-L218

$ more /tmp/test.fa
>foo
CTAGA
$ ../kmer-counter --no-rc --k=4 --fasta /tmp/test.fa
>foo    CTAG:1 TAGA:1

Even when not using --no-rc, CTAG would still be a "canonical" mer either way, I'd think, which is why I still only count it once.

$ ../kmer-counter --k=4 --fasta /tmp/test.fa
>foo    CTAG:1 TAGA:1 TCTA:1

I'm on the fence about double-counting palindromes. Some have said it would be an error to do so. Maybe it should be a user-specified option?

mmcguffi commented 5 years ago

As far as as I see it, it is not a double count? In this particular situation 2 strands are counted which correspond to:

CTAGA & TCTAG

which means there are 2 tetramers per strand, and therefore 4 tetramers total, with CTAG being on both the top and bottom strands.

Perhaps I am naive about how some bioinformaticians typically score this informational content, but the method I outlined is the approach that is useful to me. Im not sure if I understand the other side of the argument, but a command line flag would certainly solve the issue for me and be much appreciated!

alexpreynolds commented 5 years ago

Added commit 28da9bc with --double-count-palindromes option to force double-counting of palindromes. Please let me know if this helps you out or you still run into any problems.