TGAC / KAT

The K-mer Analysis Toolkit (KAT) contains a number of tools that analyse and compare K-mer spectra.
http://www.earlham.ac.uk/kat-tools
GNU General Public License v3.0

Incorrect regions extracted from kat sect with -E flag #148

Open kdelwat opened 4 years ago

kdelwat commented 4 years ago

I have two FASTA files, 1 and 2:

> 1
AAAAAAAACTCTCAAAACCCCCAAA
> 2
CCCCCGGGGGCTCTCGGGGGGGGGGGGGGGG

With k = 5, I would expect two shared 5mers between these files: CTCTC and CCCCC.
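That expectation can be checked independently of KAT. The following is a minimal Python sketch, not KAT code: the helper name kmers is hypothetical, and it assumes k-mers are compared exactly as read, with no reverse-complement canonicalisation.

```python
def kmers(seq, k):
    """Return the set of k-mers in seq, exactly as read (no reverse complements)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


# The two sequences from the FASTA files above.
seq1 = "AAAAAAAACTCTCAAAACCCCCAAA"
seq2 = "CCCCCGGGGGCTCTCGGGGGGGGGGGGGGGG"

# K-mers present in both sequences.
shared = sorted(kmers(seq1, 5) & kmers(seq2, 5))
print(shared)  # ['CCCCC', 'CTCTC']
```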

I run KAT with the command kat sect -E -m 5 -N 1.fa 2.fa and get the following k-mer counts for sequence 1:

> 1
0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0

This indicates that the 5mers starting at positions 9 and 18 (1-based) are shared, which is exactly what we would expect. I also passed the -E flag to KAT, which extracts non-repetitive regions (count = 1) to a separate FASTA file:

> 1___region:1_length:4_pos:9:13_cov:1-2
CCTC
> 1___region:2_length:4_pos:18:22_cov:1-2
CCCC

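Here, the positions of the shared 5mers are correct, but the lengths and sequences are not: each region is reported with length 4 instead of 5, and the extracted sequences are CCTC and CCCC rather than the expected CTCTC and CCCCC.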
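To make the expected output concrete, here is a minimal Python sketch of what the counts and extracted regions should look like. It is an illustration of the expected behaviour, not KAT's actual algorithm: the function names are hypothetical, k-mers are assumed to be compared as read, and a region is assumed to span from the first base of the first k-mer in a run to the last base of the last k-mer in that run.

```python
from collections import Counter


def position_counts(query, target, k):
    """For each k-mer start position in query, count its occurrences in target."""
    counts = Counter(target[i:i + k] for i in range(len(target) - k + 1))
    return [counts[query[i:i + k]] for i in range(len(query) - k + 1)]


def expected_regions(query, target, k, cov=1):
    """Return (start, end, sequence) for each run of k-mers with count == cov.

    start and end are 1-based and inclusive, matching the pos field above.
    """
    regions = []
    run_start = None
    # The trailing None is a sentinel that closes a run ending at the last k-mer.
    for i, count in enumerate(position_counts(query, target, k) + [None]):
        if count == cov:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            end = (i - 1) + k  # 0-based exclusive end of the last k-mer in the run
            regions.append((run_start + 1, end, query[run_start:end]))
            run_start = None
    return regions


seq1 = "AAAAAAAACTCTCAAAACCCCCAAA"
seq2 = "CCCCCGGGGGCTCTCGGGGGGGGGGGGGGGG"

print(position_counts(seq1, seq2, 5))
for start, end, seq in expected_regions(seq1, seq2, 5):
    print(f"pos:{start}:{end} length:{len(seq)} {seq}")
# pos:9:13 length:5 CTCTC
# pos:18:22 length:5 CCCCC
```

Under these assumptions the per-position counts match the kat sect output, while both regions come out with length 5 rather than the length 4 reported in the extracted FASTA file.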