TGAC / KAT

The K-mer Analysis Toolkit (KAT) contains a number of tools that analyse and compare K-mer spectra.
http://www.earlham.ac.uk/kat-tools
GNU General Public License v3.0

kat comp reports specific k-mers between two FASTQ files that contain the same reads in a different order (with a reproducible example) #188

Open jfouret opened 5 months ago

jfouret commented 5 months ago

Hi,

Thank you for this tool. I wanted to use kat comp to validate another tool for lossless compression of FASTQ reads. That tool reorders the reads but should otherwise be lossless. I was surprised to see specific k-mers after decompression, so to rule out an artefact I wanted to confirm that giving kat comp two identical sets of reads, in a different order, yields 0 specific k-mers.
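The expectation above rests on k-mer counting being independent of read order. A toy illustration follows, as a sketch only: `kmer_counts` is a hypothetical helper, not part of KAT, and it counts plain k-mers without the canonical (reverse-complement) handling that a real counter may apply.

```shell
# Hypothetical helper (not part of KAT): count every k-mer in the reads
# given on stdin, one read per line. $1 is the k-mer length.
kmer_counts() {
  awk -v k="$1" '{
    # Slide a window of width k along the read and print each k-mer.
    for (i = 1; i + k - 1 <= length($0); i++) print substr($0, i, k)
  }' | LC_ALL=C sort | uniq -c
}

# The same two toy reads, given in two different orders.
original=$(printf 'ACGTAC\nTTGACG\n' | kmer_counts 3)
shuffled=$(printf 'TTGACG\nACGTAC\n' | kmer_counts 3)

if [ "$original" = "$shuffled" ]; then
  echo "identical k-mer counts"
else
  echo "k-mer counts differ"
fi
```

Because the k-mers are sorted before counting, the order of the input reads cannot affect the result, which is why no specific k-mers are expected for a shuffled copy of the same file.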

However, that is not what I observed. Maybe I made a mistake somewhere, or there may be an artefact in kat comp.

Below is the code to reproduce my results:

SRR=SRR14237206
apptainer run docker://ncbi/sra-tools prefetch $SRR
apptainer run docker://ncbi/sra-tools fasterq-dump $SRR \
  --split-files --progress
pigz -p 8 ${SRR}_* 
mkdir fastq ; mv ${SRR}_*.fastq.gz fastq/
mkdir shuffle
apptainer run docker://staphb/seqkit seqkit shuffle fastq/${SRR}_1.fastq.gz --out-file shuffle/${SRR}_1.fastq.gz
apptainer run docker://ghcr.io/nexomis/kat:2.4.1 comp -N -O -H 1000000000 -I 1000000000 -t 12 fastq/${SRR}_1.fastq.gz shuffle/${SRR}_1.fastq.gz
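Independently of KAT, one way to check that the shuffled file really holds the same reads is to compare an order-independent checksum of both files. This is a minimal sketch, not something that was run for this report: `fastq_digest` is a hypothetical helper name, it assumes strict four-line FASTQ records, and it relies on `md5sum` from GNU coreutils being available.

```shell
# Hypothetical helper: print a checksum of a gzipped FASTQ file that does not
# depend on the order of the records. Assumes each record spans exactly 4 lines.
fastq_digest() {
  gzip -dc "$1" |
    paste - - - - |   # join the 4 lines of each record into one tab-separated line
    LC_ALL=C sort |   # sort records so the original order no longer matters
    md5sum |
    cut -d' ' -f1     # keep only the hex digest
}

# Intended usage with the files from the commands above (not executed here):
#   fastq_digest fastq/${SRR}_1.fastq.gz
#   fastq_digest shuffle/${SRR}_1.fastq.gz
```

If the two digests match, the files contain the same records and any difference reported by kat comp cannot come from the shuffling step itself. Sorting a file of this size needs temporary disk space.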

I got these results:

$ apptainer run docker://ghcr.io/nexomis/kat:2.4.1 comp -N -O -H 1000000000 -I 1000000000 -t 12 fastq/${SRR}_1.fastq.gz shuffle/${SRR}_1.fastq.gz
INFO:    Using cached SIF image
Kmer Analysis Toolkit (KAT) V2.4.1

Running KAT in COMP mode
------------------------

Input 1 is a sequence file.  Counting kmers for input 1 (fastq/SRR14237206_1.fastq.gz) ... done.  Time taken: 32.2s

Input 2 is a sequence file.  Counting kmers for input 2 (shuffle/SRR14237206_1.fastq.gz) ... done.  Time taken: 34.2s

Comparing hashes ... done.  Time taken: 27.0s

Merging results ... done.  Time taken: 0.7s

Saving results to disk ... done.  Time taken: 0.3s

Summary statistics
------------------

K-mer statistics for: 
 - Hash 1: "fastq/SRR14237206_1.fastq.gz"
 - Hash 2: "shuffle/SRR14237206_1.fastq.gz"

Total K-mers in: 
 - Hash 1: 1464945516
 - Hash 2: 1464945516

Distinct K-mers in:
 - Hash 1: 364458959
 - Hash 2: 364458959

Total K-mers only found in:
 - Hash 1: 0
 - Hash 2: 131916277

Distinct K-mers only found in:
 - Hash 1: 0
 - Hash 2: 129068475

Shared K-mers:
 - Total shared found in hash 1: 1464945516
 - Total shared found in hash 2: 1464945516
 - Distinct shared K-mers: 364458959

Distance between spectra 1 and 2 (all k-mers):
 - Manhattan distance: 0
 - Euclidean distance: 0
 - Cosine distance: 1.11022e-16
 - Canberra distance: 0
 - Jaccard distance: 0

Distance between spectra 1 and 2 (shared k-mers):
 - Manhattan distance: 0
 - Euclidean distance: 0
 - Cosine distance: 1.11022e-16
 - Canberra distance: 0
 - Jaccard distance: 0

Creating plot(s) ... done.  Time taken: 1.3s

Analysing peaks for spectra copy number matrix
----------------------------------------------

Analysing distributions for: kat-comp-main.mx ... 
Analysing full spectra
No peaks detected for full spectra.  Can't continue.
done.  Time taken:  0.0s

Main spectra statistics
-----------------------
K-value used: 27
Peaks in analysis: 0
Global minima @ Frequency=2x (1420224)
Global maxima @ Frequency=9x (10974317)
Overall mean k-mer frequency: 0x

No peaks detected

Calculating genome statistics
-----------------------------
No peaks detected, so no genome stats to report
Estimated assembly completeness: Unknown

Creating plots
--------------

No peaks in K-mer frequency histogram.  Not plotting.

KAT COMP completed.
Total runtime: 96.7s

What I do not understand is this part of the summary:

Total K-mers only found in:
 - Hash 1: 0
 - Hash 2: 131916277 <=============================================

Distinct K-mers only found in:
 - Hash 1: 0
 - Hash 2: 129068475  <=============================================
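The inconsistency can be made explicit with a little arithmetic on the numbers reported above. This sketch assumes that the k-mers of a hash split into "shared" and "only found in" with no overlap; that reading is an assumption about how KAT defines these counters, not something taken from its documentation.

```shell
# Values copied from the "Summary statistics" section of the KAT output above.
total_hash2=1464945516
total_shared_hash2=1464945516
distinct_hash2=364458959
distinct_shared=364458959
reported_total_only_hash2=131916277
reported_distinct_only_hash2=129068475

# Assumption: (k-mers in hash 2) = (shared k-mers) + (k-mers only in hash 2).
expected_total_only_hash2=$((total_hash2 - total_shared_hash2))
expected_distinct_only_hash2=$((distinct_hash2 - distinct_shared))

echo "total only in hash 2:    expected $expected_total_only_hash2, reported $reported_total_only_hash2"
echo "distinct only in hash 2: expected $expected_distinct_only_hash2, reported $reported_distinct_only_hash2"
```

Under that assumption both expected values are 0, which also agrees with the distances of 0 between the spectra, so the two non-zero "only found in" values for hash 2 contradict the rest of the same report.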

Thank you,