KAT comp - issue with big genome

TGAC / KAT

The K-mer Analysis Toolkit (KAT) contains a number of tools that analyse and compare K-mer spectra.

http://www.earlham.ac.uk/kat-tools

GNU General Public License v3.0


KAT comp - issue with big genome #165

Open matryoskina opened 2 years ago

matryoskina commented 2 years ago

Hi, I am trying to calculate the k-mer profile of a 5.0 Gb genome. Here's the command:

kat comp -t 32 -m 17 -o genome1VSgenome2 -h 'fastq1_R1.fastq.gz fastq1_R.fastq.gz fastq2_R1.fastq.gz fastq2_R.fastq.gz fastq3_R1.fastq.gz fastq3_R3.fastq.gz' genome1.fa genome2.fa

The problem is that the genome statistics are not correct: the final genome size estimate ends up being 0.90 Mb, and the plot looks wrong (no peak detected). I tried different k-mer values (17, 21, 51), and I tried setting -H and -I to 1000000000, but nothing changed. Do you have any suggestions? I attach the log file. Thanks! slurm-6387284.txt

jonwright99 commented 2 years ago

Hi, I think you have a problem with your command line. You should have the reads as the first parameter, then the genome as the second. You are including a third, which makes comp behave very differently. From the log file, it looks like you are passing one assembly as the first parameter, another assembly as the second, and the reads as the third, which will give odd results.

matryoskina commented 2 years ago

Hi, thanks for your help! I reran the analysis with only the fastq files and one genome, but the problem is still there: no peak was found. Shall I increase the k-mer size? Or is there something else I am missing? I am attaching the new log file. Thanks! slurm-6522577.txt

jonwright99 commented 2 years ago

Is there a plot created? If so, can you post it?

Also, can you rerun without using -h? If you set -H, you will speed up the run, as it won't need to double the hash size many times to find the correct size. I use -H100000000000.

So your command line above should read: kat comp -t 32 -m 17 -H100000000000 -o genome1VSgenome2 'fastq1_R1.fastq.gz fastq1_R.fastq.gz fastq2_R1.fastq.gz fastq2_R.fastq.gz fastq3_R1.fastq.gz fastq3_R3.fastq.gz' genome1.fa
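The argument order described above (all read files as the first input, the assembly as the second) can be sketched as a small Python helper that builds the argument list. This is only an illustration: the function name `build_kat_comp_cmd` is hypothetical, and it uses only the options that already appear in this thread (`-t`, `-m`, `-H`, `-o`). The point it shows is that the read files are joined into one space-separated argument, which is what the single quotes do on the command line.

```python
import shlex


def build_kat_comp_cmd(reads, assembly, out_prefix,
                       threads=32, mer_len=17, hash_size=100_000_000_000):
    """Build an argv list for `kat comp`: reads first, assembly second.

    Hypothetical helper; only options quoted in this thread are used.
    """
    if not reads:
        raise ValueError("at least one read file is required")
    return [
        "kat", "comp",
        "-t", str(threads),
        "-m", str(mer_len),
        f"-H{hash_size}",
        "-o", out_prefix,
        " ".join(reads),  # all read files form ONE argument (input 1)
        assembly,         # the assembly is input 2; no third input
    ]


# File names copied from the command in this thread.
reads = [
    "fastq1_R1.fastq.gz", "fastq1_R.fastq.gz",
    "fastq2_R1.fastq.gz", "fastq2_R.fastq.gz",
    "fastq3_R1.fastq.gz", "fastq3_R3.fastq.gz",
]
cmd = build_kat_comp_cmd(reads, "genome1.fa", "genome1VSgenome2")

# shlex.join re-quotes the read group so it can be pasted into a shell.
print(shlex.join(cmd))
```

The list could be passed to `subprocess.run(cmd)` directly, in which case no shell quoting is needed because the read group is already a single list element.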

matryoskina commented 2 years ago

There is no plot created from this job. I have one created from a previous run: [attached image: osph0 7 plot]

jonwright99 commented 2 years ago

There's something very odd with your reads here. Are they paired-end reads? Also, were all the fastq files you included in the analysis the ones used to generate the assembly? I've seen this type of plot with no peak where the libraries either were not paired-end reads or had multiple rounds of PCR before sequencing.

matryoskina commented 2 years ago

Yes, the reads are all paired-end. Regarding the assembly, the genome was assembled with long reads, and those short reads were used for misassembly correction. Then I used a Hi-C library (Illumina paired-end) to get the chromosomes. Do you think I should use this library instead? Also, could I just compare two genomes without Illumina reads? Thanks

jonwright99 commented 2 years ago

Ah, that makes sense now. Do you know roughly the coverage of the paired-end reads that you used for misassembly correction? I'm guessing it's quite low and not enough to generate a peak on the plot. KAT is designed to compare an Illumina read dataset to an assembly generated from that dataset, to show how the k-mer content of the reads is represented in the assembly. Because your datasets have been used differently to generate the assembly, the plots are not working as intended.