TGAC / KAT

The K-mer Analysis Toolkit (KAT) contains a number of tools that analyse and compare K-mer spectra.
http://www.earlham.ac.uk/kat-tools
GNU General Public License v3.0

Help with interpreting spectra-cn #177

Closed Paul-Donat closed 1 year ago

Paul-Donat commented 1 year ago

I compared paired-end short reads to a final genome assembly. The assembly is a nanopore-assembled genome that used these same paired-end short reads in the polishing step. Haplotigs were also purged from the assembly. The assembly has a high duplication rate based on BUSCO and on duplication estimates for closely related species.

Below is my spectra-cn plot. In both kat hist and kat comp, the output analysis calls the peak at 23x the homozygous peak. However, based on everything I've read, this peak should be the heterozygous peak. The genome size estimate using the 23x peak as the homozygous peak is within 0.4 Gb of the assembly size and within 0.1 Gb of the published genome size.

[Image: spectra-cn plot]

When I force ./distanalysis.py to call the peak at 46x the homozygous peak, the resulting genome size estimate plummets to 1/3 of the assembly size, well below the previously estimated genome size.
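For context, here is a rough sketch of why the choice of homozygous peak moves the size estimate so much. It uses the common back-of-the-envelope formula (total k-mer count divided by homozygous peak coverage), which is an assumption on my part and not necessarily what distanalysis.py does internally. KAT fits distributions to the peaks, so its numbers need not follow this exact ratio. The function name and the toy histogram are made up for illustration.

```python
def estimate_genome_size(hist, homozygous_peak, min_cov=1):
    """Naive genome size estimate from a k-mer histogram.

    hist maps k-mer multiplicity (coverage) to the number of distinct
    k-mers seen at that multiplicity. min_cov can be raised to drop the
    low-coverage error tail. This is NOT KAT's algorithm, only the
    simple total-k-mers / peak-coverage approximation.
    """
    total_kmers = sum(cov * n for cov, n in hist.items() if cov >= min_cov)
    return total_kmers / homozygous_peak


# Toy spectrum with invented counts: one peak at 23x, one at 46x.
toy_hist = {23: 1000, 46: 500}

size_if_23x = estimate_genome_size(toy_hist, homozygous_peak=23)
size_if_46x = estimate_genome_size(toy_hist, homozygous_peak=46)

print(size_if_23x, size_if_46x)
```

Under this simplified model, doubling the coverage assigned to the homozygous peak halves the estimate, so the peak assignment alone can account for a large shift in estimated size.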

Any help or guidance would be appreciated.

Thanks, Paul D.