TGAC / KAT

The K-mer Analysis Toolkit (KAT) contains a number of tools that analyse and compare K-mer spectra.
http://www.earlham.ac.uk/kat-tools
GNU General Public License v3.0

no genome size estimation result #185

Open AlcaArctica opened 8 months ago

AlcaArctica commented 8 months ago

Hi, I am trying out KAT, but I do not get a genome size estimate. I am running:

kat hist -m 21 -h 200 -t 38 -v -p png -o wGorVio_kat wGorVio_reads.jf

Here is the log file output:


Kmer Analysis Toolkit (KAT) V2.4.2

Running KAT in HIST mode
------------------------

Loading hashes into memory... done.  Time taken: 356.4s

Bining kmers ... done.  Time taken: 3.4s

Merging counts ... done.  Time taken: 0.0s

Saving results to disk ... done.  Time taken: 0.0s

Creating plot ...
Plotting histograms for: 1
201 element histogram file loaded.
Axis limits:
xmax: 201
ymax: 5188216.0
 done.  Time taken: 2.9s

Analysing peaks
---------------

Analysing distributions for: /lustre/projects/dazzlerAssembly/asm_wGorVio/hifi/qc/reads/kat/wGorVio_kat
Input file generated using K 21
Kmer coverage histogram file detected
Analysing spectra

Creating initial peaks ... done. 1 peaks initially created

  Index    Left    Mean    Right    StdDev      Max    Volume  Description
-------  ------  ------  -------  --------  -------  --------  -------------
      1      80     100      120        10  1421407         0  1/2X

Locally optimising each peak ... done.

  Index    Left    Mean    Right    StdDev      Max     Volume  Description
-------  ------  ------  -------  --------  -------  ---------  -------------
      1   24.02      99   173.98     37.49  1421406  132588455  1/2X

Fitting cumulative distribution to histogram by adjusting peaks ... done.

  Index    Left    Mean    Right    StdDev      Max     Volume  Description
-------  ------  ------  -------  --------  -------  ---------  -------------
      1   10.78      98   185.22     43.61  1421406  152072606  1/2X

Time taken:  0.2s

K-mer frequency spectra statistics
----------------------------------
K-value used: 21
Peaks in analysis: 1
Global minima @ Frequency=12x (490472)
Global maxima @ Frequency=200x (6033723)
Overall mean k-mer frequency: 98x

  Index    Left    Mean    Right    StdDev      Max     Volume  Description
-------  ------  ------  -------  --------  -------  ---------  -------------
      1   10.78      98   185.22     43.61  1421406  152072606  1/2X

Calculating genome statistics
-----------------------------
Assuming that homozygous peak is the largest in the spectra with frequency of: 98x
Homozygous peak index: 0
CAUTION: the following estimates are based on having a clean spectra and having identified the correct homozygous peak!
Estimated genome size: 0.00 Mbp

Creating plots
--------------

Plotting K-mer frequency distributions ... done.  Saved to: None

KAT HIST completed.
Total runtime: 364.3s

My results are:

{
    "k": 21,
    "nb_peaks": 1,
    "global_minima": {
        "freq": 12,
        "count": 490472
    },
    "global_maxima": {
        "freq": 200,
        "count": 6033723
    },
    "mean_freq": 98,
    "peaks": [
        {
            "mean_freq": 98.00000000000003,
            "stddev": 43.61217428090905,
            "count": 1421406,
            "volume": 152072606
        }
    ],
    "hom_peak": {
        "freq": 98,
        "index": 0
    },
    "est_genome_size": 0,
    "est_het_rate": 0.0
}

Why are the estimated genome size and the estimated het rate zero? I thought the histogram looked fine. [image: kat_hist_reads]
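One way to spot a result like this automatically is to check the stats for two symptoms visible in the output above: a zero genome size estimate, and a global maximum that sits on the last bin of the histogram (200x here, with the plot reporting 201 elements). The following is only a sketch. The function name `flag_suspicious_kat_stats` and the `hist_upper_bound` parameter are hypothetical and not part of KAT, and the dictionary simply copies a few fields from the JSON posted above.

```python
def flag_suspicious_kat_stats(stats, hist_upper_bound):
    """Return warnings for a KAT-style stats dict (hypothetical helper, not a KAT API)."""
    warnings = []

    # Symptom 1: the genome size estimate came back as zero.
    if stats.get("est_genome_size", 0) == 0:
        warnings.append("estimated genome size is zero")

    # Symptom 2: the global maximum is on the last histogram bin,
    # which hints that the upper bound is cutting the spectrum short.
    gmax = stats.get("global_maxima", {}).get("freq")
    if gmax is not None and gmax >= hist_upper_bound:
        warnings.append(
            f"global maximum is at the histogram upper bound ({gmax}x)"
        )

    return warnings


# Fields copied from the JSON in this thread.
reported = {
    "global_maxima": {"freq": 200, "count": 6033723},
    "hom_peak": {"freq": 98, "index": 0},
    "est_genome_size": 0,
    "est_het_rate": 0.0,
}

found = flag_suspicious_kat_stats(reported, hist_upper_bound=200)
for message in found:
    print(message)
```

Both checks fire for the values reported here, which suggests looking at the histogram upper bound before trusting the estimate.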

AlcaArctica commented 8 months ago

Alright, I figured out that it is my setting of the -h parameter that breaks the calculation of the genome size and heterozygosity. When I leave this parameter out, both are calculated without a hitch (although the graph is prettier with it ;)

I guess that also answers my question here: https://github.com/TGAC/KAT/issues/182
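A toy illustration of why a low upper bound can distort the spectrum, under an assumption this thread does not confirm: that all k-mers with a frequency above the bound are folded into the last bin. That would be consistent with the log above, where the global maximum is reported at 200x with a count well above the 98x peak. The helpers `cap_histogram` and `global_max_freq` and the toy counts are made up for illustration and are not KAT code.

```python
def cap_histogram(counts, high):
    """Fold every bin above `high` into bin `high` (assumed capping behaviour).

    counts[f] is the number of distinct k-mers seen f times.
    """
    if high >= len(counts):
        return list(counts)
    capped = list(counts[: high + 1])
    capped[high] += sum(counts[high + 1:])
    return capped


def global_max_freq(counts, min_freq=1):
    """Frequency with the highest count, ignoring bins below `min_freq`."""
    return max(range(min_freq, len(counts)), key=lambda f: counts[f])


# Toy spectrum: error k-mers at 1x, a real peak at 5x, and a long flat tail.
toy = [0, 50, 10, 5, 20, 60, 20, 5] + [6] * 13

full_peak = global_max_freq(toy, min_freq=2)
capped = cap_histogram(toy, high=10)
capped_peak = global_max_freq(capped, min_freq=2)

print("peak without cap:", full_peak)
print("peak with cap at 10:", capped_peak)
```

In this toy case the tail piled into the last bin outweighs the real peak, so the global maximum moves from the peak to the cap. Whether this is exactly what trips up the genome size estimate in KAT is not established here.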