TGAC / KAT

The K-mer Analysis Toolkit (KAT) contains a number of tools that analyse and compare K-mer spectra.
http://www.earlham.ac.uk/kat-tools
GNU General Public License v3.0
206 stars 52 forks source link

kat_plot_spectra_mx: ValueError: zero-size array to reduction operation maximum which has no identity #97

Open mmokrejs opened 6 years ago

mmokrejs commented 6 years ago

In https://github.com/TGAC/KAT/issues/94#issuecomment-385328131 I mentioned an issue with directly calling:

$ for f in pe_vs_assembly-main.mx pe_vs_assembly-ends.mx pe_vs_assembly-mixed.mx pe_vs_assembly-middle.mx; do
  p=`basename $f .mx`;
  kat_plot_spectra_mx -i -o $p $f;
done
Traceback (most recent call last):
  File "/apps/gentoo/usr/lib/python-exec/python3.5/kat_plot_spectra_mx", line 11, in <module>
    load_entry_point('kat==2.4.1', 'console_scripts', 'kat_plot_spectra_mx')()
  File "/apps/gentoo/usr/lib64/python3.5/site-packages/kat/plot/spectra_mx.py", line 169, in main
    ymax[i] = np.max(peaky) * 1.1
  File "/apps/gentoo/usr/lib64/python3.5/site-packages/numpy/core/fromnumeric.py", line 2272, in amax
    out=out, **kwargs)
  File "/apps/gentoo/usr/lib64/python3.5/site-packages/numpy/core/_methods.py", line 26, in _amax
    return umr_maximum(a, axis, None, out, keepdims)
ValueError: zero-size array to reduction operation maximum which has no identity
Traceback (most recent call last):
  File "/apps/gentoo/usr/lib/python-exec/python3.5/kat_plot_spectra_mx", line 11, in <module>
    load_entry_point('kat==2.4.1', 'console_scripts', 'kat_plot_spectra_mx')()
  File "/apps/gentoo/usr/lib64/python3.5/site-packages/kat/plot/spectra_mx.py", line 169, in main
    ymax[i] = np.max(peaky) * 1.1
  File "/apps/gentoo/usr/lib64/python3.5/site-packages/numpy/core/fromnumeric.py", line 2272, in amax
    out=out, **kwargs)
  File "/apps/gentoo/usr/lib64/python3.5/site-packages/numpy/core/_methods.py", line 26, in _amax
    return umr_maximum(a, axis, None, out, keepdims)
ValueError: zero-size array to reduction operation maximum which has no identity
$ ls -latr
-rw-rw----  1 mmokrejs mmokrejs      175989 Apr 30 00:49 pe_vs_assembly-main.png
-rw-rw----  1 mmokrejs mmokrejs      149201 Apr 30 00:49 pe_vs_assembly-ends.png
$

pe_vs_assembly-main.mx.zip

They were created by:

$ kat comp -t $threads -o pe_vs_assembly pe1.fasta $pe2.fasta

K-mer statistics for:
 - Hash 1: "pe1.fasta"
 - Hash 2: "pe2.fasta"
 - Hash 3: "tt_16D1C3L12__abyss_128-scaffolds.fa"

Total K-mers in:
 - Hash 1: 70216479930
 - Hash 2: 78178155508
 - Hash 3: 1397145511

Distinct K-mers in:
 - Hash 1: 4231587847
 - Hash 2: 6193980310
 - Hash 3: 726865582

Total K-mers only found in:
 - Hash 1: 3200143422
 - Hash 2: 5266346414

Distinct K-mers only found in:
 - Hash 1: 3113038375
 - Hash 2: 5075430838

Shared K-mers:
 - Total shared found in hash 1: 67016336508
 - Total shared found in hash 2: 72911809094
 - Distinct shared K-mers: 1118549472

Distance between spectra 1 and 2 (all k-mers):
 - Manhattan distance: 2.14255e+09
 - Euclidean distance: 1.86879e+09
 - Cosine distance: 0.000130564
 - Canberra distance: 56.2066
 - Jaccard distance: 0.34095

Distance between spectra 1 and 2 (shared k-mers):
 - Manhattan distance: 2.27208e+08
 - Euclidean distance: 3.74309e+07
 - Cosine distance: 0.00745289
 - Canberra distance: 55.5518
 - Jaccard distance: 0.184399

KAT COMP completed.
Total runtime: 15436.3s
$ ls -latr
-rw-rw----  1 mmokrejs mmokrejs     2369879 Apr 29 04:01 pe_vs_assembly-main.mx
-rw-rw----  1 mmokrejs mmokrejs     2004429 Apr 29 04:01 pe_vs_assembly-ends.mx
-rw-rw----  1 mmokrejs mmokrejs        1317 Apr 29 04:01 pe_vs_assembly.stats
-rw-rw----  1 mmokrejs mmokrejs     2055920 Apr 29 04:01 pe_vs_assembly-mixed.mx
-rw-rw----  1 mmokrejs mmokrejs     2005959 Apr 29 04:01 pe_vs_assembly-middle.mx
maplesond commented 6 years ago

Hi @mmokrejs, Looks like something went wrong with your command. Was the $ symbol in front of pe2 intentional? It seems to be picking up a third dataset from somewhere called: tt_16D1C3L12__abyss_128-scaffolds.fa. Possibly you meant to put in a command like this:

kat comp -t $threads -o pe_vs_assembly 'pe1.fasta pe2.fasta' tt_16D1C3L12__abyss_128-scaffolds.fa

maplesond commented 6 years ago

So I've just run this command on your .mx file:

kat plot density -o test1 pe_vs_assembly-main.mx

This produces a file called test1.png: test1

So the data looks like your two datasets are of the same sample and have a 27-mer coverage peak of ~45X. There's also no significant bias which is probably to be expected as if I understand the file naming correctly this is just the R1 and R2 of the same dataset. It also looks like the genome only has a small amount of heterozygosity as well. Was this what you were expecting for the dataset?

mmokrejs commented 6 years ago

Was the $ symbol in front of pe2 intentional?

Ah, sorry, I was hand-editing the github post to make file paths shorter and you are right, the $ should have not been there. There was a full file path instead.

mmokrejs commented 6 years ago

kat comp -t $threads -o pe_vs_assembly 'pe1.fasta pe2.fasta' tt_16D1C3L12__abyss_128-scaffolds.fa

If I was supposed to group the two different PE library pre files into one group and compare it to the assembly, then you are right.

If I can provide two different PE library files (interleaved fwd and rev reads in each) versus assembly, then I would omit the groupping.

What I expected? Truly said, I tried to follow docs and see what happens.

mmokrejs commented 6 years ago

I understand the file naming correctly this is just the R1 and R2 of the same dataset

No, those are two different library preps and sequencing (see the flowcell-names and lane numbers in the filenames).

mmokrejs commented 6 years ago

It also looks like the genome only has a small amount of heterozygosity as well.

Where is the genome represented in the plot? I see only PE1 vs. PE2 being plot. Is in the the scalebar on the right side of the plot?

The coverage at k=27 seems somewhat low. Compare with GenomeScope results (actually the k-mer counting is from ntCard) for various k-mer sizes, notable the k=32 is closest to k=27:

tt_16D1C3L12PE-onlyntCard_k32.histo http://genomescope.org/analysis.php?code=85k6B5aH7tu2vRt7JNMp

tt_16D1C3L12PE-onlyntCard_k64.histo http://genomescope.org/analysis.php?code=BfNXYUjshZ8zWFk8JUf2

tt_16D1C3L12PE-onlyntCard_k96.histo http://genomescope.org/analysis.php?code=pJuy0fKxb5nXns6ynp4M

tt_16D1C3L12PE-onlyntCard_k112.histo http://genomescope.org/analysis.php?code=uIKzl6JvNnoeMPnRQlU6

tt_16D1C3L12PE-onlyntCard_k128.histo http://genomescope.org/analysis.php?code=HBnsyGEI1MNQDu1sB3fW

tt_16D1C3L12PE-onlyntCard_k144.histo http://genomescope.org/analysis.php?code=2ZkJ2eyTW83ZdmDl98mk

tt_16D1C3L12PE-onlyntCard_k156.histo http://genomescope.org/analysis.php?code=jXyExLoKD4kb5whTfgnO

maplesond commented 6 years ago

Hi @mmokrejs, Yes, so in the plot it is treating the coverage of each file separately, so this is exactly half of the genomescope plots, which you show. The y axis in the genome scope plots correlates with the z axis (colour) in KAT.

I think this is correct based on what you asked KAT to do in this instance: compare dataset1 against dataset2 (rather than show the histogram of dataset1 + dataset2 combined). If you want a direct comparison you can change the -m option in KAT to 32 and run kat hist using both files as input.

mmokrejs commented 6 years ago

And how do I get PE1+PE2 on X-axis and assembly-based k-mers on Y-axis? ... to see what is missing or what is redundant in the assembly?

maplesond commented 6 years ago

So yes, if you run something like this:

kat comp -m32 -t $threads -o pe_vs_assembly 'pe1.fasta pe2.fasta' tt_16D1C3L12__abyss_128-scaffolds.fa

Then you will get a png file generated that shows a stacked histogram that will have a profile similar (hopefully identical) to what you see with genomescope, but will be divided according to the frequency at which those k-mers are found in the reference.

This will also generate a matrix file, so if you're not happy with the auto generated ranges and labels, you can modify using a command something like this: kat plot spectra-cn -t <newtitle> -a <new x label> -b <new y label> -x <new x max range> -y <new y max range> -o newoutput pe_vs_assembly.mx

mmokrejs commented 6 years ago

kat comp -m32 -t $threads -o pe_vs_assembly 'pe1.fasta pe2.fasta' tt_16D1C3L12__abyss_128-scaffolds.fa is yet another example hitting deemed Out-of-memory issue, like https://github.com/TGAC/KAT/issues/100

maplesond commented 6 years ago

So you could try using only one of your read files rather than both together. This will reduce memory requirements somewhat. The other option is to reduce your kmer size. If you drop to something like -m23 this will help a lot, at the expense of reducing the uniqueness of each kmer.

maplesond commented 6 years ago

Hi @mmokrejs, is there anything outstanding on this ticket that needs addressing?

mmokrejs commented 6 years ago

Could you capture the ValueError and raise the same type of error with a more meaningful message to the user? Or introduce some data checks before calling numpy altogether?

AntoineHo commented 4 years ago

Hello,

I got the same issue using kat plot spectra-mx -i ${name}-main.mx. I wanted to check the difference before and after filtering an illumina library. Of course, the second library (filtered) has 0 kmer exclusive to it and I believe the error comes from this. Is there a way to obtain the plot anyway with an absent peak for instance?

The kat comp output:

cat *.stats
K-mer statistics for:
 - Hash 1: "/0_READS/all_illumina_R1.GC_genoscope.fastq.gz"
 - Hash 2: "/2_QF/il_R1_length_quality_filtered.fastq.gz"

Total K-mers in:
 - Hash 1: 37124057731
 - Hash 2: 36092911747
Distinct K-mers in:
 - Hash 1: 4290345303
 - Hash 2: 3545643871

Total K-mers only found in:
 - Hash 1: 750285431
 - Hash 2: 0

Distinct K-mers only found in:
 - Hash 1: 744701432
 - Hash 2: 0

Shared K-mers:
 - Total shared found in hash 1: 36373772300
 - Total shared found in hash 2: 36092911747
 - Distinct shared K-mers: 3545643871

Distance between spectra 1 and 2 (all k-mers):
 - Manhattan distance: 7.50177e+08
 - Euclidean distance: 7.13742e+08
 - Cosine distance: 2.58283e-05
 - Canberra distance: 22.2294
 - Jaccard distance: 0.174741

Distance between spectra 1 and 2 (shared k-mers):
 - Manhattan distance: 6.2524e+07
 - Euclidean distance: 3.53003e+07
 - Cosine distance: 2.73752e-05
 - Canberra distance: 22.0042
 - Jaccard distance: 0.0174799

And the kat plot output:

kat plot spectra-mx -i all_v_filt-main.mx
Kmer Analysis Toolkit (KAT) V2.4.1
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/kat/lib/python3.6/local/kat/plot/spectra_mx.py", line 223, in <module>
    main()
  File "/home/user/anaconda3/envs/kat/lib/python3.6/local/kat/plot/spectra_mx.py", line 169, in main                                                                    
    ymax[i] = np.max(peaky) * 1.1
  File "/home/user/anaconda3/envs/kat/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 2505, in amax
    initial=initial)
  File "/home/user/anaconda3/envs/kat/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 86, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation maximum which has no identity
../lib/include/kat/pyhelper.hpp(159): Throw in function void kat::PyHelper::execute(std::string, int, char**)
Dynamic exception type: boost::exception_detail::clone_impl<kat::KatPythonException>
std::exception::what: std::exception
[kat::KatPythonError*] = Unexpected python error

Thank you. Cheers,

AH

bruvellu commented 4 years ago

Hi, I also had the same kat plot spectra-mx error as described by @AntoineHo using KAT V2.4.2.

setina42 commented 2 years ago

I had the same error, what has worked for me is setting the options X_MIN, Y_MIN, X_MAX and Y_MAX: kat plot spectra-mx -i matrix.mx -r 1 -s 1 -x 400 -y 200000 -o test , also using KAT 2.4.2