Open mmokrejs opened 6 years ago
Hi @mmokrejs, Looks like something went wrong with your command. Was the $ symbol in front of pe2 intentional? It seems to be picking up a third dataset from somewhere called: tt_16D1C3L12__abyss_128-scaffolds.fa
. Possibly you meant to put in a command like this:
kat comp -t $threads -o pe_vs_assembly 'pe1.fasta pe2.fasta' tt_16D1C3L12__abyss_128-scaffolds.fa
So I've just run this command on your .mx file:
kat plot density -o test1 pe_vs_assembly-main.mx
This produces a file called test1.png:
So the data looks like your two datasets are of the same sample and have a 27-mer coverage peak of ~45X. There's also no significant bias which is probably to be expected as if I understand the file naming correctly this is just the R1 and R2 of the same dataset. It also looks like the genome only has a small amount of heterozygosity as well. Was this what you were expecting for the dataset?
Was the $ symbol in front of pe2 intentional?
Ah, sorry, I was hand-editing the github post to make file paths shorter and you are right, the $
should have not been there. There was a full file path instead.
kat comp -t $threads -o pe_vs_assembly 'pe1.fasta pe2.fasta' tt_16D1C3L12__abyss_128-scaffolds.fa
If I was supposed to group the two different PE library pre files into one group and compare it to the assembly, then you are right.
If I can provide two different PE library files (interleaved fwd and rev reads in each) versus assembly, then I would omit the groupping.
What I expected? Truly said, I tried to follow docs and see what happens.
I understand the file naming correctly this is just the R1 and R2 of the same dataset
No, those are two different library preps and sequencing (see the flowcell-names and lane numbers in the filenames).
It also looks like the genome only has a small amount of heterozygosity as well.
Where is the genome represented in the plot? I see only PE1 vs. PE2 being plot. Is in the the scalebar on the right side of the plot?
The coverage at k=27
seems somewhat low. Compare with GenomeScope results (actually the k-mer counting is from ntCard) for various k-mer sizes, notable the k=32
is closest to k=27
:
tt_16D1C3L12PE-onlyntCard_k32.histo http://genomescope.org/analysis.php?code=85k6B5aH7tu2vRt7JNMp
tt_16D1C3L12PE-onlyntCard_k64.histo http://genomescope.org/analysis.php?code=BfNXYUjshZ8zWFk8JUf2
tt_16D1C3L12PE-onlyntCard_k96.histo http://genomescope.org/analysis.php?code=pJuy0fKxb5nXns6ynp4M
tt_16D1C3L12PE-onlyntCard_k112.histo http://genomescope.org/analysis.php?code=uIKzl6JvNnoeMPnRQlU6
tt_16D1C3L12PE-onlyntCard_k128.histo http://genomescope.org/analysis.php?code=HBnsyGEI1MNQDu1sB3fW
tt_16D1C3L12PE-onlyntCard_k144.histo http://genomescope.org/analysis.php?code=2ZkJ2eyTW83ZdmDl98mk
tt_16D1C3L12PE-onlyntCard_k156.histo http://genomescope.org/analysis.php?code=jXyExLoKD4kb5whTfgnO
Hi @mmokrejs, Yes, so in the plot it is treating the coverage of each file separately, so this is exactly half of the genomescope plots, which you show. The y axis in the genome scope plots correlates with the z axis (colour) in KAT.
I think this is correct based on what you asked KAT to do in this instance: compare dataset1 against dataset2 (rather than show the histogram of dataset1 + dataset2 combined). If you want a direct comparison you can change the -m
option in KAT to 32 and run kat hist
using both files as input.
And how do I get PE1+PE2 on X-axis and assembly-based k-mers on Y-axis? ... to see what is missing or what is redundant in the assembly?
So yes, if you run something like this:
kat comp -m32 -t $threads -o pe_vs_assembly 'pe1.fasta pe2.fasta' tt_16D1C3L12__abyss_128-scaffolds.fa
Then you will get a png file generated that shows a stacked histogram that will have a profile similar (hopefully identical) to what you see with genomescope, but will be divided according to the frequency at which those k-mers are found in the reference.
This will also generate a matrix file, so if you're not happy with the auto generated ranges and labels, you can modify using a command something like this: kat plot spectra-cn -t <newtitle> -a <new x label> -b <new y label> -x <new x max range> -y <new y max range> -o newoutput pe_vs_assembly.mx
kat comp -m32 -t $threads -o pe_vs_assembly 'pe1.fasta pe2.fasta' tt_16D1C3L12__abyss_128-scaffolds.fa
is yet another example hitting deemed Out-of-memory issue, like https://github.com/TGAC/KAT/issues/100
So you could try using only one of your read files rather than both together. This will reduce memory requirements somewhat. The other option is to reduce your kmer size. If you drop to something like -m23
this will help a lot, at the expense of reducing the uniqueness of each kmer.
Hi @mmokrejs, is there anything outstanding on this ticket that needs addressing?
Could you capture the ValueError
and raise the same type of error with a more meaningful message to the user? Or introduce some data checks before calling numpy altogether?
Hello,
I got the same issue using kat plot spectra-mx -i ${name}-main.mx
. I wanted to check the difference before and after filtering an illumina library. Of course, the second library (filtered) has 0 kmer exclusive to it and I believe the error comes from this. Is there a way to obtain the plot anyway with an absent peak for instance?
The kat comp output:
cat *.stats
K-mer statistics for:
- Hash 1: "/0_READS/all_illumina_R1.GC_genoscope.fastq.gz"
- Hash 2: "/2_QF/il_R1_length_quality_filtered.fastq.gz"
Total K-mers in:
- Hash 1: 37124057731
- Hash 2: 36092911747
Distinct K-mers in:
- Hash 1: 4290345303
- Hash 2: 3545643871
Total K-mers only found in:
- Hash 1: 750285431
- Hash 2: 0
Distinct K-mers only found in:
- Hash 1: 744701432
- Hash 2: 0
Shared K-mers:
- Total shared found in hash 1: 36373772300
- Total shared found in hash 2: 36092911747
- Distinct shared K-mers: 3545643871
Distance between spectra 1 and 2 (all k-mers):
- Manhattan distance: 7.50177e+08
- Euclidean distance: 7.13742e+08
- Cosine distance: 2.58283e-05
- Canberra distance: 22.2294
- Jaccard distance: 0.174741
Distance between spectra 1 and 2 (shared k-mers):
- Manhattan distance: 6.2524e+07
- Euclidean distance: 3.53003e+07
- Cosine distance: 2.73752e-05
- Canberra distance: 22.0042
- Jaccard distance: 0.0174799
And the kat plot output:
kat plot spectra-mx -i all_v_filt-main.mx
Kmer Analysis Toolkit (KAT) V2.4.1
Traceback (most recent call last):
File "/home/user/anaconda3/envs/kat/lib/python3.6/local/kat/plot/spectra_mx.py", line 223, in <module>
main()
File "/home/user/anaconda3/envs/kat/lib/python3.6/local/kat/plot/spectra_mx.py", line 169, in main
ymax[i] = np.max(peaky) * 1.1
File "/home/user/anaconda3/envs/kat/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 2505, in amax
initial=initial)
File "/home/user/anaconda3/envs/kat/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 86, in _wrapreduction
return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation maximum which has no identity
../lib/include/kat/pyhelper.hpp(159): Throw in function void kat::PyHelper::execute(std::string, int, char**)
Dynamic exception type: boost::exception_detail::clone_impl<kat::KatPythonException>
std::exception::what: std::exception
[kat::KatPythonError*] = Unexpected python error
Thank you. Cheers,
AH
Hi, I also had the same kat plot spectra-mx
error as described by @AntoineHo using KAT V2.4.2
.
I had the same error, what has worked for me is setting the options X_MIN, Y_MIN, X_MAX and Y_MAX: kat plot spectra-mx -i matrix.mx -r 1 -s 1 -x 400 -y 200000 -o test
, also using KAT 2.4.2
In https://github.com/TGAC/KAT/issues/94#issuecomment-385328131 I mentioned an issue with directly calling:
pe_vs_assembly-main.mx.zip
They were created by: