Open baozg opened 2 months ago
Hi,
can you please provide more information on:
The k-mer abundance histogram does not seem to look as expected for a diploid WGS sample, as there is no peak for heterozygous variants visible here. Are you just analyzing a specific region? Note that PanGenie is designed for whole-genome genotyping.
Thanks for fast response! Sorry for the lacking of background, I will explain more detail.
I am using A.thaliana 1001G reads (https://1001genomes.org/), it's a naturally inbreeding diploid species. So the distribution should like human CHM13 data, only one homozygous peak.
- Which species are you analyzing? Is this a whole genome sequencing data set?
A.thaliana data with 1135 WGS reads. But some of them quite old.
What are the exact command lines used for PanGenie? Due to the inbreeding nature, I convert the haploid vcf from pggb to the diploid count to fit in with
pangenie
singularity exec ~/software/pangenie/pangenie_v3.1.0.sif PanGenie-index -r TAIR10.fa -v PGGB_27genome.vcfbub.diploid.phased.vcf -o TAIR10_PGGB_27genomes singularity exec ~/software/pangenie/pangenie_v3.1.0.sif PanGenie -f TAIR10_PGGB_27genomes -o 6909.pangenie.vcf -s 6909 -i 6909.PE.fasta -j 12 -t 12 -u
From the plot, it seems that there are no k-mers with count 1 in your read data, why is this the case? What exactly is plotted here?
I more think this as the data issue, although it already trimmed by trimmoatic. From the total bases, the 20 coverage are more make sense to me.
Thanks for providing more details!
Just to be sure, the histogram shows a single genome, not all 1135 at once, correct? If yes, then it looks a bit odd to me. There seems to be a lot of k-mers with small counts, and the homozygous peak is very weak. PanGenie looks for the largest peak in the histogram, which normally should be the homozygous peak. However, here, the peak on the very left (at 2) is the biggest one, that's why it is picked instead of the homogous one at 18.
I don't have any experience with A.thaliana data, but from what you shared, it seems that the sequencing error rate is very high (which could explain why the plot looks like this), is this expected? These are Illumina reads, correct?
Thanks for your explanation! Yes, this is one sample historgram. Most of samples' historgram looks normal. All of them are Illumina reads. How would it influence the pangenie performance? For the inbreeding speceis, usually one peak could be detected. Is that will influence the performance? Check the het and homozygous peak as 1:2 may could correct this for some low coverage or high error profile reads.
I think this historgram may come from specific reads and I will invesitgate more on the raw data.
Hi, @eblerjana
I found that several samples's peak is wrong according the pangenie log. So would you mind teliing what's the influence of these peak? Is there any way to set peak?
Here is a example. The reason could be the data is from ~2010.