eblerjana / pangenie

Pangenome-based genome inference
MIT License
114 stars 10 forks source link

Wrong kmer abundance peak #86

Open baozg opened 2 months ago

baozg commented 2 months ago

Hi, @eblerjana

I found that several samples's peak is wrong according the pangenie log. So would you mind teliing what's the influence of these peak? Is there any way to set peak?

Here is a example. The reason could be the data is from ~2010.

image
  42 Read 796654 variants from provided UniqueKmersMap archive.
     43 Count kmers in reads ...
     44 Histogram peaks: 2 (4067603), 18 (2206323)
     45 Computed kmer abundance peak: 2
     46 Determine read k-mer counts for unique kmers ...
     47 Warning: using 5 for determining unique kmers.
     48 Sampled 1 subset(s) of paths each of size 55 for genotyping.
     49 Construct HMM and run core algorithm ...
     50 Warning: using 5 for genotyping.
     51 Write results to VCF ...
eblerjana commented 2 months ago

Hi,

can you please provide more information on:

  1. Which species are you analyzing? Is this a whole genome sequencing data set?
  2. What are the exact command lines used for PanGenie?
  3. From the plot, it seems that there are no k-mers with count 1 in your read data, why is this the case? What exactly is plotted here?

The k-mer abundance histogram does not seem to look as expected for a diploid WGS sample, as there is no peak for heterozygous variants visible here. Are you just analyzing a specific region? Note that PanGenie is designed for whole-genome genotyping.

baozg commented 2 months ago

Thanks for fast response! Sorry for the lacking of background, I will explain more detail.

I am using A.thaliana 1001G reads (https://1001genomes.org/), it's a naturally inbreeding diploid species. So the distribution should like human CHM13 data, only one homozygous peak.

  1. Which species are you analyzing? Is this a whole genome sequencing data set?

A.thaliana data with 1135 WGS reads. But some of them quite old.

  1. What are the exact command lines used for PanGenie? Due to the inbreeding nature, I convert the haploid vcf from pggb to the diploid count to fit in with pangenie

    singularity exec ~/software/pangenie/pangenie_v3.1.0.sif PanGenie-index  -r TAIR10.fa -v PGGB_27genome.vcfbub.diploid.phased.vcf -o TAIR10_PGGB_27genomes
    singularity exec ~/software/pangenie/pangenie_v3.1.0.sif PanGenie -f TAIR10_PGGB_27genomes -o 6909.pangenie.vcf -s 6909 -i 6909.PE.fasta -j 12 -t 12 -u
  2. From the plot, it seems that there are no k-mers with count 1 in your read data, why is this the case? What exactly is plotted here?

I more think this as the data issue, although it already trimmed by trimmoatic. From the total bases, the 20 coverage are more make sense to me.

eblerjana commented 2 months ago

Thanks for providing more details!

Just to be sure, the histogram shows a single genome, not all 1135 at once, correct? If yes, then it looks a bit odd to me. There seems to be a lot of k-mers with small counts, and the homozygous peak is very weak. PanGenie looks for the largest peak in the histogram, which normally should be the homozygous peak. However, here, the peak on the very left (at 2) is the biggest one, that's why it is picked instead of the homogous one at 18.

I don't have any experience with A.thaliana data, but from what you shared, it seems that the sequencing error rate is very high (which could explain why the plot looks like this), is this expected? These are Illumina reads, correct?

baozg commented 2 months ago

Thanks for your explanation! Yes, this is one sample historgram. Most of samples' historgram looks normal. All of them are Illumina reads. How would it influence the pangenie performance? For the inbreeding speceis, usually one peak could be detected. Is that will influence the performance? Check the het and homozygous peak as 1:2 may could correct this for some low coverage or high error profile reads.

I think this historgram may come from specific reads and I will invesitgate more on the raw data.