KamilSJaron / smudgeplot

Inference of ploidy and heterozygosity structure using whole genome sequencing data
Apache License 2.0

Unexpected Polyploidy #92

Closed: jhcaddisfly closed this issue 2 years ago

jhcaddisfly commented 2 years ago

I am having trouble understanding my smudgeplots for two individuals of the same species. I used v0.2.3 and the following commands to generate the plots:

I first extracted genomic kmers using coverage thresholds estimated from the kmer histograms generated by jellyfish, using the built-in smudgeplot.py cutoff command as follows (* stands for ind1 / ind2):
L=$(smudgeplot.py cutoff *_k21.hist L)
U=$(smudgeplot.py cutoff *_k21.hist U)
This gave L = 21 (individual 1) / 23 (individual 2) and U = 690 (individual 1) / 790 (individual 2).
Then I extracted the kmers in the coverage range from L to U using jellyfish dump -c -L 41 -U 2400 *_kmer_counts.jf -o *_jfkmers
I then used smudgeplot.py hetkmers -o * *_jfkmers to compute the set of kmer pairs, and
smudgeplot.py plot -i *_kmer_pairs_coverages.tsv -o * to plot them.
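
For readers unfamiliar with the hetkmers step, here is a rough sketch of what "kmer pairs" means: kmers that differ by exactly one nucleotide, reported together with their coverages. This is a naive, quadratic toy version written for illustration only; it is not smudgeplot's actual implementation, and the function names and toy counts are made up for the example.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of positions at which two equal-length kmers differ."""
    return sum(x != y for x, y in zip(a, b))


def naive_kmer_pairs(kmer_counts):
    """Return (kmer1, kmer2, minor_cov, major_cov) for kmers exactly 1 nt apart.

    kmer_counts: dict mapping kmer -> coverage (hypothetical input, e.g. parsed
    from a two-column kmer/count dump). Compares every kmer with every other
    kmer, so it is only suitable for toy inputs.
    """
    pairs = []
    for (k1, c1), (k2, c2) in combinations(sorted(kmer_counts.items()), 2):
        if len(k1) == len(k2) and hamming_distance(k1, k2) == 1:
            pairs.append((k1, k2, min(c1, c2), max(c1, c2)))
    return pairs


# Toy counts (invented): only ACGTA / ACGTT differ by a single nucleotide.
toy_counts = {"ACGTA": 50, "ACGTT": 48, "GGGCC": 100, "TTTTT": 97}
toy_pairs = naive_kmer_pairs(toy_counts)
```

Each reported pair carries two coverages, which is the information the plotting step works from.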

The resulting plots look like this:

I already have an indication of the genome size from GenomeScope and flow cytometry (651.30 Mbp, but from a different individual of the same species). The GenomeScope results look like this:

http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=b7yzvd5lbmeuMEEx5swf

http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=e68aubTXZEgN9Ix2u9td

[Images: ind2_smudgeplot_log10, ind2_smudgeplot, ind1_smudgeplot_log10, ind1_smudgeplot]

These results do not make sense together with the smudgeplots, because the smudgeplots predict an unexpected ploidy.

How should I understand my smudgeplot?

Thanks, J.

KamilSJaron commented 2 years ago

Hello J.,

I'm not sure what the question is here. I presume that you have a diploid that is predicted to be a tetraploid.

From what I can tell, there is very little heterozygosity in the species (looking at the GenomeScope estimates). Perhaps it's the same problem as in the diploid strawberry case, which also looked tetraploid because the paralogy signal dominated the smudgeplot due to very low heterozygosity: https://github.com/KamilSJaron/smudgeplot/wiki/tutorial-strawberry

jhcaddisfly commented 2 years ago

Hello,

Thank you very much for your quick response! Yes, my question was why smudgeplot predicts my study organism to be tetraploid, since I would expect it to be diploid based on the haploid genome size estimates from FCM and GenomeScope2. To conclude, can I argue that because of the low heterozygosity (0.35-0.39%) the duplication signal is relatively stronger than the heterozygosity signal? So, are the duplications rather recent? Can I say that my organisms have a lot of closely related paralogs (paralogs: AABB 81% / 69% vs. heterozygous loci: AB 19% / 17%), and that this is because smudgeplot picks up two homozygous loci that differ by exactly one nucleotide as AABB?
So, do you think the evidence for tetraploidy according to smudgeplot is rather weak?

Sorry for asking so many questions. I just want to make sure I understand the results correctly. I really like smudgeplot! It is super helpful and I am happy I came across this tool!

Thanks, J.

KamilSJaron commented 2 years ago

> To conclude, can I argue that because of the low heterozygosity (0.35-0.39%) the duplication signal is relatively stronger than the heterozygosity signal?

Yeah, that's a documented problem of very homozygous genomes.

> So, are the duplications rather recent? Can I say that my organisms have a lot of closely related paralogs

We don't really infer the evolutionary history, so it's hard to say whether these are paralogs or other types of *logs, nor do we date them. Although it is probably due to recent paralogs, all we can tell is that there are plenty of close-but-not-identical duplications in the genome. So yes, but you might want to be careful about the phrasing.

> that this is because smudgeplot picks up two homozygous loci that differ by exactly one nucleotide as AABB?

Smudgeplot picks all the kmers that are 1 nt apart, and by projecting them onto an A + B and A / (A + B) plane it determines their copy numbers / relative counts. The interpretation of AABB depends on your genome; what it means is that there are plenty of closely related kmers that are both present in two copies in the genome. If the genome is diploid, they are most likely two homozygous loci that differ by 1 nt, but in theory they could also be 2 heterozygous loci that are perfect duplicates with the same genotype (I guess that's extremely unlikely, but what do I know).
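
To make the projection concrete, here is a minimal sketch under a few assumptions: the haploid (1n) kmer coverage is taken as known, the relative coverage is computed for the less covered kmer of the pair (so it never exceeds 1/2), and the "nearest smudge" rule is an ad hoc distance made up for illustration. It is not how smudgeplot itself assigns smudges, and the function names are hypothetical.

```python
def project_pair(cov_a, cov_b):
    """Return (total coverage, relative coverage of the less covered kmer)."""
    total = cov_a + cov_b
    return total, min(cov_a, cov_b) / total


# Expected (total copies, minor fraction) for a few genotype structures.
EXPECTED = {
    "AB": (2, 1 / 2),
    "AAB": (3, 1 / 3),
    "AAAB": (4, 1 / 4),
    "AABB": (4, 1 / 2),
}


def nearest_smudge(cov_a, cov_b, haploid_cov):
    """Pick the structure whose expected position is closest to the pair.

    haploid_cov is the 1n kmer coverage, assumed to be known in advance.
    The distance used here is an arbitrary illustrative choice.
    """
    total, minor_fraction = project_pair(cov_a, cov_b)
    copies_observed = total / haploid_cov

    def distance(name):
        copies, fraction = EXPECTED[name]
        return abs(copies_observed - copies) / copies + abs(minor_fraction - fraction)

    return min(EXPECTED, key=distance)


# With 1n coverage of 25: two kmers at ~1n each look like AB,
# two kmers at ~2n each look like AABB.
het_pair = nearest_smudge(25, 25, haploid_cov=25)
dup_pair = nearest_smudge(50, 50, haploid_cov=25)
```

Note that a pair where both kmers sit at 2n coverage lands on AABB either way; the projection alone cannot tell a tetraploid locus from two homozygous near-identical loci in a diploid.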

I think you mostly got this right now, no worries :-)

K