Interpreting smudgeplot output

eweinheimer commented 1 year ago

Hello,

Just hoping to get some insight into these smudgeplots that I have generated based on Illumina HiSeq data. These are for two closely related tree species. I've also attached the genomescope plots, for which I ran diploid and tetraploid models. Previous karyotyping studies have shown populations of these species to be diploid, triploid, tetraploid, and octaploid, though tetraploid and octaploid are the most frequently observed ploidy levels. 2C DNA content estimates across these studies, as well as our own, are inconsistent, but based on our flow cytometry estimates, the haploid genome size should be ~500 Mbp. Haploid chromosome number is 13.

I am trying to reconcile these findings from the literature with what I'm seeing in these plots. I've fiddled around with the kmer length and that has changed the estimated heterozygosity and genome size, but ploidy is still predicted diploid. I'm having trouble being convinced of that, though, and I see from the information on this page that the model can sometimes be wrong for a variety of reasons, some of which we may be dealing with here. I'd be curious to hear your interpretation of these plots and any suggestions you may have for teasing apart this issue further. Mainly hoping to see if we can interpret anything about genome size, heterozygosity, or auto vs. allopolyploidy here.

Thank you in advance!

Species 1: tor_smudge, tor_gscope4, tor_gscope2 (attached plots)
Species 2: rob_smudge, rob_gscope4, rob_gscope2 (attached plots)

KamilSJaron commented 1 year ago

Hi,

I will give you a short answer first, and I hope I will be able to get back to you in a month or so with a much better solution.

I think you have an allotetraploid. The A <-> B divergence is already very substantial, so smudgeplot has trouble picking up the AABB signal. The thing is, Gene Myers found a logical mishap we left in smudgeplot: it causes a greater "drop" of the higher-ploidy k-mers into the lower-ploidy brackets (namely, when there are too many overlapping variants in polyploids). That is likely the reason why smudgeplot is telling you diploid. If you would like to try Gene's algorithm, you can try https://github.com/thegenemyers/MERQURY.FK. But we are working on merging the two tools together, so bear with me!

Sorry for not describing it here in more detail, the explanation is very nuanced, I need to write it up properly.

But in any case, all polyploids, and this applies doubly to allotetraploids, will over time go from an AABB or AAAB signal towards an AB signal, as the non-recombining homoeologous genomic copies diverge. Those are the border cases we should probably call degenerated tetraploids.

I think you can quite safely use the tetraploid genomescope model; it estimates the AABB structure quite nicely.

eweinheimer commented 1 year ago

Thanks so much, this explanation was very helpful. I will definitely look into using Gene Myers' program and look forward to hearing more from you when you have the chance. In the meantime, I will proceed with my analysis assuming allotetraploidy.

KamilSJaron commented 1 year ago

@weinei18 we now have a beta version working with a PloidyPlot backend and a Smudgeplot front-end. If you would like to give it a try, I can send you instructions on how to get started.

KamilSJaron commented 1 year ago

I will also close this issue for now, but do get in touch if anything comes up.

eweinheimer commented 1 year ago

@KamilSJaron Yes, I would be very interested to try the beta version!

KamilSJaron commented 1 year ago

Excellent. Presumably you already have smudgeplot, but if you don't, clone the repository:

git clone https://github.com/KamilSJaron/smudgeplot.git

then, from inside the repository directory, also pull the development branch:

cd smudgeplot
git pull origin sploidyplot

Now you have downloaded the beta version. There is a readme file with installation instructions and everything you need to know to run it (hopefully; it's a beta after all). Let me know if it works for you.

KamilSJaron commented 1 year ago

I reopened the issue and added it to the project directory, so once you manage to get plots from the new version, we can compare them here.

eweinheimer commented 1 year ago

@KamilSJaron got it working! Didn't have any issues running it, other than those due to my own failure to install things properly. Here are the updated plots for Species 1 for a haploid coverage of 50x and the commands I ran to get there.

FastK -v -t4 -k21 -M100 -T56 JP-Vtor-GenomeSR1_R1_001_val_1.fq.gz JP-Vtor-GenomeSR1_R2_001_val_2.fq.gz -NVtor_FastK_Table
PloidyPlot -e10 -v -T56 -oVtor_kmerpairs Vtor_FastK_Table
smudgeplot/exec/smudgeplot.py plot -n 50 -t Vtor -o Vtor_smudge2.0 Vtor_kmerpairs_text.smu
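
For anyone repeating these three steps on another sample, the commands can be parameterised. The following is only a minimal dry-run sketch: the flags and file names are copied from the commands above, while the variable names, the run helper, and the DRY_RUN switch are hypothetical and not part of FastK, PloidyPlot, or smudgeplot. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Dry-run sketch of the FastK -> PloidyPlot -> smudgeplot steps shown above.
set -eu

sample="Vtor"     # prefix used for all output names
k=21              # k-mer length passed to FastK
threads=56        # value passed to -T
cov=50            # haploid coverage passed to smudgeplot via -n
r1="JP-Vtor-GenomeSR1_R1_001_val_1.fq.gz"
r2="JP-Vtor-GenomeSR1_R2_001_val_2.fq.gz"

DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 = print only, 0 = execute

# run: hypothetical helper that prints a command, or executes it when DRY_RUN=0
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

pipeline() {
  # 1. count k-mers
  run FastK -v -t4 "-k${k}" -M100 "-T${threads}" "$r1" "$r2" "-N${sample}_FastK_Table"
  # 2. extract k-mer pairs
  run PloidyPlot -e10 -v "-T${threads}" "-o${sample}_kmerpairs" "${sample}_FastK_Table"
  # 3. draw the smudgeplot
  run smudgeplot/exec/smudgeplot.py plot -n "${cov}" -t "${sample}" -o "${sample}_smudge2.0" "${sample}_kmerpairs_text.smu"
}

pipeline
```

Setting DRY_RUN=0 would execute the commands instead of printing them, assuming the tools are installed and on the PATH.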

KamilSJaron commented 1 year ago

Oh my gosh, thank you so much for this!!! They look amazing.

I recently found out that the AB smudge is possibly a tiny bit misplaced; I will be pushing more changes soon.

Biologically speaking, it is interesting that you still have such a strong AB smudge. The smudgeplot clearly indicates a complicated genome structure: it has by far too many dups for a diploid, but too many "diploid" loci for a tetraploid. It could just as well be a degenerated tetraploid or something even more complex. It does look interesting! What's the species?
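
As a rough guide for reading the plot at a haploid coverage of 50x: in the idealised case, a smudge sits at a total pair coverage of (number of copies) x (haploid coverage) and at a minor fraction of (B copies) / (all copies). The sketch below just does that arithmetic; the function name is made up for illustration, and real smudges are of course spread around these positions.

```shell
#!/bin/sh
# Idealised smudge positions for a given haploid coverage.
n=50   # haploid coverage, as used with "smudgeplot.py plot -n 50"

# smudge_position LABEL COVERAGE (hypothetical helper)
# Prints the expected total coverage of the k-mer pair and the
# unreduced minor fraction (B copies / all copies) for a label like AABB.
smudge_position() {
  label=$1
  total=${#label}                                              # number of genome copies
  bs=$(printf '%s' "$label" | tr -cd 'B' | wc -c | tr -d ' ')  # number of B copies
  printf '%s\tsum=%sx\tminor=%s/%s\n' "$label" "$((total * $2))" "$bs" "$total"
}

for s in AB AAB AABB AAAB; do
  smudge_position "$s" "$n"
done
```

So at 50x, AB is expected around 100x total coverage, while AABB and AAAB are both expected around 200x and differ only in the minor fraction (1/2 versus 1/4).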

eweinheimer commented 1 year ago

Certainly! I will continue to check back for the new changes. I'm not able to share the species name at this time, but I can say this... Our current theory is that the ancestor of the clade was a hybrid. Old cytogenetic studies and our genomescope/smudgeplot analyses show higher ploidies and weird patterns in many species within this particular genus, at least on a continental scale, and sister clades appear to be almost exclusively diploid. Divergence time is estimated to be 20-40 Mya, so the degenerated tetraploid theory could fit quite nicely. There is still a lot to be teased apart because there is very little information out there about them.

Thank you for your insight, it has given us a lot to consider. Very interesting indeed!

KamilSJaron / smudgeplot

Interpreting smudgeplot output #117