KamilSJaron / smudgeplot

Inference of ploidy and heterozygosity structure using whole genome sequencing data
Apache License 2.0
223 stars 24 forks source link

Interpretation of smudgy smudgeplot for diatom genome #108

Closed sarahfrail closed 1 year ago

sarahfrail commented 1 year ago

My group is working on sequencing the genome of a freshwater diatom, one member of a large class of unicellular algae. We have no previous knowledge or indication of its ploidy, though most published diatom genomes are diploid. This genome is more repetitive (>60%) and the assembly size (~500Mb) nearly twice as large as any previously published genome. This repetitiveness and large size of the assembly (in addition to genomescope evidence and individual inspection of k-mer alignments to our genome) made us think there might be some polyploidy at play.

We have been doing the assembly with both nanopore and PE150 illumna reads. Recently we have been running Genomescope and Smudgeplot on our Illumina reads.

Illumina data has been trimmed, deduplicated, and filtered for Q>20, k-mers counted with jellyfish then ran the following lines in smudgeplot. I have generated plots for a range of k-mers from 13-31 but here I will include plots for k=21.

jellyfish dump -c -L 45 -U 1000 My_PE_Illumina_data_21mers.jf | smudgeplot.py hetkmers -o My_21mer_pairs_L45U1000
smudgeplot.py plot My_21mer_pairs_L45U1000_coverages.tsv -o My_21mer_pairs_L45U1000 -k 21

Resulting log10 smudgeplot looks like this. No warnings.

21mer_smudge_linearandlog

Most smudgeplots for k >= 19 are proposed diploid. For smaller k and and if I plot with -q 0.9, I often get proposed tetraploid.

Here is the 21-mer genomescope:

21mer_linearandlog

Our interpretation of the genomescope (which we ran before smudgeplot) was that our organism is potentially tetraploid, possibly even hexaploid or octoploid when looking at the smaller, higher coverage bumps in the log plot. In addition, the log of the smudgeplot seems to my eyes to have smudges present at 6, 7 and 8N in the log plot that aren't detected and labeled. Could we be having a similar diploid prediction of a highly heterozygous polyploid genome, similar to the allohexaploid wheat example given in the paper?

Any advice on reconciling the genomescope and smudgeplot, or just rectifying misunderstandings we have about interpreting these plots, is much appreciated.

Thanks for for this tool and genomescope! It's been a great help

KamilSJaron commented 1 year ago

Very very cool. I am glad you have such a beautiful sequencing run of diatoms!

It really is beautiful, it can't get much better than this - all your peaks are very clearly separated and there seems to be only very little of any background noise. That makes me quite confident that your true 1n coverage is indeed ~100x. Not Genomescope or smudgeplot gave you any indication it could be half of it, so this is something you can be fairly sure about, so let's proceed assuming you got the coverage right.

There are 4 clear peaks in GenomeScope - 1n and 2n peaks are really large, 3n and 4n peaks are rather small. Usually, "the ploidy peak" is the highest one, but there are exceptions to this rule - highly heterozygous species have 1n peak higher, and degenerated tetraploids (usually allotetraploids) can have their 4n peak quite a bit smaller than 1n and 2n peaks. However, just considering genomescope, my naive guess would be diploid with some duplications.

I really like to look at Smudgeplot in these cases, it tells us something about relationship of the individual genomescope peaks - are the any similar sequences in any other peaks? Which ones? In you case it seems that majority of similar sequences are pairing 1n - 1n peak (expected if the species is diploid). Furthermore, you have a big array of smudges that are weakly pairing all possible peaks (nicely showing up on log scale) - this makes me think it's probably not a single event that generated them (it would be less gradual), so my best bet would be that this is just a genome with loads of duplication (perhaps some lazy transposons?).

I think you have a great idea about what to expect form the genome now, I would try to assemble it and run KAT or Mercury. I am sure it will make all sense and if it is a tetraploid in the end (my intuition can be wrong), you should not have problems assembling the homoelog sequences separately because they would be substantially diverged.

Hope this helps :-)

P.S. One of my pandemic project was to do a shirt with Heckel's illustration of diatoms image

sarahfrail commented 1 year ago

Hi Kamil,

Thanks so much for your detailed response! We've spent the last several months wrapping our heads around the data in lots of different ways.

We're fairly sure that our Merqury results are consistent with diploid, and now we think we must just have some interesting transposon related duplication like you suggested. Soon, I will try to make some phylogenies from the repeats to try to get at this question. We may yet try to perform a haploid resolved assembly, but we are somewhat limited by sequencing depth and read length, so we've yet to see if this is viable. We also now think our genome is closer to 80% repetitive, which introduces further complications. These organisms are so fun and interesting, but sometimes difficult to work with!

The shirt is beautiful! Thank you for sharing. The diatoms really are living art pieces.