Repetitive mollusc genome interpretation

mason-linscott commented 2 years ago

I have had some troubles understanding my smudgeplot of a highly repetitive genome that has undergone LTR expansion recently. I counted 21-mers from 10x linked reads with the barcode information pruned from the R1 reads uisng KMC and then input that into smudgepairs. Coverage of the 10x data is approximately 60x.

smudgeplot.py plot 21kmc_L15_U2700_coverages.tsv -o "Land_snail" -t "O. idahoensis"

smudgeplot indicates the genome is a diploid which agrees with nQuire estimates on high map quality alignments. However, it also appears there is a substantial signal for triploidy. Is it possible this a signal of the recently expanded LTR elements? Or, is this artifact perhaps due to lower coverage than desired for the genome? I would appreciate hearing your interpretation of this plot.

All the best, M. Land_snail_smudgeplot

KamilSJaron commented 2 years ago

Hi M.,

10x is not the best for making smudgeplots, the coverage variation is a bit higher compared to other types of libraries (surprisingly even PacBio HiFi is better for kmer spectra analyses). So I think your analysis will be a bit more sensitive to the L parameter you chose. What did you use? And can you show us also a genomescope plot? That helps a lot to decide about a meaningful L.

I am a bit puzzled of the 1n coverage estimate, I am not 100% how smudgeplot ended up on that number... (the labels are not well overlaping with the smudges). But I also supect your L was too low and the bottom part of the smudgeplot is simply error kmers paired with the unique genomic kmers and then the upper smudge is the real diploid one (which would also estimate the 1n coverage a bit closer to the expected 30x, assuming that the 60x is what people call genome coverage, not per-haplotype coverage).

mason-linscott commented 2 years ago

Hi Kamil,

Thank you for looking into this! Here is the genomscope plot made using 31-mers (high K chosen due to genome repetitiveness). I was not sure whether to also apply a higher K for smudgeplot and just followed the guide. I hope it helps with your interpretation.

L was set to 15 and U to 2700 based on the output from the smudgeplot.py cutoff. I can raise L on the next smudgepairs run (takes a day and a half).

Thanks again, Mason

KamilSJaron commented 2 years ago

Hi Mason,

that is a giant genome. That must be a huge dataset you are dealing with.

So, the first thing first. Your sequencing is not the cleanest (which is understandable for giant mollusc genomes) - the error peak and genome peak are overlapping. The problem is that you don't have a clean cut between errors and real genomic kmers, which is what we want L to be.

Good news is that the genomescope looks alright (both het and homo zygous peaks are kind of visible and the model relativelly well explains the data). Also, you have very little evidence of polyploidy (assuming your 1n coverage fit is right), because those usually have some of the higher coverage peaks relatively high as well. So the only thing I would suggest to you is to redo the smudgeplot PLOT with a coverage prior from the genomescope model. You can run it as:

smudgeplot.py plot 21kmc_L15_U2700_coverages.tsv -n 18 -o "Land_snail" -t "O. idahoensis"

I am suspecting you will get the peaks annotated right this time (as AB and AABB). I suspect the proportion of AB vs AABB kmers will be a bit different and quite possibly favouring AABB. That practically means that the 2n peak kmers have very often a very similar kmers within the same 2n peak. It is a puzzling signature for genomes with high heterozygosity, but perhaps we should not be bothered by this that much given how much troubles we had separating 1n and error peaks. Perhaps just keep in mind for downstream analyses that you might have quite a lot of paralogy in your genome (and keep even deeper in your mind that there polyploidy was not conclusively rejected in this analysis).

KamilSJaron / smudgeplot

Repetitive mollusc genome interpretation #98