Wrong coverage estimate?

jgnunes commented 1 year ago

Hi Kamil,

First of all, thanks for developing on such a great tool.

I've been trying to run Smudgeplot on a coral genome. The species (Tubastraea tagusensis) was assumed to be diploid according to karyotype experiments. So I took the complete whole genome Illumina sequencing data and ran GenomeScope (all commands at the end of the issue), which gave me an estimated haploid coverage of 152x:

However, Smudgeplot infers a much lower (39x) coverage and raises a warning message:

#############

SUMMARY

############# !! Warning, your coverage filter on the lower end (L = 30 ) is higher than half of the 1n coverage estimate ( 1n / 2 = 19.55 If the real 1n coverage is half of your estimate you would not picked it up due to the filtering. If you have sufficient coverage, consider reruning the analysis with lower L (something like (1n / 2) - 5) One good way for verificaiton would be to compare it to GenomeScope estimate of haploid coverage

As suggested, I've tried running Smudgeplot with a lower L (L=15), but this time the estimated coverage was even lower (~ 20x) and one more time the warning message was raised, asking me to decrease the value of L again:

#############

SUMMARY

############# !! Warning, your coverage filter on the lower end (L = 15 ) is higher than half of the 1n coverage estimate ( 1n / 2 = 9.75 If the real 1n coverage is half of your estimate you would not picked it up due to the filtering. If you have sufficient coverage, consider reruning the analysis with lower L (something like (1n / 2) - 5) One good way for verificaiton would be to compare it to GenomeScope estimate of haploid coverage

So the questions are:

Could you make sense of this apparent inconsistency between Smudgeplot and Genomescope coverage estimates? I'm afraid we can't trust on Smudgeplot results until this inconsistency is solved for my data, do you agree? Is there anything else you would suggest me to try?

I was using the latest bioconda Smudgeplot version (v0.2.5) and here are the commands run:

K-mer counting and Genomescope:

# count kmers
kmc -k21 -t16 -m200 -ci1 -cs10000 @FILES kmer_counts tmp
# get histo from kmers count
kmc_tools transform kmer_counts histogram kmer_k21.hist
# run genomescope
Rscript ~/my_files/softwares/genomescope2.0/genomescope.R -i kmer_k21.hist -k 21 -o . -n Ttagusensis_genomescope

Smudgeplot:

# filtering kmers based on coverage (L=30, U=2000)
kmc_tools transform kmer_counts -ci30 -cx2000 reduce kmcdb_L30_U2000
# build heterozygotic kmer pairs
smudge_pairs kmcdb_L30_U2000 kmcdb_L30_U2000_coverages.tsv kmcdb_L30_U2000_pairs.tsv > kmcdb_L30_U2000_familysizes.tsv
# running Smudgeplot
smudgeplot.py plot kmcdb_L30_U2000_coverages.tsv

Looking forward to hearing from you, Thanks

KamilSJaron commented 1 year ago

Yep, indeed smudgeplot underestimated a lot the coverage (not 100% why, but I suspect it will have something to do with how high coverage you actually have).

Would you try to help Smudgeplot by indicating your coverage expecation? Something like...

smudgeplot.py plot -n 150 kmcdb_L30_U2000_coverages.tsv

But indeed, this is not optimal. I will keep this in mind when refining the algorithm

jgnunes commented 1 year ago

Thanks for the quick response, @KamilSJaron. I've rerun smudgeplot using the suggested command and it finished successfully:

However I still have two doubts:

There's still a warning message being generated, but I think I should not worry about it. What do you think?

#############

SUMMARY

############# !! Warning, the two types of estimates of 1n coverage differ a lot ( 39.1 and 152.864478768113 ) i.e. at least of one of the smudgeplot methods to estimate the haploid coverage got it wrong Using subset estimate instead of highest peak estimate (less precise but also less often completely wrong) Does the smudgeplot look sane? Is at least one of the 1n estimates close a GenomeScope estimate? You can help us imrove this software by sharing this strange smudgeplot on https://github.com/KamilSJaron/smudgeplot/issues.

Smudgeplot predicts my genome to be triploid, while the Genomescope profile infers p:2 and previous karyotyping experiments made by our team had shown a diploid pattern. Do you think Smudgeplot may be wrongly inferring the ploidy here or any idea on how to explain the discrepancy between Smudgeplot and Genomescope?

Best,

KamilSJaron commented 1 year ago

Ha, this is EXCATLY why Smudgeplot is useful for interpretation of GenomeScope plots!

Let's start with your first worry - the warning is there, because we have two different strategies to estimate 1n coverage, and on your a bit messy dataset (a bit messy in sense that you have plenty of k-mers that are some sort of contaminants or errors with fairly high coverage) the methods fail, it needed your input to get it right. HOWEVER, you, as a user and genomicist should know 1. what is the ballpark of your coverage 2. be able to tell which of the predicted numbers makes sense. THere is only one truth, and I think in your case it's fairly obvious what the truth is! 152x by smudgeplot is perfectly matching the 152 by genomescope, the 39 is a lot less than what one would expect given your k-mer profile (but Smudgeplot does not see your k-mer profile so is unable to make that judgement for you). In anyway, this warning is to ask users to think what is possible and if by any chance the other estimate would be more meaningful given the other data, but again, that is def not your case.
GenomeScope DOES NOT predict any ploidy, genomescope fits the ploidy you specify by the parameter -p, with -p 2 being the default. I would recommend to rerun the model with -p 3, you will get a better fit. And moreover - this genome IS nearly certainly a triploid. That is one of the few genomic signatures that are very easy to pick up with smudgeplot and is unmistakenable (at least unmistakenable with diploid, in theory it could be very homozygous hexaploid, but that would be extremly unlikely givn your data). Did you karyotype the same individual you have sequenced? If not, is there a chance of ploidy variation in this species? Perhaps a different, but less likely explanation is if the coral would be mixture of multiple "individuals" in a 1:2 ratio. That does not fit very well the data because the stochiometry of the peaks is very close the 1:2:3 expecation, and the AAB smudge has both the expect A+B coverage and B / (A + B) ~ 0.33 ratio. There was a wolf lichen I have while back were we suspected the mixed haplotype scenario, but there the data was a bit different (and also the individual cells were suspected monoploids, not diploids).

I hope my stream of consciousness is understandable.

jgnunes commented 1 year ago

Hi @KamilSJaron

The original issue was definetely solved, so I'm closing it, thanks a lot!

Regarding the detected triploidy, I need to go over some lab documentation and literature to answer your questions ("Did you karyotype the same individual you have sequenced? If not, is there a chance of ploidy variation in this species?"), but again, since this goes beyond the original issue, I'm closing it.

Best regards,

KamilSJaron / smudgeplot

Wrong coverage estimate? #131

SUMMARY

SUMMARY

SUMMARY