Open margaretc-ho opened 1 year ago
To try to address the question of how read data from a haploid organism would look, we also took short read data from Cryptosporidium (known to be haploid) and we also ran these through Genomescope and Smudgeplot.
This is what the Crypto data shows for p=1 in GenomeScope2.0
This is what the Crypto data shows for p=2 in Genome Scope2.0
If this is known to be haploid, I am a bit worried about the fact that the error rate is lower for p=2 than it is for p=1 model in Genomescope2.0. Any thoughts about this?
And the smudgeplot for Crypto: (same commands as were run for TC and TM in the analyses above)
Running the crypto smudgeplot also yielded a similar warning about coverage filter L
detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to 35
detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to 30
detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to 25
detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to 20
detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to 18
detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to 16
detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to 14
!! Warning, your coverage filter on the lower end (L = 12 ) is higher than half of the 1n coverage estimate ( 1n / 2 = 7.8
If the real 1n coverage is half of your estimate you would not picked it up due to the filtering.
If you have sufficient coverage, consider reruning the analysis with lower L (something like (1n / 2) - 5)
One good way for verificaiton would be to compare it to GenomeScope estimate of haploid coverage
Would appreciate any thoughts you have about these runs!
I will also add that we did run on diploid/aneuploid organism (Leishmania) and the smudgeplot for this one was quite cool.
Hi,
I am hurry preparing for the GenomeScope workshop tomorrow, in short.
TM
- is either super homozygous diploid, or (more likely) haploid. The quality of genomescope fit is not necessarily the best indicator of which model is the biologically more relevant if the quality of fit differs just marginally. It's more like that if you see the fit is really bad you know the model is just wrong (in one way or another). So, I think this is much as you can get out of profiling of this species.
The TS
does not really show much. As you say, the problem is the L threshold is too low. The -e12
in PloidyPlot is the -L parameter in smudgeplot (that is one of the reasons why I created the smudgeplot.py hetmers
wrrapers that uses parameters consistent with previous versions). So, I would suggest something like PloidyPlot -e70 -k -v -T16 Tc_gIlluminaFastK_Table
The Crypto data is a lot more messy, it's really hard to say if that's because there is no clear separation of peaks in the k-mer spectra.
Leishmania does look quite funky. By the way, smudgeplot does look nice, and shows quite clearly that the gneomescope model is wrong, I would try to rerun it and with a coverage prior (what smudgeplot suggested) and perhaps trying both diploid and tetraploid genomes.
Thanks for the quick reply! It is interesting to hear your thoughts on TM being potentially haploid or fairly homozygous diploid, I think the collaborators will be interested to hear and follow up with ploidy experiments with both TM and TC. Thanks for the explanation of L a bit better, I think I can choose appropriate L a little better now.
I did fiddle around and rerun TC and leishmania last week, here are the results:
TC rerun with -e70 parameter, it does look intriguing. It ran without warnings
Running Genomescope p=3 and p4 out of curiousity on TC: p=3
p=4
I suspect the larger smear/peak of higher coverage k-mers is making it hard to resolve the ploidy of TC? Any thoughts?
For Leish which as been reported to have a LOT of aneuploidy, I think these Genomescope plots are definitely showing higher ploidy is possible p=3
p=4
Smudgeplot with L=7
It does seem like choosing L is tricky, this was the warning I got from running L=7
detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to 35
detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to 30
detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to 25
!! Warning, the two types of estimates of 1n coverage differ a lot ( 11.9 and 58.4899889386491 )
i.e. at least of one of the smudgeplot methods to estimate the haploid coverage got it wrong
Using subset estimate instead of highest peak estimate (less precise but also less often completely wrong)
Does the smudgeplot look sane? Is at least one of the 1n estimates close a GenomeScope estimate?
You can help us imrove this software by sharing this strange smudgeplot on https://github.com/KamilSJaron/smudgeplot/issues.
!! Warning, your coverage filter on the lower end (L = 7 ) is higher than half of the 1n coverage estimate ( 1n / 2 = 5.95
If the real 1n coverage is half of your estimate you would not picked it up due to the filtering.
If you have sufficient coverage, consider reruning the analysis with lower L (something like (1n / 2) - 5)
One good way for verificaiton would be to compare it to GenomeScope estimate of haploid coverage
This L seems about right though, from looking at the Genomescope plot
I think that's about it, but any thoughts are welcome.
WHen you run genomescope and smudgeplot they need to make sense together. The 1n coverage should be at least moreless consistent between the two. Both those software are able to take coverage priors which should allow you to make them converge somewhere close to the prior value. It's paramater -n
in smudgeplot, and -l
in genomescope.
So, your -p 4
genomescope of the TC, should be run with -p 4 -n 100
to make the two consistent. Unfortunatelly, in that sample, the AB smudge does not look super clear - I have hard times deciding if it's noise or not. If so, maybe the ploidy is actually misinterpreted (the AABB smudge would be AB - it changes the perspective on where is the 1n peak).
The same with Leisch. WHat your original smudgeplot shown was a smudge indicating the 1n coverage is ~58 (and that is a solid and very reliably looking smudge). Give that prior to genomescope, I think that would be the right model.
Hi Kamil and folks,
Thanks for creating this clever method to examine genome structure using k-mers and releasing the new beta version for us to test out. I'm hoping to get some feedback on how to optimize these plots and interpret the results from running it on our data. I have been using jellyfish/GenomeScope2.0 and smudgeplot (the latest sploidyplot branch) using FASTK for this analysis.
We have two protist species we are working on, one called TM and the other called TC of unknown ploidy. We are trying to determine the ploidy of these two species using your software:
TM species -- Genomescope2.0 plot (p=2 yielded the lowest error):
TM species -- FASTK and smudgeplot commands
TM species -- Smudgeplot:
Smudgeplot seemed to run without errors or warnings in the case of species TM
TC species -- Genomescope2.0 plot (p=2 yielded the lowest error):
TC species -- FASTK and smudgeplot commands
TC species Smudgeplot:
Smudgeplot gave the following warning in the case of TC:
First, how do I resolve this warning/error? Reading it, the first part says to "rerun analysis with lower L" but then the second part advises to "increasing L might help remove erroneous kmers". Should I go lower or higher? :P How do I vary L in
smudgeplot plot
and how do I determine what is an appropriate L? (i.e what level would you suggest given the kmer coverage in the Genomescope plot?)Second, do you think that these data are consistent with the smudgeplot prediction that these organisms are diploid? One of the questions we also had was whether Genomescope2.0 and Smudgeplot give enough information to distinguish diploid vs. haploid organisms. In the next comment, I will also show data from trying to run Genomescope2.0 and smudgeplot on a known haploid organism for comparison.
Thank you!