KamilSJaron / smudgeplot

Inference of ploidy and heterozygosity structure using whole genome sequencing data
Apache License 2.0
240 stars 23 forks source link

Request for feedback on smudgeplot optimization and interpretation (diploid? haploid?) #133

Open margaretc-ho opened 1 year ago

margaretc-ho commented 1 year ago

Hi Kamil and folks,

Thanks for creating this clever method to examine genome structure using k-mers and releasing the new beta version for us to test out. I'm hoping to get some feedback on how to optimize these plots and interpret the results from running it on our data. I have been using jellyfish/GenomeScope2.0 and smudgeplot (the latest sploidyplot branch) using FASTK for this analysis.

We have two protist species we are working on, one called TM and the other called TC of unknown ploidy. We are trying to determine the ploidy of these two species using your software:

TM species -- Genomescope2.0 plot (p=2 yielded the lowest error):

transformed_log_plot transformed_linear_plot

TM species -- FASTK and smudgeplot commands

FastK -v -t4 -k31 -M16 -T16 9076_9077_PE_FRAG_R1_R2_readscombined.fastq.gz -NTmFastK_Table
PloidyPlot -e12 -k -v -T16 TmFastK_Table 
smudgeplot.py plot -t Tm_gIllumina_MiSeq -o Tm_gIllumina_MiSeq_run FastK_Table_text.smu 

TM species -- Smudgeplot:

image image

Smudgeplot seemed to run without errors or warnings in the case of species TM

TC species -- Genomescope2.0 plot (p=2 yielded the lowest error):

transformed_log_plot transformed_linear_plot

TC species -- FASTK and smudgeplot commands

FastK -v -t4 -k31 -M16 -T16 ${Tc_gIlluminatrim_1_fq_gz} ${Tc_gIlluminatrim_2_fq_gz} -NTc_gIlluminaFastK_Table
PloidyPlot -e12 -k -v -T16 Tc_gIlluminaFastK_Table
smudgeplot.py plot -t Tc_gIllumina -o Tc_gIllumina_run Tc_gIlluminaFastK_Table_text.smu 

TC species Smudgeplot:

image image

Smudgeplot gave the following warning in the case of TC:

detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to         35
detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to         30
!! Warning, your coverage filter on the lower end (L =  12      ) is higher than half of the 1n coverage estimate ( 1n / 2 =    10.55
If the real 1n coverage is half of your estimate you would not picked it up due to the filtering.
If you have sufficient coverage, consider reruning the analysis with lower L (something like (1n / 2) - 5)
One good way for verificaiton would be to compare it to GenomeScope estimate of haploid coverage
1 peaks of kmer pairs detected with coverage < (1n_coverage * 2) = 42.2
  kmers_in_peak[#] kmers_in_peak[proportion] summit B / (A + B) summit A + B
1           202830                      0.35               0.48        26.12
This might be due to kmers with sequencing errors present in the kmer dump.
Increasing L might help remove erroneous kmers.

First, how do I resolve this warning/error? Reading it, the first part says to "rerun analysis with lower L" but then the second part advises to "increasing L might help remove erroneous kmers". Should I go lower or higher? :P How do I vary L in smudgeplot plot and how do I determine what is an appropriate L? (i.e what level would you suggest given the kmer coverage in the Genomescope plot?)

Second, do you think that these data are consistent with the smudgeplot prediction that these organisms are diploid? One of the questions we also had was whether Genomescope2.0 and Smudgeplot give enough information to distinguish diploid vs. haploid organisms. In the next comment, I will also show data from trying to run Genomescope2.0 and smudgeplot on a known haploid organism for comparison.

Thank you!

margaretc-ho commented 1 year ago

To try to address the question of how read data from a haploid organism would look, we also took short read data from Cryptosporidium (known to be haploid) and we also ran these through Genomescope and Smudgeplot.

This is what the Crypto data shows for p=1 in GenomeScope2.0 transformed_linear_plot transformed_log_plot

This is what the Crypto data shows for p=2 in Genome Scope2.0

transformed_linear_plot transformed_log_plot

If this is known to be haploid, I am a bit worried about the fact that the error rate is lower for p=2 than it is for p=1 model in Genomescope2.0. Any thoughts about this?

And the smudgeplot for Crypto: (same commands as were run for TC and TM in the analyses above)

image image

Running the crypto smudgeplot also yielded a similar warning about coverage filter L

detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to         35
detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to         30
detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to         25
detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to         20
detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to         18
detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to         16
detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to         14
!! Warning, your coverage filter on the lower end (L =  12      ) is higher than half of the 1n coverage estimate ( 1n / 2 =    7.8
If the real 1n coverage is half of your estimate you would not picked it up due to the filtering.
If you have sufficient coverage, consider reruning the analysis with lower L (something like (1n / 2) - 5)
One good way for verificaiton would be to compare it to GenomeScope estimate of haploid coverage

Would appreciate any thoughts you have about these runs!

margaretc-ho commented 1 year ago

I will also add that we did run on diploid/aneuploid organism (Leishmania) and the smudgeplot for this one was quite cool.

transformed_linear_plot transformed_log_plot

image image
KamilSJaron commented 1 year ago

Hi,

I am hurry preparing for the GenomeScope workshop tomorrow, in short.

TM - is either super homozygous diploid, or (more likely) haploid. The quality of genomescope fit is not necessarily the best indicator of which model is the biologically more relevant if the quality of fit differs just marginally. It's more like that if you see the fit is really bad you know the model is just wrong (in one way or another). So, I think this is much as you can get out of profiling of this species.

The TS does not really show much. As you say, the problem is the L threshold is too low. The -e12 in PloidyPlot is the -L parameter in smudgeplot (that is one of the reasons why I created the smudgeplot.py hetmers wrrapers that uses parameters consistent with previous versions). So, I would suggest something like PloidyPlot -e70 -k -v -T16 Tc_gIlluminaFastK_Table

The Crypto data is a lot more messy, it's really hard to say if that's because there is no clear separation of peaks in the k-mer spectra.

Leishmania does look quite funky. By the way, smudgeplot does look nice, and shows quite clearly that the gneomescope model is wrong, I would try to rerun it and with a coverage prior (what smudgeplot suggested) and perhaps trying both diploid and tetraploid genomes.

margaretc-ho commented 1 year ago

Thanks for the quick reply! It is interesting to hear your thoughts on TM being potentially haploid or fairly homozygous diploid, I think the collaborators will be interested to hear and follow up with ploidy experiments with both TM and TC. Thanks for the explanation of L a bit better, I think I can choose appropriate L a little better now.

I did fiddle around and rerun TC and leishmania last week, here are the results:

TC rerun with -e70 parameter, it does look intriguing. It ran without warnings

image image

Running Genomescope p=3 and p4 out of curiousity on TC: p=3

image image

p=4

image image

I suspect the larger smear/peak of higher coverage k-mers is making it hard to resolve the ploidy of TC? Any thoughts?

margaretc-ho commented 1 year ago

For Leish which as been reported to have a LOT of aneuploidy, I think these Genomescope plots are definitely showing higher ploidy is possible p=3

image image

p=4

image image

Smudgeplot with L=7

image image

It does seem like choosing L is tricky, this was the warning I got from running L=7

detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to     35
detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to     30
detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to     25
!! Warning, the two types of estimates of 1n coverage differ a lot (    11.9    and 58.4899889386491    )
i.e. at least of one of the smudgeplot methods to estimate the haploid coverage got it wrong
Using subset estimate instead of highest peak estimate (less precise but also less often completely wrong)
Does the smudgeplot look sane? Is at least one of the 1n estimates close a GenomeScope estimate?
You can help us imrove this software by sharing this strange smudgeplot on https://github.com/KamilSJaron/smudgeplot/issues.
!! Warning, your coverage filter on the lower end (L =  7   ) is higher than half of the 1n coverage estimate ( 1n / 2 =    5.95
If the real 1n coverage is half of your estimate you would not picked it up due to the filtering.
If you have sufficient coverage, consider reruning the analysis with lower L (something like (1n / 2) - 5)
One good way for verificaiton would be to compare it to GenomeScope estimate of haploid coverage

This L seems about right though, from looking at the Genomescope plot

I think that's about it, but any thoughts are welcome.

KamilSJaron commented 1 year ago

WHen you run genomescope and smudgeplot they need to make sense together. The 1n coverage should be at least moreless consistent between the two. Both those software are able to take coverage priors which should allow you to make them converge somewhere close to the prior value. It's paramater -n in smudgeplot, and -l in genomescope.

So, your -p 4 genomescope of the TC, should be run with -p 4 -n 100 to make the two consistent. Unfortunatelly, in that sample, the AB smudge does not look super clear - I have hard times deciding if it's noise or not. If so, maybe the ploidy is actually misinterpreted (the AABB smudge would be AB - it changes the perspective on where is the 1n peak).

The same with Leisch. WHat your original smudgeplot shown was a smudge indicating the 1n coverage is ~58 (and that is a solid and very reliably looking smudge). Give that prior to genomescope, I think that would be the right model.