KamilSJaron / smudgeplot

Inference of ploidy and heterozygosity structure using whole genome sequencing data
Apache License 2.0
244 stars 24 forks source link

Using smudgeplot with very high ploidy levels #66

Open cmdoret opened 4 years ago

cmdoret commented 4 years ago

Hi Kamil,

Thanks for the clear documentation and awesome tool.

I have troubles understanding my smudgeplot. I have used follwing command to generate it

smudgeplot.py plot kmcdb_L135_U1000_coverages.tsv -k 21

smudgeplot complained that there were two smudges the same positions and decreased nbins to 12 and it look like this: smudgeplot_smudgeplot_log10

Out of curiosity, I fixed -nbins to 30 to see how it looks, but its basically the same. smudgeplot_smudgeplot_log10

The organism in question is an amoeba (Acanthamoeba castellanii). It is a weird beast, it reproduces asexually (but there is some evidence of meiosis). There are also evidence of reproduction by amitosis, which would make it very homozygous. The exact number of chromosomes per haploid genome is in the range of 30-40, and the ploidy has been estimated to approximately 25. We know the haploid genome size is ~45Mb.

What I fed to smudgeplot is an illumina shotgun library of 223M 2x150bp reads.

Smudgeplot seems to think there are many chromosomes (~500) with a low ploidy level (3-4), but it is likely the opposite. Is it even possible to use smudgeplot with such high levels of ploidies ?

KamilSJaron commented 4 years ago

Hi Cyril, I am glad to see you work on freaky stuff now :-)

(223e6 * 300) / 45e6 = 1486.667 (here 45e6 is the genome size). That's a psycho high coverage! It should be divided by ploidy to get 1n cov estimate, and I totally see why you thought of giving smudgpelot a try.

We did not really thought of 45-ploids when we designed smudgeplot and dependent on the difergence of haplotypes it is / is not expected to work. If there is too much diversity between the ploidies, no kmer pairs will be found as unique (there will be always more than a pair) and indeed in your smudgeplot you don't have that many kmers, maybe there is one problem.

Another thing I noticed is that you have U=1000. Use something MUCH higher. 3000 or 5000. Almost all the kmers are on the upper edge, you got to spread them in the plane. Let it go as high as it needs to go. Also L = 135 might be too high if your you have really a 45 ploid (then you would expect 1n coverage to be 1486 / 45 ~= 33x; here 45 is the guessed ploidy). Would you mind posting here the histogram too? I would be quite curious to see how it looks like...

cmdoret commented 4 years ago

Thank you for all the advice !

I tried running again with L=33 and U=5000. Indeed this increases the estimated ploidy level to 10, with 107 predicted chromosomes: smudgeplot_smudgeplot_log10

By plotting the histogram, you mean with genomescope ? If so, here it is:

log_plot

I'm not sure why it's truncated at 1000, I plotted the histogram using genomescope.R -i kmcdb_k21_L33_U5000.hist -o genomescope -m 5000 -p 2-m 5000

KamilSJaron commented 4 years ago

Cool, we are getting somewhere. We see the most of the kmer pairs still up, but it does look much nicer than before (it's not railing the top edge). The genomic smudges should always look like smudges if they are too much a different shape they are most likely an artefact of some sorts. I also hardcoded ylim to be 10x the 1n estimate. Which is not unfortunatelly messing with your plot. Would you be comfortable adjust that bit of code in your version (l71 in the smudgeplot_plot.R script)? It's rather specialised usage, I might add you a flag "--expect-extreme-ploidies", haha.

The histogram is kind of messy. The truncation due to a flag you specified to KMC (they have the argument that specifies the threshold on the counter). I usually use genomescope to visualize kmerspectra even I don't really care about the model, but maybe in your case you got to write your own R script to give it a reasonable limits for x/y axis.

If the smudgeplot runs fast, I would recommend trying a few more L values, because it's really hard to say from the kmer spectra what is a reasonable L value - I would try 50 and 100 at least.

cmdoret commented 4 years ago

As you suggested, I tried removing the hardcoded threshold (in PR #68) but it doesn't seem to be much better. Here is an example with smudgeplot_plot.R -i "kmcdb_L50_U5000_coverages.tsv" -o "smudgeplot" -k 21 -L 50 -nbins 30 smudgeplot_log_kmcdb_L50_U5000_nbins30
and the same with --extreme_ploidies:

smudgeplot_log_kmcdb_L50_U5000_e_nbins30

I noticed using the -q to keep the top 90 or 95% quantile seems to increase the proposed ploidy, but I'm not sure if that's relevant. smudgeplot.py plot kmcdb_L50_U5000_coverages.tsv -L 60 -q 0.90 -nbins 30 -e

smudgeplot_smudgeplot_log10

Here is the histogram, visualized in R with L=100 and U=5000: hist_L50_U5000 And the log scale plot: log_hist_L50_U5000

I'm not sure the histogram is very helpful...

if I understand correctly, the smudges are too close and appear as a single large smudge ? The -q option seems to separate them better but I'm not sure if it's a good idea to just discard the highly covered ones :/

Thanks a lot for your help :smile: