KamilSJaron / smudgeplot

Inference of ploidy and heterozygosity structure using whole genome sequencing data
Apache License 2.0

Ploidy on mouse genome #46

Closed jafors closed 2 years ago

jafors commented 4 years ago

Hi!

First of all, thanks for your work with smudgeplot. Great idea!

I wanted to try smudgeplot on some WGS data of murine cancer cells, and thought I'd start with a normal control sample taken from the mouse tail. So, the sample should definitely be diploid. It's WGS, and the number of k-mers was pretty large, so I decided to run every chromosome by itself to stop smudgeplot hetkmers from running into memory troubles.

Now, the problem is, though we expect polyploidy in the cancer cell lines, the control should be diploid. The smudgeplots of the various chromosomes show varying ploidies, from di- to tetraploid or even higher.

The L and U cutoffs that were suggested using smudgeplot cutoff are also pretty low compared to what you recommend in the Readme (L≈12, U≈420).

I attached the smudgeplot and genomescope plot for chr1. Do you have any suggestions? Is this kind of input data feasible for smudgeplot? Should I tweak the cutoffs?

Best, Jan

[Attached images: plot, 511950_control_1_smudgeplot, 511950_control_1_smudgeplot_log10]

KamilSJaron commented 4 years ago

Hello @jafors,

thanks for trying out smudgeplots.

> I wanted to try smudgeplot on some WGS data of murine cancer cells, and thought I'd start with a normal control sample taken from the mouse tail. So, the sample should definitely be diploid. It's WGS, and the number of k-mers was pretty large, so I decided to run every chromosome by itself to stop smudgeplot hetkmers from running into memory troubles.

Nice. However, I do wonder if there will be enough genetic variation within the cancer lineage to show "heterozygous" loci. I suppose that the polyploid cancer cells originate from a kind of whole-genome duplication, is that correct? Because in that case there might be a problem: they could be too homozygous to give a genome-wide signal. This also relates to why you get a wrong estimate for the healthy tissue. I am also intrigued by your chromosome-wise decomposition of the computation, is it done by mapping reads?

> Now, the problem is, though we expect polyploidy in the cancer cell lines, the control should be diploid. The smudgeplots of the various chromosomes show varying ploidies from di- to tetra- or even higher ploidies.

That sounds like a problem. We originally designed and tested smudgeplots on very heterozygous species with various high ploidy levels, and for those it worked very well. However, for multiple species with low heterozygosity, we are getting mixed signals of heterozygous loci and paralogs, leading to smudgeplots like the one you show here, where there are more duplicates (AABB) than heterozygous kmers (AB). Note that the "estimated ploidy" is a very naive thing: it simply tells us which ploidy level has the most kmer pairs. The visualisation itself is far more important than the estimate.
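To make the "naive estimate" point concrete, here is a minimal Python sketch of the idea as described above: pick the smudge that holds the most kmer pairs and report the number of letters in its label. This is not smudgeplot's actual code; the function name `naive_ploidy` and the input dictionary are hypothetical, and the two fractions are the approximate chr1 control values quoted further down in this thread.

```python
def naive_ploidy(pair_fractions):
    """Return (smudge label, ploidy) for the smudge holding the most kmer pairs.

    pair_fractions maps a smudge label such as "AB" or "AABB" to the
    fraction of kmer pairs assigned to it (hypothetical input format).
    """
    top = max(pair_fractions, key=pair_fractions.get)
    # The ploidy of a smudge is simply the number of letters in its label.
    return top, len(top)


# Approximate fractions mentioned in this thread for the chr1 control sample.
# The remaining few percent of pairs are not broken down there, so they are omitted.
control = {"AB": 0.40, "AABB": 0.56}
print(naive_ploidy(control))  # ('AABB', 4)
```

With more paralog pairs (AABB) than heterozygous pairs (AB), a rule like this reports a tetraploid even for a diploid sample, which is why the plot itself is more informative than the single number.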

@tbenavi1 is now actively working on better smudgeplot interpretations using simulated data, but I don't think it's production-ready code right now.

> The L and U cutoffs that were suggested using smudgeplot cutoff are also pretty low compared to what you recommend in the Readme (L≈12, U≈420).

The cutoffs depend on the quality of your data, and your data looks beautiful, so the L cutoff is fine. Having too low a cutoff leads to a different pattern than the one you see (too many paralogs).

> I attached the smudgeplot and genomescope plot for chr1. Do you have any suggestions? Is this kind of input data feasible for smudgeplot? Should I tweak the cutoffs around?

For your application in particular, where you know the genome for sure, I would say the best chance is to compare the healthy diploid with the cancer, without using the ploidy estimate directly. Now that you know that ~56% of your kmer pairs are paralogs and 40% are heterozygous kmers, you can check if that changes with the cancer data. You might already see the effect in the kmer spectra. I would expect inflation between 0 and 60x and a smaller homozygous peak. But perhaps I am too naive about what cancer genomes look like.

jafors commented 4 years ago

Thanks! That makes a lot of sense to me.

> I am also intrigued by your chromosome-wise computation decomposition, is it by mapping reads?

Exactly. I had already mapped the reads, so it was pretty easy to just extract them chromosome-wise from the bam. But now that I mention it, I think I should definitely check whether I introduce some weird biases in this step.

I really think that the heterozygosity might be a problem here, since chromosomes where we have even lower het values give some strange results like pentaploidy.

I will stay on the topic and keep you updated, especially on results concerning the cancer cells. I also have karyograms of the samples that kind of show what to expect in terms of genome doubling, so maybe we will get something out of this to make a small use-case for similar studies (or a caveat about what to consider).

Thanks again!

[Attached image: 511950_control_3_smudgeplot_log10]

mflevine commented 4 years ago

Hi, really cool tool! I am trying a similar approach on human cancer to help with ploidy estimation. I ran a test using all the reads from chr2 and made smudgeplots for the tumor and normal. I am seeing a similar issue in the normal. [image]

For the tumor, the ploidy states don't totally match up with the states from the copy number caller. I also had to adjust the bins. [image]

Here is the profile from the copy number caller: [image]

KamilSJaron commented 4 years ago

My apologies. I am about to leave for my winter break, and I don't think I can help you in depth right now.

Just in short: the approach we have developed relies on patterns of heterozygosity dominating over patterns of genome structure (like any paralogy). I have no idea whether this can be the case for cancer-induced poly/aneuploidy. I am lacking a lot of background here. If you feel like it is or should be possible, I am more than happy to discuss it with you over Skype, or here, or maybe @tbenavi1 has some thoughts to share. Either way, I don't think I can help more right now, sorry about that.

P.S. I am quite intrigued by the smudgeplots you shared. They look very well annotated, but I have no idea why they look so different from each other. At the same time, I had very limited time to look at and think about them, so I don't feel like I should be definitive in any way.

KamilSJaron commented 4 years ago

Hello @mflevine, just found that perhaps I should take a look at this idea. Is it still an open question?

mflevine commented 4 years ago

Yea I still think this could be an interesting application.

KamilSJaron commented 2 years ago

Hi, your intuition was extremely right and I really apologise for not noticing before.

The same signal that smudgeplot uses, the coverage sum and the coverage ratio, is good enough to characterise tumor composition. There is a tool that uses very similar source data, HATCHet, although it's not non-parametric (you have to start with reads mapped to the genome), but that should not be a problem with mouse. The paper is here: https://www.nature.com/articles/s41467-020-17967-y
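For readers unfamiliar with that signal, here is a minimal Python sketch of the two quantities for a single kmer pair: the total coverage A + B and the minor coverage ratio B / (A + B). The function names and the `haploid_cov` parameter (an assumed, already-known 1n coverage) are hypothetical, and the rounding rule is a simplification for illustration, not what smudgeplot or HATCHet actually implement.

```python
def pair_signal(cov_a, cov_b):
    """Return (coverage sum, minor coverage ratio) for one kmer pair."""
    total = cov_a + cov_b
    return total, min(cov_a, cov_b) / total


def guess_structure(cov_a, cov_b, haploid_cov):
    """Guess a label such as 'AB' or 'AAB' from the two coverages.

    haploid_cov is the assumed 1n coverage (hypothetical parameter,
    taken as known here).
    """
    total, ratio = pair_signal(cov_a, cov_b)
    # Total copy number: how many haploid coverages fit into the sum.
    copies = max(2, round(total / haploid_cov))
    # Copies carrying the minor kmer, at least one by construction.
    minor = max(1, round(ratio * copies))
    return "A" * (copies - minor) + "B" * minor


print(guess_structure(30, 30, haploid_cov=30))  # AB
print(guess_structure(62, 29, haploid_cov=30))  # AAB
print(guess_structure(60, 60, haploid_cov=30))  # AABB
```

The sum fixes the total copy number of the pair and the ratio fixes how those copies split between the two kmers, which is why the pair of values can separate, say, AAB from AABB.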