KamilSJaron / smudgeplot

Inference of ploidy and heterozygosity structure using whole genome sequencing data
Apache License 2.0
229 stars 24 forks source link

Trouble with interpretation #99

Closed diyasen2021 closed 2 years ago

diyasen2021 commented 2 years ago

Hi Kamil,

I've run into another issue with some other genomes I am working on. These are also plant pathogens. We expect that they are diploids with low heterozyosity. I used the following commands.

ls *.fastq.gz > FILES kmc -k21 -t16 -m64 -ci1 -cs10000 @FILES kmcdb tmp kmc_tools transform kmcdb histogram kmcdb_k21.hist -cx10000 L=$(smudgeplot.py cutoff kmcdb_k21.hist L) U=$(smudgeplot.py cutoff kmcdb_k21.hist U) kmc_tools transform kmcdb -ci"$L" -cx"$U" dump -s kmcdb_L"$L"_U"$U".dump smudgeplot.py hetkmers -o kmcdb_L"$L"_U"$U" < kmcdb_L"$L"_U"$U".dump

Resulting plot SRR7718974_smudgeplot

and here is the genomescope kmer distribution SRR7718974_genomescope where kcov=29.7, which is approx 1/2 of 1n estimate of smudgeplot.

So if I force smudgeplot to use 1n=30, smudgeplot.py plot -o SRR7718974 -t "SRR7718974" -k 21 -n 30 -q 0.99 SRR7718974files_L32_U3700_coverages.tsv

the prediction changes to tetraploid. SRR7718974_smudgeplot_n30

Which of these 2 scenarios do you think is more likely?

Thanks in advance for helping out!]

Diya

KamilSJaron commented 2 years ago

Hi Diya,

I think the second smudgeplot is "correctly labelling" the smudges - i.e. the smudge you see is indeed tetraploid smudge of AABB type. HOWEVER, this can happen if the heterozygosity is really really low - so low that the singal is dominated by paralogs in the genome. The whole idea of smudgeplot is based on at least some detectable heterozygosity and these homozygous genomes are really sneaky.

We have added a flag --homozygous for these cases, so you can get the labels right while still having a sane ploidy prediction (please double check that's how the flag is called, I am fishing from the back of my mind :-) ).

smudgeplot.py plot -o SRR7718974 -t "SRR7718974" -k 21 -q 0.99 SRR7718974files_L32_U3700_coverages.tsv --homozygous

It's the same (but more extreme) example of the same phenomenon as we discussed in the strawberry tutorail (check the wiki). I think you might see more het. loci if you lower your L to ~25, but at the same time, i don't think there is anything new you can learn from that smudgeplot. I bet my shoes you got the genomescope model right.

Hope this helps!

diyasen2021 commented 2 years ago

Many thanks again for the reply.

So I ran smudgeplot.py plot -o SRR7718974 -t "SRR7718974" -k 21 -q 0.99 SRR7718974files_L32_U3700_coverages.tsv --homozygous

and ended up getting

SRR7718974_smudgeplot

Therefore much like the previous genome, this one too has the AABB haplotype structure but is a homozygous diploid! Very cool and gives us something to look further into as far as repeats/duplications go.

KamilSJaron commented 2 years ago

Great, I am glad I could help and even more that you find the outcome interesting because that is something I think of quite often myself :-) So please, keep in touch, I am interested in figuring out good way to analyse (find a good interpretation) these "higher ploidy smudges".

diyasen2021 commented 2 years ago

Well, I was going to leave you alone until I started messing around some more and found something else that doesnt make sense to me from a problem i posted on github a while ago. Here is the original post https://github.com/KamilSJaron/smudgeplot/issues/99.

This genome was originally labelled as a diploid and then later changed to tetraploid after adjusting 1n coverage (image below). So in this figure we see a lot of kmers of 2n coverage pairing with each other. My question is how do the 2n kmers pair with each other? Shouldnt the 2n peak be comprised of single copy kmers? I should have spotted this right then but missed it somehow. Apologies for reopening an old issue.

As always, thanks for your time. Hope I'm not bothering you too much.

[image: image.png]

[image: image.png]

Diya

On Sun, Mar 20, 2022 at 11:36 PM Kamil S. Jaron @.***> wrote:

Great, I am glad I could help and even more that you find the outcome interesting because that is something I think of quite often myself :-) So please, keep in touch, I am interested in figuring out good way to analyse (find a good interpretation) these "higher ploidy smudges".

— Reply to this email directly, view it on GitHub https://github.com/KamilSJaron/smudgeplot/issues/99#issuecomment-1073220837, or unsubscribe https://github.com/notifications/unsubscribe-auth/AU2LPXD35MSWMX63ZEHBU33VA35SPANCNFSM5QYKLWZQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you authored the thread.Message ID: @.***>

-- Diya Sen, Ph.D. CSIR Senior Research Associate Computational Genomics Lab CSIR - Indian Institute of Chemical Biology

KamilSJaron commented 2 years ago

Hi Diya,

no worries :-). I don't see the images, but there are things I can comment on.

So in this figure we see a lot of kmers of 2n coverage pairing with each other. My question is how do the 2n kmers pair with each other?

Smudgeplot finds all the k-mers that are distant by exactly 1 nt and form a unique pair.

Shouldnt the 2n peak be comprised of single copy kmers?

Yes, each of those 2n k-mers are is (likely) homozygous and unique in the genome. They represent "nearly" the same sequences and I suspect most of those k-mer pairs will paralogs. However, so far no one really explored where these k-mer pairs come from in the genome (that is yet something we need to do).

diyasen2021 commented 2 years ago

If they are paralogs, wouldn't their coverage be > 2n?

On Sun, Apr 10, 2022 at 7:29 AM Kamil S. Jaron @.***> wrote:

Hi Diya,

no worries :-). I don't see the images, but there are things I can comment on.

So in this figure we see a lot of kmers of 2n coverage pairing with each other. My question is how do the 2n kmers pair with each other?

Smudgeplot finds all the k-mers that are distant by exactly 1 nt and form a unique pair.

Shouldnt the 2n peak be comprised of single copy kmers?

Yes, each of those 2n k-mers are is (likely) homozygous and unique in the genome. They represent "nearly" the same sequences and I suspect most of those k-mer pairs will paralogs. However, so far no one really explored where these k-mer pairs come from in the genome (that is yet something we need to do).

— Reply to this email directly, view it on GitHub https://github.com/KamilSJaron/smudgeplot/issues/99#issuecomment-1094111723, or unsubscribe https://github.com/notifications/unsubscribe-auth/AU2LPXHADRBQIJ6JXPF5MHLVEHK75ANCNFSM5QYKLWZQ . You are receiving this because you authored the thread.Message ID: @.***>

-- Diya Sen, Ph.D. CSIR Senior Research Associate Computational Genomics Lab CSIR - Indian Institute of Chemical Biology

KamilSJaron commented 2 years ago

Paralogs do not have to be identical sequences, right?

So I expect those k-mer often do have a common ancestor in (relatively) recent history because they are quite similar to each other, but they are not exactly the same - they are (by definition of smudgeplot k-mer pairs) different in exactly one 1nt.

More visually, imagine situation like this

---k1---------k2----
---k1---------k2----

where k1 and k2 are two k-mers that are exactly 1 nt divergent from each other and - is some other sequence we don't care about. Both are homozygous, probably paralogs, but not the same. So each of them has ~2n coverage and together they will contribute to AABB smudge. Does it make sense?

Why would you need the coverage to be >2n?

diyasen2021 commented 2 years ago

Hi Kamil,

Yes, it makes perfect sense!

Thanks for the explanation.

Diya

On Mon, Apr 11, 2022 at 1:10 AM Kamil S. Jaron @.***> wrote:

Reopened #99 https://github.com/KamilSJaron/smudgeplot/issues/99.

— Reply to this email directly, view it on GitHub https://github.com/KamilSJaron/smudgeplot/issues/99#event-6403909740, or unsubscribe https://github.com/notifications/unsubscribe-auth/AU2LPXBFFKS3YSV2MVAW4KDVELHMNANCNFSM5QYKLWZQ . You are receiving this because you authored the thread.Message ID: @.***>

-- Diya Sen, Ph.D. CSIR Senior Research Associate Computational Genomics Lab CSIR - Indian Institute of Chemical Biology

KamilSJaron commented 2 years ago

Cool