Trouble with interpretation

diyasen2021 commented 2 years ago

Hi Kamil,

I have run smudgeplot on my plant pathogenic species and the plots for all but 1 isolate turned out as diploid, as expected. The 1 that did not fit the expectation was predicted to be a pentaploid. Here Is the command. The L and U coverages were estimated by smudgeplot.

L=$(smudgeplot.py cutoff 3813_k21.hist L)
U=$(smudgeplot.py cutoff 3813_k21.hist U)
kmc_tools transform 3813_counts -ci84 -cx10000 dump -s 3813_L84_U10000.dump
smudgeplot.py plot -o 3813 -t "3813" -q 0.99 3813_L84_U10000_coverages.tsv

it look like this: 3813_L84_U10000_smudgeplot

Next, I ran genomescope to get the 1n coverage and here is the output: 3813-genomescope

So, I changed L from 84 to 200 and reran smudgeplot.

kmc_tools transform 3813_counts -ci200 -cx10000 dump -s 3813_L84_U10000.dump
smudgeplot.py plot -o 3813 -t "3813" -q 0.99 3813_L84_U10000_coverages.tsv

This was the result: 3813_L200_U10000_smudgeplot

So what is going on here? Could you please help me understand.

Many thanks,

Diya

KamilSJaron commented 2 years ago

Hi Diya,

This is a cool genome you got here. Good job on the investigation.

I see 2 possible explanations:

Explanation 1: the 1n coverage is ~225, the genome is cca 38M, extremely (or completely) homozygous and quite a lot of duplications. - this is what GenomScope fit. Then the explanation of Smudgeplot thinking the 1n coverage is 455 is that there are simply no heterozygous loci - there are nearly no 225x kmers and probable none (or nearly none) is exactly one SNP from the other - meaning, there is no real heterozygous AB smudge and the algorithm mislabeled the smudges (so what is labeled as AB would be in fast AABB, etc).
Explanation 2: the 1n coverage ~450, the genome will be probably less than 20M and it will be extremely heterozygous. That would explain very well the AB smudge and the prediction. I also suspect that if you chose a smaller U or simply chose smaller -q parameter, you will see the AB smudge is actually a nicely looking one (judging by the first smudgeplot you made). If this is the case, it means the GenomeScope did not fit the 1n coverage right and the model should be redone.

I suspect that if you assemble the reads, you will get for sure ~40M of assembled sequences, with this heterozygosity it would be difficult not to assemble the haplotypes separately. I would try to think of some sort of downstream analysis that would help you disentangle the two.

Let me know if it makes sense :-) and good luck!

diyasen2021 commented 2 years ago

Hi Kamil,

Thanks for the explanation. Based on what we know from genome size and homozygosity of the reference genome, explanation no 1. is more likely. Does this mean that the genome is diploid but has experienced recent duplications? Could this be caused by unequal number of chromosomes (chromosome duplication/loss)?

Diya

KamilSJaron commented 2 years ago

I think that is very possible (and I guess even quite likely). Basically altough the smudgeplots you got are a bit ugly, they do inform you that there are very few (if any) kmer pairs where both come from the 1n coverage peak (assuming scenario 1), while there are tons of k-mer pairs that consist of two k-mers from the 2n peak. That can happen in diploid genomes if the heterozygosity is extremely low, for example in one of the diploid strawberries. So that would be my interpretation.

Would you like to share me the k-mer pair file: 3813_L84_U10000_coverages.tsv? I think there are a few k-mer down there that confused the 1n coverage estimate in the first plot. I suspect that if you specify your coverage guess (225), the smudgeplot might get nicer. I guess I would like smudgeplot to work for as nice data as yours so I would like to see if I can make some adjustments to the algorithm to get the default one look nice right away.

diyasen2021 commented 2 years ago

Hi Kamil, How would kmers from the 2nd peak produce an AB smudge if they are from homozygous sites? Do you mean that kmers from multiple loci are pairing with each other because of recent duplications, for example from repeat regions?

Here is the coverage file. 3813_L84_U10000_coverages.txt

Thanks again for helping make sense of this.

Diya

KamilSJaron commented 2 years ago

Hi Diya,

I played a tiny bit with the file. I must say, I don't belive I see even a slight suggestion of heterozygous k-mer pairs in the dataset assuming the scenario 1 (the 1n coverage is ~225).

I first run smudgeplot while forcing there the suspected coverage

smudgeplot.py plot 3813_L84_U10000_coverages.txt -L 84 -k 21 -n 224 -q 0.99 -o 3813_q99

3813_q99_smudgeplot_log10

That worked quite well for making sure that the peak annotations fit the smudges really well with this haploid coverage (not that we would suspect otherwise). Anyway, the log plot also shows why smudgpelot is a bit confused here - there are quite a few k-mers all over the place and smudgeplot just have hard times identifying the important peak because of all the clutter with lot lower coverage (I those will be likely just error k-mers, which makes me wonder how was the library made - is it an amplified sample). Anyway, I actually think this one is kind of alright as it is.

Then I tried one more thing, I tried to remove the low coverage k-mers that confuse the algorithm. It is not exactly right (would be better if the kmer dump we searched for k-mer pair for was with L = 150, but I think your L = 200 would work just fine too). Anyway, I did it like this:

cat 3813_L84_U10000_coverages.txt | awk '{ if( $1 > 150 ){ print $0} }' > 3813_L150_U10000_coverages.txt

And then run smudgeplot why disclaiming the genome is completely homozygous (I also set the filtering quantile a lot lower, but I don't think that changes that much). So running is as

smudgeplot.py plot 3813_L150_U10000_coverages.txt -L 150 -k 21 -q 0.9 -o 3813_q9_L150 --homozygous

We actually get a really nice smudgeplot

3813_q9_L150_smudgeplot

3813_q9_L150_smudgeplot_log10

which is very likely with "the right" annotations of smudges (again, assuming scenario 1).

So I think our work here is done, closing the issue for now, if anything, please don't hesitate to reopen it. Hope this helps.

diyasen2021 commented 2 years ago

Hi Kamil,

Many thanks for helping out with the analysis. I do have some follow up questions for you.

What does the AABB smudge indicate?
Why is the coverage 4n and not 2n?

This genome had insanely high coverage in comparison to the others. Could that have some connection to the kmer distribution we are seeing? Is there a possibility of contamination?

Cheers, Diya

On Mon, Feb 14, 2022 at 10:54 AM Kamil S. Jaron @.***> wrote:

Closed #96 https://github.com/KamilSJaron/smudgeplot/issues/96.

— Reply to this email directly, view it on GitHub https://github.com/KamilSJaron/smudgeplot/issues/96#event-6062861026, or unsubscribe https://github.com/notifications/unsubscribe-auth/AU2LPXA4HRGLWMKDJURADYDU3ASC5ANCNFSM5NX2HVYA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you authored the thread.Message ID: @.***>

-- Diya Sen, Ph.D. CSIR Senior Research Associate Computational Genomics Lab CSIR - Indian Institute of Chemical Biology

KamilSJaron commented 2 years ago

If the genome is diploid then AABB correspond to k-mer from two distinct loci in the genome that happen to be very similar (I like to call them paralogs, but it's hard to say what they really are, besides nearly identical to each other in all but one nucleotide).

As a matter of fact, as you genome is nearly or completely homozygous (I am still assuming the scenario 1 we discussed above applies), therefore practically all the k-mers that will be similar to each other will be these "paralogs".

Why is the coverage 4n? Well, if 1n is ~224, and we see a smudge formed of a k-mer pair that has a sum of coverage ~900x and coverage ratio of ~0.5, it implies that it is a 4n smudge of AABB structure.

It is a beautiful dataset - in sense that the coverage is really luxurious. I don't think there is a problem with too much contamination (those make the plot overall a bit noiser if anything, but smudgeplot actually is relatively robust against that). I think the only question is if there is a good explanation of the extremly homozygous state of the genome. Perhaps if the species can reproduce via parthenogenesis, some forms can cause big loses of heterozygosity, or if there was some local imbreeding or if it's a lab line... It's hard to guess not knowing anything about the sample... But if these low levels of heterozygosity are possible, then I don't think there is a real puzzle in this data. It's quite clear what you got.

If you want to make sure, try to map reads back to the assembled genome and look how many potential heterozygous sites you see. My prediction is nearly nothing and what you will see will have spurious coverage profiles. If you would be interested in that kind of exploration, you can get inspired in our study of stick insects: https://www.biorxiv.org/content/10.1101/2020.11.20.391540v4 I have spent a lot of effort to detect some real heterzygous variants in genomes that had none and the smudgeplot also looked quite similar - the lowest detectable smudge was AABB.

diyasen2021 commented 2 years ago

Thanks for the detailed explanation. That makes it so much clearer. Given your explanation, I came up with 2 further possible explanations (in addition to the presence of multiple paralogs).

This species is known to sometimes show heterokaryosis, i.e. the coexistence of nucleii from distinct parental lines. I wonder if the paralogs are then coming from homologous sites on different nuclei. or
The species is a homotetraploid., ie. it has undergone a recent whole genome duplication from a homodiploid and does not have many heterozygous sites.

It is an interesting genome and your tools is great. It takes a bit of time to properly understand the output, but thats only because my understanding of the subject is quite poor.

Thanks for the help too.

Cheers, Diya

On Wed, Feb 16, 2022 at 2:08 PM Kamil S. Jaron @.***> wrote:

If the genome is diploid then AABB correspond to k-mer from two distinct loci in the genome that happen to be very similar (I like to call them paralogs, but it's hard to say what they really are, besides nearly identical to each other in all but one nucleotide).

As a matter of fact, as you genome is nearly or completely homozygous (I am still assuming the scenario 1 we discussed above applies), therefore practically all the k-mers that will be similar to each other will be these "paralogs".

Why is the coverage 4n? Well, if 1n is ~224, and we see a smudge formed of a k-mer pair that has a sum of coverage ~900x and coverage ratio of ~0.5, it implies that it is a 4n smudge of AABB structure.

It is a beautiful dataset - in sense that the coverage is really luxurious. I don't think there is a problem with too much contamination (those make the plot overall a bit noiser if anything, but smudgeplot actually is relatively robust against that). I think the only question is if there is a good explanation of the extremly homozygous state of the genome. Perhaps if the species can reproduce via parthenogenesis, some forms can cause big loses of heterozygosity, or if there was some local imbreeding or if it's a lab line... It's hard to guess not knowing anything about the sample... But if these low levels of heterozygosity are possible, then I don't think there is a real puzzle in this data. It's quite clear what you got.

If you want to make sure, try to map reads back to the assembled genome and look how many potential heterozygous sites you see. My prediction is nearly nothing and what you will see will have spurious coverage profiles. If you would be interested in that kind of exploration, you can get inspired in our study of stick insects: https://www.biorxiv.org/content/10.1101/2020.11.20.391540v4 I have spent a lot of effort to detect some real heterzygous variants in genomes that had none and the smudgeplot also looked quite similar - the lowest detectable smudge was AABB.

— Reply to this email directly, view it on GitHub https://github.com/KamilSJaron/smudgeplot/issues/96#issuecomment-1040975022, or unsubscribe https://github.com/notifications/unsubscribe-auth/AU2LPXB5XP2J6XEK4HALTGLU3L2KBANCNFSM5NX2HVYA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you authored the thread.Message ID: @.***>

-- Diya Sen, Ph.D. CSIR Senior Research Associate Computational Genomics Lab CSIR - Indian Institute of Chemical Biology

KamilSJaron commented 2 years ago

No worries.

Now just briefly, I think the heterokaryosis is unlikely - your peaks are very clearly spaced in even manner. That suggests that each cell of the tissue that was sequenced had the same karyotypes (you can check our other preprint about the signature of sequencing multiple tissues with different karyotypes, the setting a bit different, but TLDR evenly spaced peaks means it's usually a homogenious tissue). HOWEVER, genomes just have paralogs, having some homologous regions within a regular diploid genome IS normal. if there is anything strange on your genome, it's the probably homozygosity I would say.

diyasen2021 commented 2 years ago

Very interesting point about the lack of heterozygosity. This exactly matches what I'm finding in SNP analysis of these genomes.

Thanks again for the explanation.

Cheers, Diya

On Thu, Feb 17, 2022 at 7:55 AM Kamil S. Jaron @.***> wrote:

No worries.

Now just briefly, I think the heterokaryosis is unlikely - your peaks are very clearly spaces in even manner. That suggests that each cell of the tissue that was sequenced had the same karyotypes (you can check our other preprint https://www.biorxiv.org/content/10.1101/2021.11.12.468426v1 about the signature of sequencing multiple tissues with different karyotypes, the setting a bit different, but TLDR evenly spaced peaks means it's usually a homogenious tissue). HOWEVER, genomes just have paralogs, having some homologous regions within a regular diploid genome IS normal. if there is anything strange on your genome, it's the probably homozygosity I would say.

— Reply to this email directly, view it on GitHub https://github.com/KamilSJaron/smudgeplot/issues/96#issuecomment-1042031932, or unsubscribe https://github.com/notifications/unsubscribe-auth/AU2LPXCNMX5MN7G5PNC2OSDU3PXIZANCNFSM5NX2HVYA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you authored the thread.Message ID: @.***>

-- Diya Sen, Ph.D. CSIR Senior Research Associate Computational Genomics Lab CSIR - Indian Institute of Chemical Biology

KamilSJaron / smudgeplot

Trouble with interpretation #96