Plukenetia volubilis genome size mistery

sivico26 commented 4 years ago

Hi @KamilSJaron, @tbenavi1!

I used both Smudgeplot and Genomescope2 for the early stages of my assembly project last year and you have been really helpful(issues #36 and #45 for some context) for me.

Currently, I already did genome assembly and I am doing some post-assembly of the genome, I would say it is almost finished (at the possible extent that our data allows).

However, a mystery emerged after assembly, that I thought would be solved after post-assembly stages, that is seriously drilling my head.

As I had a troublesome genome in hand I explored several approaches. In the end, I error corrected using Brownie Corrector and Karect and then assembled with several assemblers. As I had Nanopore data as well, I also tried Flye as a long-read assembler and hybrid assemblies.

Here is a brief of the final workflow:

assembly_workflow

And the contiguity of the resulting genomes were these (generated with abyss-fac -s 2000):

n	n:2000	L50	min	N75	N50	N25	E-size	max	sum	name
669643	25539	3683	2000	19958	40670	74530	56920	464213	537.3e6	abyss_110.fasta
18738	14216	3184	2000	19616	31064	45848	35529	156402	297.7e6	flye_polished.fasta
13713	10722	1428	2001	55625	112275	213463	168686	1627529	594.3e6	masurca.fasta
775739	29681	4340	2000	17458	35558	62757	47630	351667	538.6e6	megahit.fasta
290756	22588	3169	2000	24016	48992	87791	66992	494742	551.6e6	minia.fasta
688159	20551	1898	2000	37869	88042	169274	123032	750764	616.9e6	platanus.fasta
223514	15277	1719	2000	45373	94114	177120	129199	750764	584.3e6	Platanus_curated.fasta

As you can see, all the short read and hybrid assemblies consistently report genome sizes of +500 Mb. The only exception Is the Flye assembly which originally was around 180 Mb and after polishing with the short reads the assembly size went up to ~ 300 Mb. This is not a surprise as our nanopore data is really bad (low coverage and an N50 of just 2kb), so this was not a viable assembly either way.

Regarding gene completeness, estimated with Busco, all the short read and hybrid assemblies reported +95% completeness, with most of them single-copy (+90%). The Flye assembly was the only one incomplete (~50%).

With these results, I wondered why they were so different from the original estimation of 200 Mb by Genomescope2 (In issue #36 we concluded 250 Mb but it was actually inflated by organellar DNA) so consistently.

At first glance, I thought I had several ploidies in the assemblies, but this is not consistent with the Busco reports, as there is a low proportion of orthologs duplicated (~4-5%). So if the genome is redundant, at least it is not evenly redundant. There should be some regions that are more overrepresented than others.

The next thing that I thought was that maybe some complex repeats gave troubles to the assemblers and, not being able to solve them, the assemblers returned several fragments for those regions generating the redundancy. So I proceeded to quantify repeats.

I used both Repeatmasker and REPET-TEdenovo, with Repbase as a reference database, to search repeats in the masurca assembly (as it was the best).

RepeatMasker reported less than 9.7% of repeats (~58 Mb), while REPET was more conservative with just 40Mb estimated as repeats which is like 6.7%. This was another surprise, as I expected a higher repeat content according to the aforementioned hypothesis. Furthermore, this proportion contrasts a lot with the GenomeScope2 estimate of 17.4%.

At that point, I ran out of ideas but then I found FinichseSC, which is explained here, and essentially help to resolve repeats using long reads, but it also filters out embedded contigs. I used as first post-assembly and got really good results in contiguity, and it did reduce the genome size from 594 Mb to 568 Mb but did not solve the mystery.

The last thing that I tried was to do some phasing with HaploMerger2 , thinking that the genome might have several haplotypes there causing the redundancy. However, again, it did reduce the genome size to 533 Mb but is not a significant reduction and the mystery holds.

So finally I apologize for the extension of the issue and if it may go beyond your scope as developers of these amazing tools, but I would really like to know what you think about this genomic mystery and the differences between estimation and assembly in both genome size and proportion of repeats.

Thanks in advance, Sivico

KamilSJaron commented 4 years ago

Sivico, that's quite a story. Thanks for sharing.

First, I would very much expect an assembly ~500Mbp. The divergence of A and B haplotypes seem to be quite high according to our previous genomescope estimates (~10%). I would expect the A and B to assemble separately. But then I am quite intrigued by the BUSCO, I would indeed expect a much higher fraction of duplicated genes.

I would recommend you to explore what they have done in rotifers or root knot nematodes . Taking all the annotated genes, blasting them against the genome and considering always the secondary hit for each transcript. If you look then on a histogram of similarities, you can get an idea if there is peak around the expected divergence (~10%).

I will think if I can figure something better.

sivico26 commented 4 years ago

Thanks for your quick reply Kamil!

Can you please elaborate on why, given the high heterozygosity, did you expected a ~500 Mb genome size? It is not clear to me. If that is expected, is it predictable as well? Would not Genomescope2 take it into account?

I will take a look at those papers and make your suggested analysis once we make the annotation. I will keep you posted about this if it is of interest.

Regards

KamilSJaron commented 4 years ago

It depends a lot on the setting of an assembled, but for >3% of heterozygosity, one certainly should expect at least some uncollapsed haplotypes and higher heterozygosity (divergence of haplotypes) one has more uncollapsed haplotypes would be expected. In your case, we predicted AABB genome, where the monoploid genome is ~250M, but A <-> B is ~10%. That's a lot, if we got everything right*, then I would expect that the assembly will contain most of the A and B genomes. As they are 250M each, the totally assembly span could be close to 500.

The publications I linked have genomes is substantial divergences - ~8% for root knots, ~25% for rotifers, in both cases they were able to identify homologous regions within the assembly, corresponding to the divergent genomic copies (i.e. the A <-> B pairs).

*the profiling we made could be wrong, but with all honestly the most straightforward explanation of your data. I am looking now at alternative models.

KamilSJaron commented 4 years ago

Given the smudgeplot and the kmer histogram, here is the list of all possible genome shapes I can think of sorted by "what I think" is more likely. The first two are both viable, the thrid one I see more farfetched:

AABB

H: The genome is tetraploid, with two relatively distinct monoploid genomes (~10% of divergence) and 250 Mbp each

Model (the one we already had): AABB

Expectations/Tests:

[x] Genome assembly will be ~500, featuring nearly complete sequences of A and B subgenomes
[ ] ~BUSCO should feature high proportion of duplicates~
[ ] within genome assembly the homeo/ohnologous sequences should be identifiable - easiest found using secondary hits of genes when blasted against the genome. The histogram of sequence similarities of the secondary hits is then expected to be unimodel, with the mode around the divergence of the haplotypes (10%)

AB

H: The genome is diploid, featuring an excessive amount of duplications, similar to one of the strawberries genomes.

Model:

Pvol_di_plot

Expectations/Tests:

[x] Genome assembly will be ~500, featuring nearly complete sequences of A and B subgenomes
[x] BUSCO should feature low proportion of duplicates
[ ] the genome assembly should not feature any genome-wide homeo/ohnologous structures. Histograms of secondary hits (paralogs) is not expected to have any particular modus, it can be spread
[ ] ~the heterozygosity should be very low~ in the case of strawberry, the reason paralogs dominate the signal is because of very low heterozygosity, this genome is expected to be fairly heterozygous (0.678%), therefore the dominance of paralogs is rather surprising

Note: This model is also quite consistent with Mercurialis (haploid genome ~600Mbp), another euphorbiaceae genome I happened to have my hands on. But I don't know the system well enough to say how indicative is that.

AAAAAA

H: Hexaploid genome that is largely homozygous with haploid genome size ~500M.

Model:

Pvol_hex_3_plot

Expectations/Tests:

[x] Genome assembly will be ~500
[x] BUSCO should feature low proportion of duplicates
[ ] the genome assembly should not feature any genome-wide homeo/ohnologous structures. Histograms of secondary hits (paralogs) is not expected to have any particular modus, it can be spread
[ ] when reads mapped back to the genome, at least some sites with more than two alleles should be found. Some of the bialleic loci will feature non-even copy number (5/1, 4/2), which can be tested using ratio of coverages supporting variants in SNP calling
[ ] DNA content of the cell should be ~3Gbp in total (6 500M; in contrast to the other models that predict ~1Gbp (2500 in diploid or 4*250 in tetraploid)). Are there any flow cytometry measurements available? I gave a quick search in the database of plant genome sizes, but could not find it.

This model has no support in the smudgeplot analysis. We would expect much richer genome structure if it's a hexaploid with haploid coverage 15x, but it's not impossible that the genome is so largerly homozygous that we just don't have any signal over there - but as I find. I find it hard to believe. I am including this model for completeness.

I hope this will provide you a finite set of models you can consider and disentangle downstream. Let me know if it makes sense to you.

sivico26 commented 4 years ago

Wow, I did not expect all those models, thank you very much Kamil for taking your time thinking around this, really.

I am not sure if I mentioned this before, I shall do it now. Genomescope2 website has a footnote saying organellar DNA can confuse the model, and that's why I reestimated using max kmer coverage. Originally, I did it with 500x and tetraploid model:

AABB - 500x, which reduces the genome size to 197 Mb.

So I tried to reestimate with your hypothesis to take this into account but I noticed that for the hexaploid model it interfered a bit, so I change it to 1000x. These are the results:

The only one I could adjust was the hexaploid one. I guess you used a local version instead of the webpage? Still, for the diploid and tetraploid cases, the genome sizes get a little bit shy of the assembly size.

However, before going there, do you think this correction in the estimation is necessary and/or changes the feasibility of the hypothesis?

On the other hand, considering the difficulty that the assembly represented, I would be really surprised if the genome ended being homozygous.

But well I think the best thing to do is to make the similarity analysis you suggested after we get the annotation done.

Thanks again for your help Kamil!

KamilSJaron commented 4 years ago

That's something I should have thought before, but I didnt!!!

Yes, organelles are certainly artificially inflating genome size estimate, BUT there can plenty of super repetitive sequences that are going to be ignored if you set up a threshold for the kmers. I have seen only a handful of examples where genome est was too big, but I have seen plenty of examples where a huge chunk was missing because of that. For example, look at this evaluation in Rooibos or this blog about superrepetitivness in the marbled crayfish. Could you please redo the kmer histogram so it contains all the kmer counts? You can set up the max number to something really really high (like 1000000000) and then filter out all rows that have 0 coverage to make the file smaller (I think something like awk '!/\t0/{ }' should do the job (didn't test it)). I suspect this could resolve our genome size problems...

Sometimes I wonder where the role of a developer starts and end Simón and if it's smart to be so keen on other people's data... But I don't know I just really like genomes, I am true genome-naturalist.

sivico26 commented 4 years ago

I did a recounting with KMC as you suggested using the instructions provided at the end of the marbled crayfish. Here it is the histogram:

kmer_k21_raw_full.hist.txt

Still, I do not know how to make that KMC_tools do not trim the hist at 10000, I do not find the flag that supposedly changes that default. If you can instruct me in this regard, it would be great. However, I did another Genomescope with this new histogram and the result is different, so it had to change something:

New_genomescope_tetraploid

I just find amazing to find people like you Kamil, that their passion leads them to engage in this way for some subjects.

KamilSJaron commented 4 years ago

I am sorry, that's the -cx argument and there is a mistake in the blogpost (it's fixed now).

kmc -k21 -t16 -m64 -ci1 -cs500000000 @FILES kmer_counts tmp
kmc_tools transform kmer_counts histogram kmer_k21_full_with_zeros.hist -cx500000000

but then it's going to be big. You can make it smaller by discarding all the coverages with no kmers (i.e. where the second column is 0) using awk

awk '{ if( $2 != 0 ){ print $0 } }' kmer_k21_full_with_zeros.hist > kmer_k21_full.hist

kmer_k21_full.hist is then the file we are interested in.

sivico26 commented 4 years ago

Dear Kamil, here is the full histogram:

new_kmer_k21_full.hist.txt

Actually I feel a bit awkward for not finding the appropriate option myself. I did again a preliminary genomescope plot and it really changed the genome size and repeat estimation. The dup parameter also increased almost by 1 unit, but I do not know what this metric actually means so I found it hard to interpret.

I also find difficult to differentiate the contribution of the organellar and may add uncertainty to our estimation. I already assembled the chloroplast genome and we also have a draft of the mitochondrion genome. Do you think it would be useful to map the reads to these and count again only with unmapped reads? I remember we did have a bad experience in #36 but maybe I made a mistake at some point in the process. Do you think it is worth to try?

KamilSJaron commented 4 years ago

Alright, I made a small investigation. When I looked at your results (the one you linked) I was this bump on the log log histogram, and I thought that is there is an obvious chloroplast in your sequencing, it will be this one. So I wanted to figure how much this affects the genome size estimate. This is a fig zoomed on the bumb I am talking about

So, I removed the bumb using a linear model from a different part of the kmer histogram (I will paste code by the end of this post)

This is WITH the bumb (yours):

your_genomescope

And this is WITHOUT the bumb (what I made now):

Pvol_tetra_wo_chlor_no_prior_plot

and here is the corresponding diploid model

Pvol_di_wo_chlor_plot

So, there is very small decrease of genome size estimate. Suggesting that if I am right about this being chloroplast, it does not affect the genome size estimate that much (~12Mbp for tetraploid, ~24M for diploid). Suggesting that the other repetitive kmers affect it much more (extra 90/180 Mbp). Sort of with agreement with the Roiboss study linked above. My guess would be some recent repeat expansion, hence the extreme repetition conservation (i.e. few sequences many time).

Fine, but what does it mean for the assembly?

The long post about different scenarios still kind of applies, but now I would stay support better the teraploid model, because even the genome structure is AABB, with substantial A <-> B divergence, there will be locally similar sequences that might get collapsed, especial in a genome that has such repetitive content. In the diploid model, you would be missing quite a big part of the genome, but most likely the repetitive one, which is not that suprising, nor that tragic if this was not the aim of your study in the first place.

I guess there is only very little I can do now. I would suggest two following analyses:

KAT to check which kmers are how many times in your assembly (it can look for instance like this)
the secondary blast hit analysis I recommended in this post.

KamilSJaron commented 4 years ago

The code:

kmer_tab = read.table('new_kmer_k21_full.hist.txt', col.names = c('cov', 'freq'))

# From 2800 to 5600 must be smoothed
plot(log10(kmer_tab$cov), log10(kmer_tab$freq), ylim = c(2, 4.5), xlim = c(2.45, 3.75))

# 3 - 3.3 training
training <- kmer_tab[log10(kmer_tab$cov) > 3 & log10(kmer_tab$cov) < 3.3,]
training$cov <- log10(training$cov)
training$freq <- log10(training$freq)
chloroplast <- kmer_tab[kmer_tab$cov > 2800 & kmer_tab$cov < 5600,]

trained_model <- lm(freq ~ cov, data = training)
predicted_values <- predict(trained_model, data.frame(cov = log10(chloroplast$cov)))

to_plot <- kmer_tab[log10(kmer_tab$cov) > 3 & kmer_tab$cov < 5600,]
plot(log10(to_plot$freq) ~ log10(to_plot$cov))
points(log10(chloroplast$cov) ~ predicted_values, col = 'red')
# abline(trained_model)

kmer_tab[kmer_tab$cov > 2800 & kmer_tab$cov < 5600, 'freq'] <- round(10^predicted_values)

plot(log10(kmer_tab$cov), log10(kmer_tab$freq), ylim = c(2, 4.5), xlim = c(2.45, 3.75))

write.table(kmer_tab, 'new_kmer_k21_without_putative_chloroplast.hist', col.names = F, row.names = F, sep = '\t')

KamilSJaron commented 3 years ago

closing due to inactivity, if anything feel free to reopen the issue

KamilSJaron / smudgeplot

Plukenetia volubilis genome size mistery #50