alekseyzimin / masurca

GNU General Public License v3.0

Double genome size, ploidy 1 #145

Closed: elbourret closed this issue 4 years ago

elbourret commented 4 years ago

Masurca estimates a genome size double the expected number, but Ploidy 1. Is this fine?

I'm assembling a highly heterozygous plant genome using 120X Illumina PE 150 bp, and 98X Nanopore data (N50 17000 bp).

I measured a 454 Mb haploid size with flow-cytometry. GenomeScope estimates 355 Mb, with 0.5% heterozygosity and 1.4% duplicated. The species is paleo-tetraploid (polyploidy happened 10-20 Mya), but largely diploidized (chromosome number doubled, but genome size similar to ancestral size).

Interestingly, Masurca estimated a genome size of 820 Mb, but Ploidy 1.

I'm not sure how to interpret this. Shouldn't the expected behavior be to estimate the correct genome size, and Ploidy 2?

I'm also surprised because the heterozygosity estimated by GenomeScope was not that high (only 0.5%).

Would it be better for me to manually set the Ploidy as 2, and thus "force" Masurca to estimate the correct genome size, or should I leave it run as is?
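For orientation, the three size figures quoted in this comment can be compared directly. This is only arithmetic on the numbers given above; the variable and function names are made up for the example.

```python
# Genome size figures quoted in the comment, in megabases (Mb).
FLOW_CYTOMETRY_MB = 454   # measured haploid size
GENOMESCOPE_MB = 355      # GenomeScope estimate
MASURCA_MB = 820          # MaSuRCA estimate (reported with Ploidy 1)


def size_ratio(estimate_mb: float, reference_mb: float) -> float:
    """Return how many times larger `estimate_mb` is than `reference_mb`."""
    return estimate_mb / reference_mb


if __name__ == "__main__":
    print(f"MaSuRCA / flow cytometry:     {size_ratio(MASURCA_MB, FLOW_CYTOMETRY_MB):.2f}")      # 1.81
    print(f"MaSuRCA / GenomeScope:        {size_ratio(MASURCA_MB, GENOMESCOPE_MB):.2f}")         # 2.31
    print(f"GenomeScope / flow cytometry: {size_ratio(GENOMESCOPE_MB, FLOW_CYTOMETRY_MB):.2f}")  # 0.78
```

So "double" here means roughly 1.8x the flow-cytometry size, or about 2.3x the GenomeScope estimate.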

jleluyer commented 4 years ago

Hi,

I observed the same pattern (Illumina PE 150 bp and PacBio 45X, ~9 kb) for a highly heterozygous invertebrate genome: PLOIDY=1 and nearly twice the expected genome size. Did you end up forcing PLOIDY=2?

Best,

elbourret commented 4 years ago

I'm almost finished assembling with PLOIDY=1, still trying to fix some unrelated problem with Flye. I plan to try assembling with PLOIDY=2 on the same dataset and to report the results I get with each setting. It should take 2 or 3 weeks.

alekseyzimin commented 4 years ago

If the genome is very heterozygous, it looks to the assembler like one genome with double the size, rather than two similar copies of the same genome.
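A toy illustration of this effect, and explicitly not how MaSuRCA estimates genome size: the sketch below builds a random "haplotype", derives a second one by substituting bases at a given rate, and compares the number of distinct k-mers in both haplotypes together against one haplotype alone. All function names, the sequence length, k = 21, and the substitution rates are assumptions chosen for the example.

```python
import random


def random_genome(length: int, rng: random.Random) -> str:
    """Generate a random DNA sequence of the given length."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def mutate(seq: str, rate: float, rng: random.Random) -> str:
    """Substitute each base with a different base with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def distinct_kmers(seqs, k: int) -> int:
    """Count distinct k-mers across all sequences in `seqs`."""
    kmers = set()
    for s in seqs:
        for i in range(len(s) - k + 1):
            kmers.add(s[i:i + k])
    return len(kmers)


def apparent_size_ratio(het_rate: float, length: int = 20000,
                        k: int = 21, seed: int = 1) -> float:
    """Distinct k-mers in both haplotypes divided by those in one haplotype."""
    rng = random.Random(seed)
    hap_a = random_genome(length, rng)
    hap_b = mutate(hap_a, het_rate, rng)
    return distinct_kmers([hap_a, hap_b], k) / distinct_kmers([hap_a], k)


if __name__ == "__main__":
    for rate in (0.0, 0.005, 0.1):
        print(f"divergence {rate:.3f}: ratio {apparent_size_ratio(rate):.2f}")
```

With identical haplotypes the ratio is exactly 1; as the two copies diverge, fewer k-mers are shared and the ratio climbs toward 2, which is the "one genome with double the size" picture. Real data also involves structural variation, repeats, and sequencing error, none of which this sketch models.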

jleluyer commented 4 years ago

Thanks Aleksey. So should we force PLOIDY=2?

alekseyzimin commented 4 years ago

No, do not worry about this. In this case the assembler should treat the genome as PLOIDY=1.

elbourret commented 4 years ago

I am assembling two genomes of the same species from the same population, varying only in flower mating-type (heteromorphic self-incompatibility).

Each genome was estimated as PLOIDY=1, but the estimated genome sizes are 1.5x and 2x the expected size, respectively. Is that also normal? I guess that the genome with the smaller estimated size is less heterozygous than the other, but then why doesn't masurca use PLOIDY=2 for the less heterozygous sample?

What exactly does PLOIDY=2 change during the analysis? If I want to test assembling with PLOIDY=2, can I reuse some of the intermediate files from the analysis with PLOIDY=1?

alekseyzimin commented 4 years ago

PLOIDY=2 makes the filtering of redundant contigs more aggressive. With PLOIDY=2, any contig that is contained in another, larger contig with 90% alignment identity is filtered out.
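The rule described above can be sketched roughly as follows. This is a hypothetical illustration, not MaSuRCA's code: the `Containment` record, the `filter_redundant` function, and the reading of "90% alignment identity" as a minimum threshold (identity >= 0.90) are all assumptions, and the containment hits are taken as already computed by some aligner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Containment:
    """Hypothetical record: `contig` aligns fully inside `container`."""
    contig: str
    container: str
    identity: float  # alignment identity as a fraction, 0.0 to 1.0


def filter_redundant(lengths, containments, min_identity=0.90):
    """Return contig names to keep, in the order they appear in `lengths`.

    A contig is dropped when it is contained in a strictly longer contig
    with alignment identity of at least `min_identity` (assumed threshold).
    """
    removed = set()
    for hit in containments:
        if hit.contig == hit.container:
            continue  # ignore self-alignments
        if lengths[hit.container] <= lengths[hit.contig]:
            continue  # the container must be the bigger contig
        if hit.identity >= min_identity:
            removed.add(hit.contig)
    return [name for name in lengths if name not in removed]


if __name__ == "__main__":
    lengths = {"c1": 1000, "c2": 400, "c3": 300}
    hits = [
        Containment("c2", "c1", 0.95),  # contained at 95% identity: dropped
        Containment("c3", "c1", 0.85),  # below the threshold: kept
    ]
    print(filter_redundant(lengths, hits))  # ['c1', 'c3']
```

In this picture, a divergent haplotype copy survives when its identity to the larger contig falls below the threshold, which is one way heterozygous regions could end up represented twice.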

elbourret commented 4 years ago

No change to super-reads or mega-reads? So I can reuse the same mega-reads in a PLOIDY=2 assembly?

alekseyzimin commented 4 years ago

No change, you can reuse the same data.

elbourret commented 4 years ago

I was finally able to finish the masurca hybrid assembly, using PLOIDY=1 and the genome size estimated by masurca (twice the real genome size). Using 35X long-read coverage and the Flye assembler, I got an excellent result that is very close to the expected haploid size of the genome.

The masurca assembly (contig N50 800 kb) is less contiguous than a Canu assembly (~3Gb) based on 60X nanopore. However, the masurca assembly contains fewer duplicated contigs and is missing almost no k-mers in the KAT k-mer spectra (whereas the Canu assembly is missing many k-mers from the diploid peak), so it is apparently more complete and accurate.
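For readers unfamiliar with the contiguity figure quoted above, contig N50 is the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the total assembly length. A minimal sketch, with a made-up function name and toy lengths:

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half the assembly size.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs given")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Total is 20, half is 10; 6 + 5 = 11 reaches it, so N50 is 5.
    print(n50([2, 3, 4, 5, 6]))  # 5
```

N50 only measures contiguity. It says nothing about duplicated or missing sequence, which is why the comment above also looks at k-mer spectra.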

gushiro commented 2 years ago

@elbourret How about the result with PLOIDY=2?