chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

Identical contigs in primary and alternate #124

Open ctxchris opened 3 years ago

ctxchris commented 3 years ago

Hi,

I ran hifiasm in diploid mode with default parameters on a diploid plant genome with HiFi data. The primary and alternate contig sizes are as expected and almost the same. I have a few large primary contigs (10MB - 300MB) that have an exact duplicate in the alternate contigs: exactly the same length, without a single SNP, indel, or other difference. Are they being phased based on coverage? Information from the parents showed that those contigs actually belong to the same phase.

Thanks Chris

chhylp123 commented 3 years ago

Just to make sure: do you mean there are identical contigs between p_ctg & a_ctg, or between bp.hap1 & bp.hap2?

ctxchris commented 3 years ago

Between hap1 and hap2. There is no a_ctg; I had HiFi reads as input. We see a couple of really large "duplicated" contigs that are put into different phases by hifiasm. Some show extensive variation, some show no variation at all. Parental information suggests that each "duplicate pair" actually belongs to the same phase. We're now trying to determine whether we see real duplication events in the genome or assembly artifacts. Would Hi-C data help with contig de-duplication, or rather with distinguishing between real and artificial duplication?
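For anyone who wants to reproduce the "exact duplicate" check described above, here is a minimal Python sketch. It assumes the hap1 and hap2 contigs have already been exported to uncompressed FASTA; the function names and file names are illustrative and not part of hifiasm. It reports only contigs whose full sequence is identical (in either orientation), not near-identical pairs with a few variants.

```python
import hashlib
from collections import defaultdict

# Translation table for complementing DNA (anything else is left unchanged).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_fasta(path):
    """Yield (name, sequence) pairs from an uncompressed FASTA file."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                # Keep only the first word of the header as the contig name.
                name, chunks = (line[1:].split() or [""])[0], []
            elif line:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def canonical_digest(seq):
    """Hash a sequence so that it and its reverse complement share a digest."""
    fwd = seq.upper()
    rev = fwd.translate(_COMPLEMENT)[::-1]
    return hashlib.sha256(min(fwd, rev).encode()).hexdigest()


def identical_contigs(hap1_fasta, hap2_fasta):
    """Return (hap1_name, hap2_name, length) for contigs with identical sequence."""
    seen = defaultdict(list)
    for name, seq in read_fasta(hap1_fasta):
        seen[canonical_digest(seq)].append((name, len(seq)))
    pairs = []
    for name, seq in read_fasta(hap2_fasta):
        for other, length in seen.get(canonical_digest(seq), []):
            pairs.append((other, name, length))
    return pairs
```

Calling `identical_contigs("hap1.fa", "hap2.fa")` (placeholder paths) returns a list of name pairs with their shared length. Hashing keeps memory use modest even for contigs of several hundred megabases, at the cost of detecting only exact matches.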

ctxchris commented 3 years ago

Update: We included HiC data and ran the assembly on the overlaps from the previous assembly without HiC. Some of the duplicated contigs "disappeared", so there is only one copy now. Most, but not all, of the duplicates that were "removed" have only half of the read coverage the other duplicates have. I think that makes sense. The total assembly size (phase1 + phase2) is about 600MB smaller than before without HiC.

One odd thing: one 33MB duplicate contig pair is missing completely in the HiC version, meaning there is no corresponding contig for either of the two. How can that be?

Another question: some "former" duplicates showed variation. The deduplicated new contig shows variation relative to only one of the former duplicates. How did hifiasm deal with the variation? To make that clearer:

Initial assembly: duplicate pair contig 1 and contig 2 show 100 variants.
HiC assembly: former contig 1 and contig 2 are "merged" into contig 3. Contig 3 shows variation relative to former contig 1 but not to former contig 2.

Is there more information on how the HiC data is integrated? And would it make sense to start a fresh assembly with the HiC data, without using the previous overlaps?
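The coverage comparison mentioned above could be scripted roughly as follows. This is a sketch under assumptions: it expects a contig GFA whose `S` lines carry an `LN:i:` length tag and an `rd:i:` read-depth tag (as I understand hifiasm's GFA output to do; please check your own files), and the function names, the `expected_depth` parameter, and the tolerance are all illustrative choices.

```python
def contig_coverage(gfa_path):
    """Map contig name -> (length, read depth or None) from GFA S lines."""
    cov = {}
    with open(gfa_path) as handle:
        for line in handle:
            if not line.startswith("S\t"):
                continue  # skip non-segment records
            fields = line.rstrip("\n").split("\t")
            name, seq = fields[1], fields[2]
            # Optional tags look like TAG:TYPE:VALUE, e.g. LN:i:1234 or rd:i:30.
            tags = {
                f[:2]: f[5:]
                for f in fields[3:]
                if len(f) > 5 and f[2] == ":" and f[4] == ":"
            }
            length = int(tags["LN"]) if "LN" in tags else len(seq)
            depth = int(tags["rd"]) if "rd" in tags else None
            cov[name] = (length, depth)
    return cov


def half_coverage_contigs(cov, expected_depth, tolerance=0.15):
    """Return names of contigs whose depth is about half of expected_depth."""
    flagged = []
    for name, (_length, depth) in cov.items():
        if depth is None:
            continue  # no depth tag, nothing to compare
        if abs(depth / expected_depth - 0.5) <= tolerance:
            flagged.append(name)
    return flagged
```

Here `expected_depth` would be the typical depth of the contigs you consider correctly assembled; contigs flagged at roughly half that depth are the candidates described above.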

lh3 commented 3 years ago

I can see hifiasm behaving unexpectedly, but I don't understand your description. Sorry. More specifically, what is a "duplicate"? In two haplotypes or in one haplotype? You started with "exact duplicate" but then talked about 100 variants. Are duplicates exact or not? Does an initial assembly refer to an assembly without Hi-C? If so, how do you know two contigs are merged? Assemblies with and without Hi-C are different and can't be easily compared.

ctxchris commented 3 years ago

Sorry for the confusion. In the initial assembly (without HiC), we sorted contigs based on parental patterns into parent-specific haplotypes. We have pairs of contigs that show the same parental pattern, and usually those pairs were put into different haplotypes by hifiasm, but actually belong to the same haplotype based on the parental analysis. Some of those pairs are identical in length and sequence. Others show the same distinct pattern and have a very similar length, but may differ by a few to several hundred SNPs or SVs. I called both kinds "duplicate". In the assembly including the HiC data, we no longer see most of those contig pairs, but only one contig. We aligned the initial contigs back to the HiC version and I call a contig pair merged if there is only one representative for both contigs in the HiC version. We also see the distinct parental pattern in only one contig where we previously saw two. Would you recommend starting the assembly anew with the HiC data, or is it fine to run it on the overlaps from the initial assembly?

lh3 commented 3 years ago

> In the initial assembly (without HiC)

The assembly without Hi-C is not an "initial assembly". The Hi-C assembly is not derived from that at all.

> usually those pairs were put into different haplotypes by hifiasm, but actually belong to the same haplotype based on the parental analysis

Without Hi-C, it is often not possible to achieve long-range phasing, meaning that hifiasm (or any other assembler) often mixes phases in one contig.

> We aligned the initial contigs back to the HiC version and I call a contig pair merged

You don't need to do that. When you have Hi-C, focus on the Hi-C assembly only and ignore the non-Hi-C assembly.