chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License
534 stars 87 forks source link

continuity drops in Hi-C resolved contigs, compared with primary contigs #115

Open zhangrengang opened 3 years ago

zhangrengang commented 3 years ago

Here is an example: the y-axis is hic.p_ctg and the x-axis is hic.hap1.p_ctg or hic.hap2.p_ctg. Many primary contigs are break into several haplotype contigs, for exampe ptg000020l. The N50 drops from 23 Mb to 8 Mb. Is it abnormal? hifiasm hic p_ctg gfa fa-hifiasm hic hap1 p_ctg gfa fa paf gz hifiasm hic p_ctg gfa fa-hifiasm hic hap2 p_ctg gfa fa paf gz

chhylp123 commented 3 years ago

How about Hi-C contact map of two haplotypes? Do they look ok?

zhangrengang commented 3 years ago

I have only checked the primary contigs' Hi-C contact map that is ok with just two mis-joins. The two haplotypes' Hi-C contact maps should be similar according to their colliearity to primary contigs.

baozg commented 3 years ago

primary HiC contact map cannot distiguish the haplotype switch. But if you mapping the HiC data to the hic.hap1 + hic.hap2, you may find the switch,

zhangrengang commented 3 years ago

Hi @baozg, I concern about continuity drops after Hi-C, but not haplotype switch. Continuity drops make haplotypes harder to assemble to chromosomes and lead to more gaps. But actually I do not check whether there are haplotype switch errors. I will test it.

chhylp123 commented 3 years ago

It depends on data. There are some cases that one haplotype has coverage while another does not have. For primary contig, hifiasm can assemble through those regions at the expense of introducing haplotype switch, but Hi-C phased contigs should not do that. This is why phased contigs has continuity drops.

zhangrengang commented 3 years ago

How will Hi-C phased contigs do with those regions? Absent in both phased contigs or only present in one of the phased contigs? How can I impove continuity for phased contigs.

chhylp123 commented 3 years ago

We are trying to solve it for now and hopefully can fix it in the next version. By the way, does it really affect a lot for the downstream scaffolding tools?

zhangrengang commented 3 years ago

It may not affect a lot as the N50 is still quite high. I will have a try. Here I extract an example of alignments between haplotigs and primary contigs:

# hap1
h1tg000008l     10524551        8695    10524551        +       ptg000030l      20942980        8695    11215641        7961384 12036962 60      tp:A:P  cm:i:732386     s1:i:7485968    s2:i:49363      dv:f:0.0109     rl:i:1388
h1tg000054l     9192164 18      9192149 +       ptg000030l      20942980        11218343        20438050        8520994 9291240 60       tp:A:P  cm:i:796093     s1:i:8485628    s2:i:25984      dv:f:0.0018     rl:i:1218
h1tg000209l     507464  20      507447  -       ptg000030l      20942980        20436657        20942960        390521  509798  60       tp:A:P  cm:i:31929      s1:i:389325     s2:i:6594       dv:f:0.0111     rl:i:147

# hap2
h2tg000016l     11469638        8695    11469638        +       ptg000030l      20942980        8695    12249629        9462725 12769023 60      tp:A:P  cm:i:877820     s1:i:9091036    s2:i:25998      dv:f:0.0069     rl:i:1449
h2tg000127l     2074218 12      2074200 -       ptg000030l      20942980        12250592        14415054        1684151 2203449 60       tp:A:P  cm:i:141199     s1:i:1649762    s2:i:15890      dv:f:0.0130     rl:i:252
h2tg000091l     566036  28      566019  -       ptg000030l      20942980        14415310        15011408        491210  598877  60       tp:A:P  cm:i:41341      s1:i:483832     s2:i:4163       dv:f:0.0106     rl:i:42
h2tg000069l     5724133 12      5724125 -       ptg000030l      20942980        15011457        20727664        5099799 5726944 60       tp:A:P  cm:i:434613     s1:i:5096979    s2:i:26407      dv:f:0.0078     rl:i:882
h2tg000245l     205063  19      205049  -       ptg000030l      20942980        20736567        20941553        166266  205059  60       tp:A:P  cm:i:13781      s1:i:166239     s2:i:1960       dv:f:0.0113     rl:i:42
chhylp123 commented 3 years ago

How about the N50s of hap1 and hap2 generated by hifiasm in default? By coverage, I mean the coverage of HiFi reads, so that it is hard to figure out by contig alignment.

zhangrengang commented 3 years ago

The N50s:

hifiasm.hic.hap1.p_ctg.gfa.fa.stats:  N50       8105712 bp              L50     32
hifiasm.hic.hap2.p_ctg.gfa.fa.stats:  N50       8813861 bp              L50     27
hifiasm.hic.p_ctg.gfa.fa.stats:  N50    23264050 bp             L50     10

Do you mean Hi-C reads? I am mapping them and will reply after done.

chhylp123 commented 3 years ago

No, I mean HiFi reads.

zhangrengang commented 3 years ago

You mean one of the haplotypes is not covered and assembled by HiFi reads? I will check the coverage of break points, such as ptg000030l:11215641-11218343. There are indeed some unexcepted dead ends in the unitig graph. For example, the below two unitigs are excepted to be linked but actually not. This makes hap1 breaks at ptg000030l:11215641-11218343.

utg000054l      1192086 0       1192076 -       ptg000030l      20942980        9853788 11215641        773421  1527337 60      tp:A:P   cm:i:62483      s1:i:672361     s2:i:15205      dv:f:0.0232     rl:i:168
utg002085l      1021715 18      1021696 +       ptg000030l      20942980        11218343        12262922        733573  1103871 60       tp:A:P  cm:i:63270      s1:i:704229     s2:i:5945       dv:f:0.0156     rl:i:126

360截图17040510403470

baozg commented 3 years ago

@zhangrengang Just another issue. hifiasm may missing untig of dead end. In a plant trio assembly, it missed three partenal unitig (3Mb) in the pat.p_ctg. So does these dead end untig keep in the hap1.p_ctg or hap2.p_ctg ? You can use the reads id to find the correaltion between the noseq.gfa

The yellow untigs are the missing one

zhangrengang commented 3 years ago

@baozg They keep in one of hap1 or hap2, which introduces dead ends into Hi-C phased contigs. It is why continuity drops after Hi-C phasing. It is not issue of Hi-C phasing, but issue of sequencing or graph conduction. I do not observed large missing unitigs of dead end.

zhangrengang commented 3 years ago

Coverage by HiFi reads decreases at ptg000030l:11210810 and there are many reads clipped at the position. 360截图17340912346543

chhylp123 commented 3 years ago

Then it should be the data problem.