Open zhangrengang opened 3 years ago
How about Hi-C contact map of two haplotypes? Do they look ok?
I have only checked the primary contigs' Hi-C contact map that is ok with just two mis-joins. The two haplotypes' Hi-C contact maps should be similar according to their colliearity to primary contigs.
primary HiC contact map cannot distiguish the haplotype switch. But if you mapping the HiC data to the hic.hap1 + hic.hap2, you may find the switch,
Hi @baozg, I concern about continuity drops after Hi-C, but not haplotype switch. Continuity drops make haplotypes harder to assemble to chromosomes and lead to more gaps. But actually I do not check whether there are haplotype switch errors. I will test it.
It depends on data. There are some cases that one haplotype has coverage while another does not have. For primary contig, hifiasm can assemble through those regions at the expense of introducing haplotype switch, but Hi-C phased contigs should not do that. This is why phased contigs has continuity drops.
How will Hi-C phased contigs do with those regions? Absent in both phased contigs or only present in one of the phased contigs? How can I impove continuity for phased contigs.
We are trying to solve it for now and hopefully can fix it in the next version. By the way, does it really affect a lot for the downstream scaffolding tools?
It may not affect a lot as the N50 is still quite high. I will have a try. Here I extract an example of alignments between haplotigs and primary contigs:
# hap1
h1tg000008l 10524551 8695 10524551 + ptg000030l 20942980 8695 11215641 7961384 12036962 60 tp:A:P cm:i:732386 s1:i:7485968 s2:i:49363 dv:f:0.0109 rl:i:1388
h1tg000054l 9192164 18 9192149 + ptg000030l 20942980 11218343 20438050 8520994 9291240 60 tp:A:P cm:i:796093 s1:i:8485628 s2:i:25984 dv:f:0.0018 rl:i:1218
h1tg000209l 507464 20 507447 - ptg000030l 20942980 20436657 20942960 390521 509798 60 tp:A:P cm:i:31929 s1:i:389325 s2:i:6594 dv:f:0.0111 rl:i:147
# hap2
h2tg000016l 11469638 8695 11469638 + ptg000030l 20942980 8695 12249629 9462725 12769023 60 tp:A:P cm:i:877820 s1:i:9091036 s2:i:25998 dv:f:0.0069 rl:i:1449
h2tg000127l 2074218 12 2074200 - ptg000030l 20942980 12250592 14415054 1684151 2203449 60 tp:A:P cm:i:141199 s1:i:1649762 s2:i:15890 dv:f:0.0130 rl:i:252
h2tg000091l 566036 28 566019 - ptg000030l 20942980 14415310 15011408 491210 598877 60 tp:A:P cm:i:41341 s1:i:483832 s2:i:4163 dv:f:0.0106 rl:i:42
h2tg000069l 5724133 12 5724125 - ptg000030l 20942980 15011457 20727664 5099799 5726944 60 tp:A:P cm:i:434613 s1:i:5096979 s2:i:26407 dv:f:0.0078 rl:i:882
h2tg000245l 205063 19 205049 - ptg000030l 20942980 20736567 20941553 166266 205059 60 tp:A:P cm:i:13781 s1:i:166239 s2:i:1960 dv:f:0.0113 rl:i:42
How about the N50s of hap1 and hap2 generated by hifiasm in default? By coverage, I mean the coverage of HiFi reads, so that it is hard to figure out by contig alignment.
The N50s:
hifiasm.hic.hap1.p_ctg.gfa.fa.stats: N50 8105712 bp L50 32
hifiasm.hic.hap2.p_ctg.gfa.fa.stats: N50 8813861 bp L50 27
hifiasm.hic.p_ctg.gfa.fa.stats: N50 23264050 bp L50 10
Do you mean Hi-C reads? I am mapping them and will reply after done.
No, I mean HiFi reads.
You mean one of the haplotypes is not covered and assembled by HiFi reads? I will check the coverage of break points, such as ptg000030l:11215641-11218343. There are indeed some unexcepted dead ends in the unitig graph. For example, the below two unitigs are excepted to be linked but actually not. This makes hap1 breaks at ptg000030l:11215641-11218343.
utg000054l 1192086 0 1192076 - ptg000030l 20942980 9853788 11215641 773421 1527337 60 tp:A:P cm:i:62483 s1:i:672361 s2:i:15205 dv:f:0.0232 rl:i:168
utg002085l 1021715 18 1021696 + ptg000030l 20942980 11218343 12262922 733573 1103871 60 tp:A:P cm:i:63270 s1:i:704229 s2:i:5945 dv:f:0.0156 rl:i:126
@zhangrengang Just another issue. hifiasm may missing untig of dead end. In a plant trio assembly, it missed three partenal unitig (3Mb) in the pat.p_ctg. So does these dead end untig keep in the hap1.p_ctg
or hap2.p_ctg
? You can use the reads id to find the correaltion between the noseq.gfa
The yellow untigs are the missing one
@baozg They keep in one of hap1 or hap2, which introduces dead ends into Hi-C phased contigs. It is why continuity drops after Hi-C phasing. It is not issue of Hi-C phasing, but issue of sequencing or graph conduction. I do not observed large missing unitigs of dead end.
Coverage by HiFi reads decreases at ptg000030l:11210810 and there are many reads clipped at the position.
Then it should be the data problem.
Here is an example: the y-axis is
hic.p_ctg
and the x-axis ishic.hap1.p_ctg
orhic.hap2.p_ctg
. Many primary contigs are break into several haplotype contigs, for exampeptg000020l
. The N50 drops from 23 Mb to 8 Mb. Is it abnormal?