chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

order in the trio files impacts assembly results #81

Open chklopp opened 3 years ago

chklopp commented 3 years ago

We tested both orders of the parental yak files for a 4 Gb genome and got different results (hifiasm-0.12):

-1 mat.yak -2 pat.yak
3.3G Feb 19 10:23 hifiasm_h1_h2.hap1.p_ctg.gfa
4.2G Feb 19 10:25 hifiasm_h1_h2.hap2.p_ctg.gfa

-2 mat.yak -1 pat.yak
3.7G Mar 6 17:02 hifiasm_h2_h1.hap1.p_ctg.gfa
3.0G Mar 6 17:03 hifiasm_h2_h1.hap2.p_ctg.gfa

The total genome size differs between the two runs, as do the sizes of the haplotype assemblies. We expected the same results with hap1 and hap2 simply swapped.

How do we choose the best order?

chhylp123 commented 3 years ago

Could you please rerun with v0.14 and check the results? v0.14 has an updated trio mode, so the results should be better. Besides, the size difference between the two haplotypes is too large, so I worry that something is wrong. Is your parental data reliable?

chklopp commented 3 years ago

With version 0.14, we no longer see those differences in genome size when inverting the parental yak files:

-1 mat.yak -2 pat.yak
3.7G Mar 10 11:07 hifiasm_h1_h2.hap1.p_ctg.gfa
3.0G Mar 10 11:09 hifiasm_h1_h2.hap2.p_ctg.gfa

-2 mat.yak -1 pat.yak
3.0G Mar 10 15:59 hifiasm_h2_h1.hap1.p_ctg.gfa
3.7G Mar 10 16:01 hifiasm_h2_h1.hap2.p_ctg.gfa

Does this indicate that our parental data is reliable?
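The expectation stated in this thread, that swapping -1 and -2 should simply swap hap1 and hap2, can be checked with a few lines of Python. This is only a rough sketch: it uses the GFA file sizes listed above as a proxy for assembly size, and the function name `is_mirrored` and the 5% tolerance are illustrative assumptions, not part of hifiasm or yak.

```python
def is_mirrored(run_a, run_b, tol=0.05):
    """Return True if hap1/hap2 sizes of run_b are those of run_a swapped.

    run_a, run_b: (hap1_size, hap2_size) tuples.
    tol: relative tolerance (an arbitrary choice for this sketch).
    """
    a1, a2 = run_a
    b1, b2 = run_b

    def close(x, y):
        return abs(x - y) <= tol * max(x, y)

    return close(a1, b2) and close(a2, b1)


# GFA file sizes in GB as (hap1, hap2), copied from the listings in this thread.
v012 = {"mat_first": (3.3, 4.2), "pat_first": (3.7, 3.0)}
v014 = {"mat_first": (3.7, 3.0), "pat_first": (3.0, 3.7)}

print("v0.12 mirrored:", is_mirrored(v012["mat_first"], v012["pat_first"]))
print("v0.14 mirrored:", is_mirrored(v014["mat_first"], v014["pat_first"]))
```

With the sizes reported here, the v0.12 pair is not mirrored and the v0.14 pair is. A mirrored result only shows that the run is symmetric with respect to input order; it says nothing about whether the phasing itself is correct.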

chhylp123 commented 3 years ago

Have you evaluated the phasing results with yak trioeval? I'm not quite sure, since the paternal assembly is much larger than the maternal assembly. Is this a feature of your sample? Usually, if yak trioeval reports a low hamming error rate and switch error rate, the result should be right.

chklopp commented 3 years ago

I just ran trioeval. Here are the beginning and the end of the result file:

S h1tg000001l 7571 41463 5624 1946 1946 39517
S h1tg000002l 924 9837 498 426 425 9411
S h1tg000003l 567 3800 336 230 231 3569
S h1tg000004l 4603 22246 3082 1520 1521 20725
S h1tg000005l 17255 18362 13753 3501 3501 14861
S h1tg000006l 8842 16259 7206 1635 1636 14623
S h1tg000007l 6151 19130 4440 1710 1711 17419
S h1tg000008l 1277 4131 955 321 322 3809
S h1tg000009l 2923 16358 1828 1094 1094 15264
S h1tg000010l 2503 5559 1959 543 544 5015

S h1tg005546l 56 2 54 1 1 1
S h1tg005547l 28 3 26 1 1 2
S h1tg005548l 42 5 39 2 2 3
S h1tg005549l 2 0 1 0 0 0
S h1tg005550l 2 293 0 2 2 290
S h1tg005551l 6 3 4 1 2 1
S h1tg005552l 1 0 0 0 0 0
S h1tg005553l 4 2 3 0 1 1
W 826661 8995052 0.091902
H 1622964 8999853 0.180332

How do I interpret these figures?

chhylp123 commented 3 years ago

W 826661 8995052 0.091902

is the switch error rate.

H 1622964 8999853 0.180332

is the hamming error rate.

The hamming error rate is too high. How do you get the parental data?
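To make the two summary lines easier to read, here is a minimal Python sketch that parses them and recomputes the rate. It relies on one observation from the numbers quoted above: the last field matches the ratio of the first two numeric fields. The labels `errors` and `total` and the helper `parse_summary` are names chosen for this sketch, not terms taken from the yak documentation.

```python
# Tag meanings as given in this thread.
NAMES = {"W": "switch error rate", "H": "hamming error rate"}


def parse_summary(line):
    """Split a 'W' or 'H' summary line into (tag, errors, total, rate).

    'errors' and 'total' are labels assumed for this sketch.
    """
    tag, errors, total, rate = line.split()
    return tag, int(errors), int(total), float(rate)


for line in ("W 826661 8995052 0.091902", "H 1622964 8999853 0.180332"):
    tag, errors, total, reported = parse_summary(line)
    recomputed = errors / total
    print(f"{NAMES[tag]}: {errors}/{total} = {recomputed:.6f} (reported {reported})")
```

Read this way, roughly 9% for the switch error rate and 18% for the hamming error rate are the values being called too high in this thread.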

chklopp commented 3 years ago

From short reads. What is the expected range and the possible meanings of this value? Can the fact that the genome underwent a recent whole genome duplication have an impact on this metric?

chklopp commented 3 years ago

Just one more thing: the assembly I provided to trioeval is hap1. Is this OK, or should I concatenate both haplotypes?

lh3 commented 3 years ago

The phasing error rate is fairly high. Reiterating Haoyu's question:

How do you get the parental data?

Are you sure the parental data are correct?

chklopp commented 3 years ago

From parental short reads. I am quite sure, although I did not perform the sampling, library preparation, and sequencing myself.

chhylp123 commented 3 years ago

Sorry for the late reply. Something does not seem right. What is the size of the primary assembly generated by hifiasm? Usually the hamming error rate is not this high.

chklopp commented 3 years ago

Here are the metrics of the primary assembly:

Number of scaffolds 6957
Total size of scaffolds 3191815322
N50 scaffold length 5408000
L50 scaffold count 125
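For readers unfamiliar with the last two metrics, the standard definitions of N50 and L50 can be sketched in Python as follows. The function name `n50_l50` and the toy lengths are made up for illustration; the actual figures above come from whatever statistics tool was used, which is not named in this thread.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of sequence lengths.

    N50: length of the sequence at which the running total, taken from
         longest to shortest, first reaches half of the total size.
    L50: number of sequences needed to reach that point.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("lengths must be non-empty")


# Toy example with made-up lengths (total 150, half 75).
print(n50_l50([50, 40, 30, 20, 10]))  # -> (40, 2)
```

Applied to the reported metrics, an N50 of 5408000 with an L50 of 125 means that the 125 longest scaffolds cover at least half of the 3191815322 bp total.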

chhylp123 commented 3 years ago

I have no idea. But I still think it is more likely that the parental data has some issues. Based on the primary assembly size of your sample, the size of each haplotype should be around 3.2-3.3Gb. However, both the hamming error rate and the haplotype-resolved genome size indicate that the trio-binning phasing failed.

chhylp123 commented 3 years ago

Wait... Could you please rerun hifiasm with the current GitHub HEAD? It fixes a relatively serious bug in trio-binning mode. This might help to get two haplotypes of similar size, but it probably still cannot fix the high hamming error rate.