chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License
528 stars 86 forks source link

low contig N50 for a plant genome with ~100x HiFi reads #95

Open JiaoLaboratory opened 3 years ago

JiaoLaboratory commented 3 years ago

I am assembling a plant genome (~250M) with hifiasm. First, I extract the ccs reads using : ./bam2fasta D01.ccs.bam -o D01.ccs.fasta ( Is it right for extract hifi reads ?)

The total CCS reads (D01.ccs.fasta ) have achieved to ~25g (median size ~15kb), which is about ~100x coverage of the whole genome.

  1. My genomscope estimation from the Illumina survey data is: 246M, 0.53% het, repeat 1.78%( k=31)

fc17bbd7b3f88de4714979afe045534

  1. I run hifiasm as: ./hifiasm -o D01 -t40 -z20 -l0 D01.ccs.fasta.gz [M::ha_analyze_count] lowest: count[7] = 1070066 [M::ha_analyze_count] highest: count[83] = 3422475 [M::ha_hist_line] 2: ****> 8582511 [M::ha_hist_line] 3: ** 2544542 [M::ha_hist_line] 4: ***** 1465034 [M::ha_hist_line] 5: ** 1159787 [M::ha_hist_line] 6: **** 1102859 [M::ha_hist_line] 7: * 1070066 [M::ha_hist_line] 8: * 1126174 [M::ha_hist_line] 9: ** 1177591 [M::ha_hist_line] 10: **** 1242549 [M::ha_hist_line] 11: *** 1335486 [M::ha_hist_line] 12: * 1392200 [M::ha_hist_line] 13: ** 1447494 [M::ha_hist_line] 14: *** 1474424 [M::ha_hist_line] 15: * 1466899 [M::ha_hist_line] 16: ** 1452319 [M::ha_hist_line] 17: ** 1447038 [M::ha_hist_line] 18: ** 1433518 [M::ha_hist_line] 19: **** 1379260 [M::ha_hist_line] 20: * 1351781 [M::ha_hist_line] 21: ** 1314744 [M::ha_hist_line] 22: ** 1312702 [M::ha_hist_line] 23: ** 1287873 [M::ha_hist_line] 24: ** 1310804 [M::ha_hist_line] 25: ** 1300939 [M::ha_hist_line] 26: ** 1302460 [M::ha_hist_line] 27: **** 1363986 [M::ha_hist_line] 28: * 1411913 [M::ha_hist_line] 29: *** 1456777 [M::ha_hist_line] 30: **** 1493614 [M::ha_hist_line] 31: * 1556805 [M::ha_hist_line] 32: *** 1613952 [M::ha_hist_line] 33: ** 1699784 [M::ha_hist_line] 34: * 1747980 [M::ha_hist_line] 35: *** 1828847 [M::ha_hist_line] 36: * 1890213 [M::ha_hist_line] 37: **** 1921154 [M::ha_hist_line] 38: ** 1989868 [M::ha_hist_line] 39: ***** 2031391 [M::ha_hist_line] 40: **** 2048604 [M::ha_hist_line] 41: **** 2047496 [M::ha_hist_line] 42: * 2073141 [M::ha_hist_line] 43: ** 2052986 [M::ha_hist_line] 44: * 2011888 [M::ha_hist_line] 45: * 1965258 [M::ha_hist_line] 46: ** 1924167 [M::ha_hist_line] 47: * 1877411 [M::ha_hist_line] 48: * 1816943 [M::ha_hist_line] 49: **** 1776930 [M::ha_hist_line] 50: ** 1719916 [M::ha_hist_line] 51: * 1661281 [M::ha_hist_line] 52: 1620169 [M::ha_hist_line] 53: 1599061 [M::ha_hist_line] 54: * 1594456 [M::ha_hist_line] 55: ** 1566739 [M::ha_hist_line] 56: ** 1561549 [M::ha_hist_line] 57: ** 1583826 [M::ha_hist_line] 58: ** 1585321 [M::ha_hist_line] 59: * 1607981 [M::ha_hist_line] 60: **** 1654549 [M::ha_hist_line] 61: ** 1707507 [M::ha_hist_line] 62: **** 1768980 [M::ha_hist_line] 63: ** 1853242 [M::ha_hist_line] 64: **** 1930437 [M::ha_hist_line] 65: * 2030973 [M::ha_hist_line] 66: ** 2115228 [M::ha_hist_line] 67: *** 2209336 [M::ha_hist_line] 68: * 2284276 [M::ha_hist_line] 69: *** 2369543 [M::ha_hist_line] 70: **** 2457390 [M::ha_hist_line] 71: ** 2549635 [M::ha_hist_line] 72: ** 2656888 [M::ha_hist_line] 73: **** 2745392 [M::ha_hist_line] 74: **** 2861229 [M::ha_hist_line] 75: ** 2951134 [M::ha_hist_line] 76: * 3045075 [M::ha_hist_line] 77: *** 3117627 [M::ha_hist_line] 78: ** 3205273 [M::ha_hist_line] 79: **** 3283038 [M::ha_hist_line] 80: * 3335383 [M::ha_hist_line] 81: *** 3373213 [M::ha_hist_line] 82: * 3404234 [M::ha_hist_line] 83: **** 3422475 [M::ha_hist_line] 84: * 3381901 [M::ha_hist_line] 85: ** 3343577 [M::ha_hist_line] 86: **** 3280499 [M::ha_hist_line] 87: ** 3230718 [M::ha_hist_line] 88: *** 3181836 [M::ha_hist_line] 89: ** 3079834 [M::ha_hist_line] 90: ** 2955895 [M::ha_hist_line] 91: * 2854915 [M::ha_hist_line] 92: ***** 2719590 [M::ha_hist_line] 93: **** 2598707 [M::ha_hist_line] 94: **** 2475612 [M::ha_hist_line] 95: **** 2324640 [M::ha_hist_line] 96: * 2170301 [M::ha_hist_line] 97: **** 2035533 [M::ha_hist_line] 98: 1893794 [M::ha_hist_line] 99: ***** 1731699 [M::ha_hist_line] 100: ** 1580927 [M::ha_hist_line] 101: ** 1429793 [M::ha_hist_line] 102: ** 1291054 [M::ha_hist_line] 103: ** 1165784 [M::ha_hist_line] 104: * 1054571 [M::ha_hist_line] 105: **** 944947 [M::ha_hist_line] 106: **** 836723 [M::ha_hist_line] 107: ** 733451 [M::ha_hist_line] 108: 640360 [M::ha_hist_line] 109: **** 556540 [M::ha_hist_line] 110: ** 477577 [M::ha_hist_line] 111: **** 418931 [M::ha_hist_line] 112: ** 365931 [M::ha_hist_line] 113: 313910 [M::ha_hist_line] 114: **** 268460 [M::ha_hist_line] 115: * 229557 [M::ha_hist_line] 116: **** 203363 [M::ha_hist_line] 117: * 176964 [M::ha_hist_line] 118: 153004 [M::ha_hist_line] 119: 131473 [M::ha_hist_line] 120: 117893 [M::ha_hist_line] 121: 104808 [M::ha_hist_line] 122: 92985 [M::ha_hist_line] 123: 83603 [M::ha_hist_line] 124: 76969 [M::ha_hist_line] 125: 71612 [M::ha_hist_line] 126: 67939 [M::ha_hist_line] 127: 65368 [M::ha_hist_line] 128: 61719 [M::ha_hist_line] 129: 60122 [M::ha_hist_line] 130: 58246 [M::ha_hist_line] 131: 55626 [M::ha_hist_line] 132: 53265 ... ... [M::ha_hist_line] 209: 17529 [M::ha_hist_line] 210: 17648 [M::ha_hist_line] 211: 17369 [M::ha_hist_line] 212: 17456 [M::ha_hist_line] 213: 17326 [M::ha_hist_line] 214: * 17708 [M::ha_hist_line] rest: ***** 2986097 [M::ha_analyze_count] left: count[42] = 2073141 [M::ha_analyze_count] right: none [M::ha_ft_gen] peak_hom: 83; peak_het: 42 [M::ha_ft_gen::900.5307.70@20.758GB] ==> filtered out 1400274 k-mers occurring 415 or more times [M::ha_opt_update_cov] updated max_n_chain to 415 [M::ha_pt_gen::1415.7557.02] ==> counted 22518887 distinct minimizer k-mers [M::ha_pt_gen] count[4095] = 0 (for sanity check) [M::ha_analyze_count] lowest: count[9] = 60202 [M::ha_analyze_count] highest: count[83] = 137767 ... ...

[M::ha_hist_line] rest: ***** 56465 [M::ha_analyze_count] left: count[42] = 90940 [M::ha_analyze_count] right: none [M::ha_pt_gen] peak_hom: 83; peak_het: 42 [M::ha_pt_gen::16376.89436.41] ==> indexed 607603306 positions [M::ha_assemble::17210.47236.58@48.516GB] ==> found overlaps for the final round [M::ha_print_ovlp_stat] # overlaps: 193909791 [M::ha_print_ovlp_stat] # strong overlaps: 50725674 [M::ha_print_ovlp_stat] # weak overlaps: 143184117 [M::ha_print_ovlp_stat] # exact overlaps: 190321162 [M::ha_print_ovlp_stat] # inexact overlaps: 3588629 [M::ha_print_ovlp_stat] # overlaps without large indels: 193798087 [M::ha_print_ovlp_stat] # reverse overlaps: 38380556 Writing reads to disk... Reads has been written. Writing ma_hit_ts to disk... ma_hit_ts has been written. Writing ma_hit_ts to disk... ma_hit_ts has been written. bin files have been written. [M::purge_dups] purge duplication coverage threshold: 103 [M::purge_dups] purge duplication coverage threshold: 103 [M::purge_dups] purge duplication coverage threshold: 103 Writing raw unitig GFA to disk... Writing processed unitig GFA to disk... [M::purge_dups] purge duplication coverage threshold: 103 [M::adjust_utg_by_primary] primary contig coverage range: [69, infinity] Writing primary contig GFA to disk... Writing alternate contig GFA to disk... Inconsistency threshold for low-quality regions in BED files: 70% [M::main] Version: 0.14.2-r315 [M::main] CMD: hifiasm -o D01 -t40 -z20 -l0 D01.ccs.fasta.gz [M::main] Real time: 17717.850 sec; CPU: 630129.383 sec; Peak RSS: 48.516 GB

And Hifiasm seems to regards the hom and het peak (83 and 42, respectively) correctly.

  1. I got my final stat: >less D01.p_ctg.gfa.fa.stat: Main genome scaffold total: 1848 Main genome contig total: 1848 Main genome scaffold sequence total: 506.7 MB Main genome contig sequence total: 506.7 MB (-> 0.0% gap) Main genome scaffold N/L50: 136/786.8 KB Main genome contig N/L50: 136/786.8 KB Number of scaffolds > 50 KB: 1016 % main genome in scaffolds > 50 KB: 94.3% Minimum Number Number Total Total Scaffold Scaffold of of Scaffold Contig Contig Length Scaffolds Contigs Length Length Coverage

    All 1,848 1,848 506,704,342 506,704,342 100.00% 1 kb 1,848 1,848 506,704,342 506,704,342 100.00% 2.5 kb 1,848 1,848 506,704,342 506,704,342 100.00% 5 kb 1,848 1,848 506,704,342 506,704,342 100.00% 10 kb 1,848 1,848 506,704,342 506,704,342 100.00% 25 kb 1,780 1,780 505,253,812 505,253,812 100.00% 50 kb 1,016 1,016 477,928,721 477,928,721 100.00% 100 kb 777 777 461,239,143 461,239,143 100.00% 250 kb 503 503 416,969,146 416,969,146 100.00% 500 kb 264 264 333,195,289 333,195,289 100.00% 1 mb 94 94 216,428,804 216,428,804 100.00% 2.5 mb 16 16 110,129,116 110,129,116 100.00% 5 mb 7 7 77,143,401 77,143,401 100.00%

>less D01.a_ctg.gfa.fa.stat: Main genome scaffold total: 908 Main genome contig total: 908 Main genome scaffold sequence total: 26.9 MB Main genome contig sequence total: 26.9 MB (-> 0.0% gap) Main genome scaffold N/L50: 245/30.5 KB Main genome contig N/L50: 245/30.5 KB Number of scaffolds > 50 KB: 42 % main genome in scaffolds > 50 KB: 22.6% Minimum Number Number Total Total Scaffold Scaffold of of Scaffold Contig Contig Length Scaffolds Contigs Length Length Coverage


All       908        908   26,919,675   26,919,675   100.00%

1 kb 908 908 26,919,675 26,919,675 100.00% 2.5 kb 908 908 26,919,675 26,919,675 100.00% 5 kb 908 908 26,919,675 26,919,675 100.00% 10 kb 908 908 26,919,675 26,919,675 100.00% 25 kb 381 381 17,255,254 17,255,254 100.00% 50 kb 42 42 6,071,312 6,071,312 100.00% 100 kb 23 23 4,805,273 4,805,273 100.00% 250 kb 7 7 2,217,969 2,217,969 100.00% 500 kb 0 0 0 0 0.00%

I am confused that my final primary assembly.size (506.7MB) is about 2x larger than my genomscope estimation(246MB),; and the size of contig N50 is just 786.8 KB;

  1. I have try to run Purge-dups option and set to -l2

hifiasm -o D01 -t40 -z20 -l2 D01.ccs.fasta.gz

Reads has been loaded. Loading ma_hit_ts from disk... ma_hit_ts has been read. Loading ma_hit_ts from disk... ma_hit_ts has been read. [M::ha_assemble::58.059*1.00] ==> loaded corrected reads and overlaps from disk [M::purge_dups] purge duplication coverage threshold: 103 [M::purge_dups] purge duplication coverage threshold: 103 [M::purge_dups] purge duplication coverage threshold: 103 Writing raw unitig GFA to disk... Writing processed unitig GFA to disk... [M::purge_dups] purge duplication coverage threshold: 103 [M::purge_dups] purge duplication coverage threshold: 103 [M::adjust_utg_by_primary] primary contig coverage range: [69, infinity] Writing primary contig GFA to disk... Writing alternate contig GFA to disk... Inconsistency threshold for low-quality regions in BED files: 70% [M::main] Version: 0.14.2-r315 [M::main] CMD: hifiasm -o D01 -t40 -z20 D01.ccs.fasta.gz [M::main] Real time: 472.749 sec; CPU: 511.921 sec; Peak RSS: 17.147 GB

less D01.p_ctg.purge_l2.fa.stat Main genome scaffold total: 1400 Main genome contig total: 1400 Main genome scaffold sequence total: 419.3 MB Main genome contig sequence total: 419.3 MB (-> 0.0% gap) Main genome scaffold N/L50: 81/1.0 MB Main genome contig N/L50: 81/1.0 MB Number of scaffolds > 50 KB: 662 % main genome in scaffolds > 50 KB: 93.9% Minimum Number Number Total Total Scaffold Scaffold of of Scaffold Contig Contig Length Scaffolds Contigs Length Length Coverage All 1,400 1,400 419,294,817 419,294,817 100.00% 1 kb 1,400 1,400 419,294,817 419,294,817 100.00% 2.5 kb 1,400 1,400 419,294,817 419,294,817 100.00% 5 kb 1,400 1,400 419,294,817 419,294,817 100.00% 10 kb 1,400 1,400 419,294,817 419,294,817 100.00% 25 kb 1,346 1,346 418,127,419 418,127,419 100.00% 50 kb 662 662 393,797,369 393,797,369 100.00% 100 kb 513 513 383,968,649 383,968,649 100.00% 250 kb 367 367 361,046,294 361,046,294 100.00% 500 kb 219 219 307,877,656 307,877,656 100.00% 1 mb 86 86 215,047,815 215,047,815 100.00% 2.5 mb 18 18 119,642,932 119,642,932 100.00% 5 mb 7 7 79,982,021 79,982,021 100.00%

less D01.a_ctg.purge_l2.fa.stat: Main genome scaffold total: 1717 Main genome contig total: 1717 Main genome scaffold sequence total: 121.5 MB Main genome contig sequence total: 121.5 MB (-> 0.0% gap) Main genome scaffold N/L50: 158/202.8 KB Main genome contig N/L50: 158/202.8 KB Number of scaffolds > 50 KB: 419 % main genome in scaffolds > 50 KB: 73.0% Minimum Number Number Total Total Scaffold Scaffold of of Scaffold Contig Contig Length Scaffolds Contigs Length Length Coverage


All     1,717      1,717  121,455,511  121,455,511   100.00%

1 kb 1,717 1,717 121,455,511 121,455,511 100.00% 2.5 kb 1,717 1,717 121,455,511 121,455,511 100.00% 5 kb 1,717 1,717 121,455,511 121,455,511 100.00% 10 kb 1,717 1,717 121,455,511 121,455,511 100.00% 25 kb 987 987 107,920,802 107,920,802 100.00% 50 kb 419 419 88,720,682 88,720,682 100.00% 100 kb 279 279 78,658,916 78,658,916 100.00% 250 kb 122 122 52,782,938 52,782,938 100.00% 500 kb 26 26 19,819,999 19,819,999 100.00% 1 mb 6 6 7,011,209 7,011,209 100.00% 2.5 mb 0 0 0 0 0.00%

The final size (419MB) is still seriously larger than the genomscope estimation(246MB); And the N50 did not look better; By the way, I have run the wtdbg2. (total size: 296156554, N50 437236, ContigNum: 1631). Could please give me some advice?

chhylp123 commented 3 years ago

For now you can have a try with '-l3'. We will also make a new release soon, which probably can lead to better N50. However, the low N50 of your assembly probably caused more by data itself. Could you please check why contigs are broken? You can map reads back to assembly and check the alignment around contig ends.

StevenBai97 commented 3 years ago

For now you can have a try with '-l3'. We will also make a new release soon, which probably can lead to better N50. However, the low N50 of your assembly probably caused more by data itself. Could you please check why contigs are broken? You can map reads back to assembly and check the alignment around contig ends.

Hi. When will the new version be released? I used Hi-C mode but it was unstable. Will the new version with stable Hi-C mode lead to less misassembly and more accuracy? I'm looking forward to it.

chhylp123 commented 3 years ago

It's here: https://github.com/chhylp123/hifiasm/releases/tag/0.15. Hope it works.