chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License
526 stars 86 forks source link

Optimised parameters for highly repetitive genomes #385

Open deoliveira86 opened 1 year ago

deoliveira86 commented 1 year ago

Dear all,

I have been running hifiasm on two very repetitive genomes (>70% repeat content) using relatively deep HiFi libraries (60x-80x). The results are not that great and the genome assemblies are too fragmented (even considering the non-optimal insert size of the PacBio libraries generated - N50 ~ 10Kb). I check the FAQ and it looks like I can play around with some parameters, mainly -D or -N, to improve my assemblies. In the FAQ it is stated:

Raising -D or -N may improve the resolution of repetitive regions but takes longer time. These two options affect all types of assemblies and usually do not have a negative impact on the assembly quality.

However, it is not specified what how high should I raise the default values. Considering that I am losing many true high-frequency k-mers due the high repeat content of my genome, and that might be one of the causes of my genome fragmentation, I am thinking to set unrealistic high numbers for -D and -N (e.g., 1000000) to overcome this issue. Is this advisable? Would another set of parameters be also useful to assemble highly repetitive genomes?

Best and I am looking forward for some guidance, André

chhylp123 commented 1 year ago

How fragmented it is? If it is too fragmented, there might be some other issues.

deoliveira86 commented 1 year ago

Hello @chhylp123, thanks for answering so quickly.

The primary hifiasm assemblies I got contain roughly ~13k contigs and a N50 ~500Kb. When I said the assemblies are fragmented I based on the CLR Canu assemblies I have for the same species (~7k contigs and N50 of ~300kb). I was investigating the causes of the fragmentation and I did some sanity checks, first:

  1. Checking if there were any contaminants in the HiFi reads: for that I mapped the HiFi libraries against the CLR assemblies I have and the mapping rates were above 94%, indicating that there is no contamination issue.
  2. Estimating the genome sizes using the HiFi data: I used the complete and mapped HiFi reads to estimate the genome sizes of my organisms, as well as the repeat content, and heterozogosity and the results are pretty much overlapping with my old estimations. I don't see any problems in the quality of the HiFi reads. As a matter of fact, I generated ultra-HiFi reads filtering the HiFi libraries and retrieving only reads with QV>30. Even with this aggressive filtering I still have the minimum required coverage necessary for hifiasm to work (13x per haplotype).

So my questions, now based on my explanations (sorry for the really not-so-scientific initial post), are:

  1. Do you think adjusting the -D and -N parameters would improve the contiguity of my assembly?
  2. Do you recommend running independent runs with HiFi (QV20) and UltraHifi reads (QV30) to check if there are differences in the final assemblies? There are three rounds of error-correction steps before assembling, so I am wondering if it is actually worth trying assembling the ultraHiFi reads.

bs: I realised as well using the kmer spectra and smudgeplot that one of my organisms are tetraploid. So, I have no high hopes of getting a really contiguous assembly of this one.

Thanks for your help, André

deoliveira86 commented 1 year ago

Hello @chhylp123 ,

I have been looking more in-depth into my logs and it seems that many more problems are present with my run. From the first histogram I can see that hifiasm guessed incorrectly the homozygous and heterozygous peaks:

[M::ha_analyze_count] lowest: count[8] = 11679204 [M::ha_analyze_count] highest: count[17] = 19956681 [M::ha_hist_line] 2: ****> 145302474 [M::ha_hist_line] 3: ****> 60436822 [M::ha_hist_line] 4: ****> 34614646 [M::ha_hist_line] 5: ****> 21814476 [M::ha_hist_line] 6: ** 15486457 [M::ha_hist_line] 7: **** 12634884 [M::ha_hist_line] 8: 11679204 [M::ha_hist_line] 9: * 11818489 [M::ha_hist_line] 10: **** 12795496 [M::ha_hist_line] 11: ** 14046199 [M::ha_hist_line] 12: ** 15561287 [M::ha_hist_line] 13: *** 16984008 [M::ha_hist_line] 14: * 18191044 [M::ha_hist_line] 15: **** 19171775 [M::ha_hist_line] 16: ***** 19810701 [M::ha_hist_line] 17: **** 19956681 [M::ha_hist_line] 18: ** 19563846 [M::ha_hist_line] 19: ** 18759511 [M::ha_hist_line] 20: *** 17679955 [M::ha_hist_line] 21: ** 16426665 [M::ha_hist_line] 22: * 15017106 [M::ha_hist_line] 23: ***** 13682918 [M::ha_hist_line] 24: ** 12462893 [M::ha_hist_line] 25: * 11410230 [M::ha_hist_line] 26: * 10623346 [M::ha_hist_line] 27: ** 10061404 [M::ha_hist_line] 28: * 9698887 [M::ha_hist_line] 29: 9457651 [M::ha_hist_line] 30: 9404224 [M::ha_hist_line] 31: * 9403254 [M::ha_hist_line] 32: **** 9509125 [M::ha_hist_line] 33: **** 9544156 [M::ha_hist_line] 34: **** 9585336 [M::ha_hist_line] 35: **** 9606608 [M::ha_hist_line] 36: *** 9710124 [M::ha_hist_line] 37: 9735350 [M::ha_hist_line] 38: 9683761 [M::ha_hist_line] 39: **** 9638125 [M::ha_hist_line] 40: **** 9510340 [M::ha_hist_line] 41: 9421320 [M::ha_hist_line] 42: 9327132 [M::ha_hist_line] 43: ** 9227396 [M::ha_hist_line] 44: ** 9197687 [M::ha_hist_line] 45: ** 9118461 [M::ha_hist_line] 46: 9032115 [M::ha_hist_line] 47: 8962862 [M::ha_hist_line] 48: * 9044835 [M::ha_hist_line] 49: ** 9089925 [M::ha_hist_line] 50: ** 9147819 [M::ha_hist_line] 51: ** 9197964 [M::ha_hist_line] 52: *** 9283452 [M::ha_hist_line] 53: 9393173 [M::ha_hist_line] 54: 9467882 [M::ha_hist_line] 55: **** 9593509 [M::ha_hist_line] 56: **** 9607272 [M::ha_hist_line] 57: 9729602 [M::ha_hist_line] 58: 9815127 [M::ha_hist_line] 59: * 9815411 [M::ha_hist_line] 60: ** 9891586 [M::ha_hist_line] 61: ** 9961505 [M::ha_hist_line] 62: ** 10012431 [M::ha_hist_line] 63: ** 10041774 [M::ha_hist_line] 64: ** 9987964 [M::ha_hist_line] 65: ** 9995814 [M::ha_hist_line] 66: ** 9915976 [M::ha_hist_line] 67: * 9798204 [M::ha_hist_line] 68: * 9704423 [M::ha_hist_line] 69: ** 9557415 [M::ha_hist_line] 70: * 9457241 [M::ha_hist_line] 71: * 9304365 [M::ha_hist_line] 72: ** 9127224 [M::ha_hist_line] 73: **** 8875247 [M::ha_hist_line] 74: * 8559737 [M::ha_hist_line] 75: * 8264445 [M::ha_hist_line] 76: **** 7977267 [M::ha_hist_line] 77: ** 7678267 [M::ha_hist_line] 78: * 7342879 [M::ha_hist_line] 79: ** 7006039 [M::ha_hist_line] 80: 6635521 [M::ha_hist_line] 81: **** 6286913 [M::ha_hist_line] 82: ** 5937857 [M::ha_hist_line] 83: **** 5546121 [M::ha_hist_line] 84: ** 5198401 [M::ha_hist_line] 85: **** 4872826 [M::ha_hist_line] 86: ** 4534228 [M::ha_hist_line] 87: 4194903 [M::ha_hist_line] 88: **** 3905407 [M::ha_hist_line] 89: ** 3601831 [M::ha_hist_line] 90: ** 3330844 [M::ha_hist_line] 91: 3049526 [M::ha_hist_line] 92: ** 2759729 [M::ha_hist_line] 93: * 2553021 [M::ha_hist_line] 94: ** 2340541 [M::ha_hist_line] 95: * 2145361 [M::ha_hist_line] 96: ** 1958096 [M::ha_hist_line] 97: * 1781076 [M::ha_hist_line] 98: ** 1615652 [M::ha_hist_line] 99: * 1486306 [M::ha_hist_line] 100: * 1349445 [M::ha_hist_line] 101: **** 1233669 [M::ha_hist_line] 102: ** 1126847 [M::ha_hist_line] 103: * 1032527 [M::ha_hist_line] 104: * 952525 [M::ha_hist_line] 105: ** 887802 [M::ha_hist_line] 106: 829874 [M::ha_hist_line] 107: 790556 [M::ha_hist_line] 108: *** 739644 [M::ha_hist_line] 109: 692297 [M::ha_hist_line] 110: 644661 [M::ha_hist_line] 111: 611743 [M::ha_hist_line] 112: 583629 [M::ha_hist_line] 113: 559103 [M::ha_hist_line] 114: 527284 [M::ha_hist_line] 115: 504927 [M::ha_hist_line] 116: 488361 [M::ha_hist_line] 117: 467597 [M::ha_hist_line] 118: 450704 [M::ha_hist_line] 119: 436627 [M::ha_hist_line] 120: 425041 [M::ha_hist_line] 121: 413458 [M::ha_hist_line] 122: 404118 [M::ha_hist_line] 123: 389630 [M::ha_hist_line] 124: 383743 [M::ha_hist_line] 125: 376041 [M::ha_hist_line] 126: 365101 [M::ha_hist_line] 127: 358619 [M::ha_hist_line] 128: 347929 [M::ha_hist_line] 129: 336819 [M::ha_hist_line] 130: 333450 [M::ha_hist_line] 131: 325973 [M::ha_hist_line] 132: 319702 [M::ha_hist_line] 133: 313799 [M::ha_hist_line] 134: 309566 [M::ha_hist_line] 135: 304119 [M::ha_hist_line] 136: 296935 [M::ha_hist_line] 137: 288307 [M::ha_hist_line] 138: 284828 [M::ha_hist_line] 139: 280656 [M::ha_hist_line] 140: 277848 [M::ha_hist_line] 141: 269233 [M::ha_hist_line] 142: 263987 [M::ha_hist_line] 143: 260061 [M::ha_hist_line] 144: 251999 [M::ha_hist_line] 145: 250445 [M::ha_hist_line] 146: 246546 [M::ha_hist_line] 147: 241293 [M::ha_hist_line] 148: 240424 [M::ha_hist_line] 149: 233721 [M::ha_hist_line] 150: 229937 [M::ha_hist_line] 151: 225499 [M::ha_hist_line] 152: 223910 [M::ha_hist_line] 153: 215218 [M::ha_hist_line] 154: 214814 [M::ha_hist_line] 155: 212414 [M::ha_hist_line] 156: 207643 [M::ha_hist_line] 157: 206194 [M::ha_hist_line] 158: 201213 [M::ha_hist_line] 159: 198221 [M::ha_hist_line] 160: 191964 [M::ha_hist_line] 161: 190480 [M::ha_hist_line] 162: 187920 [M::ha_hist_line] 163: 183118 [M::ha_hist_line] 164: 181969 [M::ha_hist_line] 165: 179575 [M::ha_hist_line] 166: 176828 [M::ha_hist_line] 167: 176792 [M::ha_hist_line] 168: 172308 [M::ha_hist_line] 169: 167354 [M::ha_hist_line] 170: 164825 [M::ha_hist_line] 171: 161483 [M::ha_hist_line] 172: 162846 [M::ha_hist_line] 173: 159886 [M::ha_hist_line] 174: 157258 [M::ha_hist_line] 175: 153493 [M::ha_hist_line] 176: 148997 [M::ha_hist_line] 177: 145954 [M::ha_hist_line] 178: 145244 [M::ha_hist_line] 179: 144501 [M::ha_hist_line] 180: 140359 [M::ha_hist_line] 181: 138636 [M::ha_hist_line] 182: 140018 [M::ha_hist_line] 183: 135632 [M::ha_hist_line] 184: 136020 [M::ha_hist_line] 185: 132477 [M::ha_hist_line] 186: 135719 [M::ha_hist_line] 187: 133683 [M::ha_hist_line] 188: 130723 [M::ha_hist_line] 189: 128740 [M::ha_hist_line] 190: 126420 [M::ha_hist_line] 191: 121487 [M::ha_hist_line] 192: 122507 [M::ha_hist_line] 193: 119770 [M::ha_hist_line] 194: 115733 [M::ha_hist_line] 195: 115267 [M::ha_hist_line] 196: 115383 [M::ha_hist_line] 197: 115888 [M::ha_hist_line] 198: 114891 [M::ha_hist_line] 199: 114816 [M::ha_hist_line] 200: 113248 [M::ha_hist_line] 201: 108648 [M::ha_hist_line] 202: 106207 [M::ha_hist_line] 203: 107001 [M::ha_hist_line] 204: 104715 [M::ha_hist_line] 205: 104146 [M::ha_hist_line] 206: * 102514 [M::ha_hist_line] rest: ****> 20508566 [M::ha_analyze_count] left: none [M::ha_analyze_count] right: none [M::ha_ft_gen] peak_hom: 17; peak_het: -1

Which is odd, because my GenomeScope2 plots are pretty clean cut (at least to me) (see below):

linear_plot

From the plot I can clearly see that the homozygous and heterozygous peaks have ~74x and ~30x coverage, respectively. Is there a reason hifiasm is guessing wrongly these values? I can see that this was extensively discussing in the issues #55, #78, #156, however, I still did not find a solution for hifiasm to identify correctly the two peaks, either by forcing it (with --hom-cov) or leaving by the defaults.

Best, André

chhylp123 commented 1 year ago

If it is tetraploid, then peak_hom = 17 might be right since it is roughly equal to the coverage per haplotype. For your sample, increasing -D to 20 or -N to 400 might be helpful. But in any way, the assembly should not be such fragmented unless it is a huge genome. Could you please also have a try with ultra-HiFi reads if the coverage is enough? To make sure it is the issue of hifiasm or data, another solution is to run HiCanu (instead of Canu). If HiCanu also does not work well, I guess it might be caused by the data quality.

deoliveira86 commented 1 year ago

Hello @chhylp123,

Thanks for the input. I will run HiCanu with the ultra-hifi and play a little bit with hifiasm following your parameter suggestions. I will close this thread and re-opened when I have more idea about what is happening. Just for clarification, this Genomescope plot correspond to the diploid organism (which I reconfirmed with smudgeplot as well).

Thanks again and let's see how the HiFi data will lead me.

Best, André

deoliveira86 commented 12 months ago

Hello again,

After months of troubleshooting and tears, I identified with the help of Pb bioinformaticians what was causing the issues with the assemblies and it was the employment of the ultra-low input HiFi protocol for sequencing my target species. The genome is relatively large and full of repeats which due the PCR-based nature of the protocol did not cover the whole breadth of the genome, causing the massive fragmentation and suboptimal results.

We have sequenced now using the low-input protocol five different individuals and the results are much better in terms of completeness and contiguity. However, since each low-input SMRT cell yields around 10-13x coverage, the draft assemblies are a bit fragmented and I would like to pool all the libraries together and obtain a more contiguous genomic reference. My attempts so far with the 2, 3, 4 and the 5 pooled libraries have been a failure. The pooled genome is highly fragmented and the reconstructed genome size is way bigger than expected. I am assuming that this might be the heterozigosity among the individuals, however, genomeScope2 plots show me that the heterozigosity did not change much from a single to pooled invididuals. Please see below:

2 libraries pooled (~23x coverage):

image

3 libraries pooled (~36x coverage):

image

4 libraries pooled (~51x coverage):

image

5 libraries pooled (~65x coverage):

image

If the genomeScope plots one can see clearly how the increase in coverage improves the detection of the heterozygous peak, and that the heterozygosity levels remain fairly the same. The genome size estimations and repeat content are quite stable too, which suggest me that the plots are trustworthy. I am puzzled why I cannot get better assemblies using the pooled libraries. Things that I noticed running hifiasm is:

  1. To obtain kmer plots in hifiasm that are akin to the genomescope plots I needed to adjust the kmer size to 19. Using hifiasm with the default kmer size results in massive miscalls of the heterozygous and homozygous peaks.
  2. With the kmer=19 adjusted the first kmer plot looks fairly identical to the genomeScope, however, after the error correction steps the homozygous and heterozygous peaks are wrongly assumed. Is this a normal behaviour?
  3. I am running hifiasm adjusting the genome size, purge-max, and hom-cov based on the genomescope plots and additionally I am adjusting the runs to account for the massive repeat content of my samples with the commands -D 20 -N 200 as you previously suggested.

Do you have any suggestions what can be done further to improve assembly contiguity using the pooled samples? I am running hi-flye at the moment to see if I can get better results, since flye tends to collapse different haplotigs. I am assuming that could improve the assembly if the issue is heterozygosity.

Best and thanks again for the help, André

ishengtsai commented 4 months ago

Hi André @deoliveira86 ,

I have an issue which is very similar to yours. I was wondering if you have a solution?

Best, Jason

deoliveira86 commented 3 months ago

What worked the best was assembling the datasets individually then scaffolding the results later (e.g. ragtag). Good luck.Em 20.05.2024, à(s) 11:01, ishengtsai @.***> escreveu: Hi André @deoliveira86 , I have an issue which is very similar to yours. I was wondering if you have a solution? Best, Jason

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>