Optimised parameters for highly repetitive genomes

Dear all,

I have been running hifiasm on two very repetitive genomes (>70% repeat content) using relatively deep HiFi libraries (60x-80x). The results are not great: the assemblies are too fragmented, even considering the non-optimal insert size of the PacBio libraries (N50 ~10 kb). I checked the FAQ, and it looks like I can adjust some parameters, mainly -D or -N, to improve my assemblies. The FAQ states:

Raising -D or -N may improve the resolution of repetitive regions but takes longer time. These two options affect all types of assemblies and usually do not have a negative impact on the assembly quality.

However, it does not specify how high I should raise the default values. Considering that I am losing many true high-frequency k-mers due to the high repeat content of my genomes, which might be one of the causes of the fragmentation, I am thinking of setting unrealistically high values for -D and -N (e.g., 1000000) to overcome this issue. Is this advisable? Would another set of parameters also be useful for assembling highly repetitive genomes?

Best, and I am looking forward to some guidance, André

How fragmented is it? If it is too fragmented, there might be some other issues.

Hello @chhylp123, thanks for answering so quickly.

The primary hifiasm assemblies I got contain roughly 13k contigs with an N50 of ~500 kb. When I said the assemblies are fragmented, I was comparing them with the CLR Canu assemblies I have for the same species (~7k contigs and an N50 of ~300 kb). I investigated the causes of the fragmentation and did some sanity checks first:

Checking for contaminants in the HiFi reads: I mapped the HiFi libraries against the CLR assemblies I have, and the mapping rates were above 94%, indicating that there is no contamination issue.
Estimating the genome sizes from the HiFi data: I used the complete and the mapped HiFi reads to estimate the genome sizes of my organisms, as well as the repeat content and heterozygosity, and the results largely agree with my earlier estimates. I don't see any problems with the quality of the HiFi reads. As a matter of fact, I generated ultra-HiFi reads by filtering the HiFi libraries and keeping only reads with QV>30. Even with this aggressive filtering I still have the minimum coverage required for hifiasm to work (13x per haplotype).
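For readers who want to reproduce the QV>30 filtering step, here is a minimal sketch of how a read-level QV can be derived from per-base Phred scores. It assumes FASTQ input with Phred+33 encoding and averages the per-base error probabilities; this is an assumption, since the thread does not say how the ultra-HiFi reads were selected (PacBio tools typically use the predicted read accuracy stored with each read instead). The function names are illustrative.

```python
import math


def read_qv(quality_string, phred_offset=33):
    """Return the read-level QV implied by a FASTQ quality string.

    Each character is converted to an error probability; the mean error
    probability is then converted back to the Phred scale.
    """
    if not quality_string:
        raise ValueError("empty quality string")
    error_probs = [10 ** (-(ord(ch) - phred_offset) / 10) for ch in quality_string]
    mean_error = sum(error_probs) / len(error_probs)
    return -10 * math.log10(mean_error)


def passes_qv(quality_string, min_qv=30.0):
    """True if the read-level QV reaches the chosen threshold."""
    return read_qv(quality_string) >= min_qv


if __name__ == "__main__":
    # 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
    print(passes_qv("I" * 20))  # a uniformly Q40 read passes
    print(passes_qv("5" * 20))  # a uniformly Q20 read does not
```

Averaging error probabilities rather than Phred scores keeps a few low-quality bases from being hidden by many high-quality ones.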

So my questions, now based on my explanations (sorry for the really not-so-scientific initial post), are:

Do you think adjusting the -D and -N parameters would improve the contiguity of my assembly?
Do you recommend running independent assemblies with the HiFi (QV20) and ultra-HiFi (QV30) reads to check whether the final assemblies differ? There are three rounds of error correction before assembly, so I am wondering if it is actually worth trying to assemble the ultra-HiFi reads.

PS: I also realised, using the k-mer spectra and smudgeplot, that one of my organisms is tetraploid. So I do not have high hopes of getting a really contiguous assembly of this one.

Thanks for your help, André

Hello @chhylp123 ,

I have been looking at my logs in more depth, and it seems that there are more problems with my run. From the first histogram I can see that hifiasm incorrectly guessed the homozygous and heterozygous peaks:

[M::ha_analyze_count] lowest: count[8] = 11679204
[M::ha_analyze_count] highest: count[17] = 19956681
[M::ha_hist_line] entries (occurrence: number of k-mers; histogram bars omitted):
2: 145302474  3: 60436822  4: 34614646  5: 21814476  6: 15486457  7: 12634884  8: 11679204  9: 11818489
10: 12795496  11: 14046199  12: 15561287  13: 16984008  14: 18191044  15: 19171775  16: 19810701  17: 19956681
18: 19563846  19: 18759511  20: 17679955  21: 16426665  22: 15017106  23: 13682918  24: 12462893  25: 11410230
26: 10623346  27: 10061404  28: 9698887  29: 9457651  30: 9404224  31: 9403254  32: 9509125  33: 9544156
34: 9585336  35: 9606608  36: 9710124  37: 9735350  38: 9683761  39: 9638125  40: 9510340  41: 9421320
42: 9327132  43: 9227396  44: 9197687  45: 9118461  46: 9032115  47: 8962862  48: 9044835  49: 9089925
50: 9147819  51: 9197964  52: 9283452  53: 9393173  54: 9467882  55: 9593509  56: 9607272  57: 9729602
58: 9815127  59: 9815411  60: 9891586  61: 9961505  62: 10012431  63: 10041774  64: 9987964  65: 9995814
66: 9915976  67: 9798204  68: 9704423  69: 9557415  70: 9457241  71: 9304365  72: 9127224  73: 8875247
74: 8559737  75: 8264445  76: 7977267  77: 7678267  78: 7342879  79: 7006039  80: 6635521  81: 6286913
82: 5937857  83: 5546121  84: 5198401  85: 4872826  86: 4534228  87: 4194903  88: 3905407  89: 3601831
90: 3330844  91: 3049526  92: 2759729  93: 2553021  94: 2340541  95: 2145361  96: 1958096  97: 1781076
98: 1615652  99: 1486306  100: 1349445  101: 1233669  102: 1126847  103: 1032527  104: 952525  105: 887802
106: 829874  107: 790556  108: 739644  109: 692297  110: 644661  111: 611743  112: 583629  113: 559103
114: 527284  115: 504927  116: 488361  117: 467597  118: 450704  119: 436627  120: 425041  121: 413458
122: 404118  123: 389630  124: 383743  125: 376041  126: 365101  127: 358619  128: 347929  129: 336819
130: 333450  131: 325973  132: 319702  133: 313799  134: 309566  135: 304119  136: 296935  137: 288307
138: 284828  139: 280656  140: 277848  141: 269233  142: 263987  143: 260061  144: 251999  145: 250445
146: 246546  147: 241293  148: 240424  149: 233721  150: 229937  151: 225499  152: 223910  153: 215218
154: 214814  155: 212414  156: 207643  157: 206194  158: 201213  159: 198221  160: 191964  161: 190480
162: 187920  163: 183118  164: 181969  165: 179575  166: 176828  167: 176792  168: 172308  169: 167354
170: 164825  171: 161483  172: 162846  173: 159886  174: 157258  175: 153493  176: 148997  177: 145954
178: 145244  179: 144501  180: 140359  181: 138636  182: 140018  183: 135632  184: 136020  185: 132477
186: 135719  187: 133683  188: 130723  189: 128740  190: 126420  191: 121487  192: 122507  193: 119770
194: 115733  195: 115267  196: 115383  197: 115888  198: 114891  199: 114816  200: 113248  201: 108648
202: 106207  203: 107001  204: 104715  205: 104146  206: 102514  rest: 20508566
[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: none
[M::ha_ft_gen] peak_hom: 17; peak_het: -1

Which is odd, because my GenomeScope2 plots are pretty clean cut (at least to me) (see below):

[Figure: GenomeScope2 linear plot]

From the plot I can clearly see that the homozygous and heterozygous peaks have ~74x and ~30x coverage, respectively. Is there a reason hifiasm is guessing these values wrongly? I can see that this was discussed extensively in issues #55, #78 and #156; however, I still have not found a way to make hifiasm identify the two peaks correctly, either by forcing it (with --hom-cov) or by leaving the defaults.
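To make the shape of the logged histogram easier to inspect, here is a minimal sketch that scans the counts for local maxima. It is a naive sliding-window scan written for illustration and is not hifiasm's peak-detection logic; the window radius of 5 is an arbitrary assumption. The counts are those reported in the log above for k-mer occurrences 6 to 80.

```python
# Counts for k-mer occurrences 6..80, copied from the hifiasm log above.
COUNTS_FROM_6 = [
    15486457, 12634884, 11679204, 11818489, 12795496,  # 6-10
    14046199, 15561287, 16984008, 18191044, 19171775,  # 11-15
    19810701, 19956681, 19563846, 18759511, 17679955,  # 16-20
    16426665, 15017106, 13682918, 12462893, 11410230,  # 21-25
    10623346, 10061404, 9698887, 9457651, 9404224,     # 26-30
    9403254, 9509125, 9544156, 9585336, 9606608,       # 31-35
    9710124, 9735350, 9683761, 9638125, 9510340,       # 36-40
    9421320, 9327132, 9227396, 9197687, 9118461,       # 41-45
    9032115, 8962862, 9044835, 9089925, 9147819,       # 46-50
    9197964, 9283452, 9393173, 9467882, 9593509,       # 51-55
    9607272, 9729602, 9815127, 9815411, 9891586,       # 56-60
    9961505, 10012431, 10041774, 9987964, 9995814,     # 61-65
    9915976, 9798204, 9704423, 9557415, 9457241,       # 66-70
    9304365, 9127224, 8875247, 8559737, 8264445,       # 71-75
    7977267, 7678267, 7342879, 7006039, 6635521,       # 76-80
]


def local_maxima(counts, first_occurrence, radius=5):
    """Return occurrences whose count is the unique maximum of a full window.

    Positions closer than `radius` to either end are skipped so that every
    candidate is compared against a complete window.
    """
    peaks = []
    for i in range(radius, len(counts) - radius):
        window = counts[i - radius:i + radius + 1]
        if counts[i] == max(window) and window.count(counts[i]) == 1:
            peaks.append(first_occurrence + i)
    return peaks


if __name__ == "__main__":
    print(local_maxima(COUNTS_FROM_6, first_occurrence=6))  # -> [17, 37, 63]
```

On these counts the scan reports maxima at 17, 37 and 63; the first coincides with the reported peak_hom of 17, while the other two are much shallower.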

Best, André

If it is tetraploid, then peak_hom = 17 might be right, since it is roughly equal to the coverage per haplotype. For your sample, increasing -D to 20 or -N to 400 might be helpful. In any case, the assembly should not be so fragmented unless it is a huge genome. Could you please also try the ultra-HiFi reads if the coverage is sufficient? To determine whether the issue lies with hifiasm or with the data, another option is to run HiCanu (instead of Canu). If HiCanu also does not work well, I guess the problem might be the data quality.
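The per-haplotype reasoning can be checked with simple arithmetic. The sketch below assumes that reads are spread evenly across haplotypes and uses the 60x-80x total coverage quoted in the first post; the function name is illustrative.

```python
def per_haplotype_coverage(total_coverage, ploidy):
    """Expected coverage of a single haplotype, assuming even sampling."""
    if ploidy < 1:
        raise ValueError("ploidy must be at least 1")
    return total_coverage / ploidy


if __name__ == "__main__":
    for total in (60, 80):
        # A tetraploid genome carries four haplotypes.
        print(total, per_haplotype_coverage(total, ploidy=4))
```

For a tetraploid sample this gives 15x to 20x per haplotype, a range that contains the reported peak_hom of 17.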

Hello @chhylp123,

Thanks for the input. I will run HiCanu with the ultra-HiFi reads and experiment a bit with hifiasm following your parameter suggestions. I will close this thread and reopen it when I have a better idea of what is happening. Just for clarification, this GenomeScope plot corresponds to the diploid organism (which I reconfirmed with smudgeplot as well).

Thanks again and let's see how the HiFi data will lead me.

Best, André

Hello again,

After months of troubleshooting and tears, I identified, with the help of PacBio bioinformaticians, what was causing the issues with the assemblies: the use of the ultra-low input HiFi protocol to sequence my target species. The genome is relatively large and full of repeats, and because of the PCR-based nature of the protocol the reads did not cover the whole breadth of the genome, causing the massive fragmentation and suboptimal results.

We have now sequenced five different individuals using the low-input protocol, and the results are much better in terms of completeness and contiguity. However, since each low-input SMRT cell yields around 10-13x coverage, the draft assemblies are a bit fragmented, and I would like to pool all the libraries together to obtain a more contiguous genomic reference. My attempts so far with 2, 3, 4 and 5 pooled libraries have failed: the pooled assembly is highly fragmented and the reconstructed genome size is far bigger than expected. I assume this might be due to heterozygosity among the individuals; however, the GenomeScope2 plots show that the heterozygosity did not change much from single to pooled individuals. Please see below:

2 libraries pooled (~23x coverage):

3 libraries pooled (~36x coverage):

4 libraries pooled (~51x coverage):

5 libraries pooled (~65x coverage):

In the GenomeScope plots one can clearly see how the increase in coverage improves the detection of the heterozygous peak, and that the heterozygosity levels remain fairly constant. The genome size estimates and repeat content are quite stable too, which suggests to me that the plots are trustworthy. I am puzzled as to why I cannot get better assemblies using the pooled libraries. Things I noticed while running hifiasm:

To obtain k-mer plots in hifiasm that resemble the GenomeScope plots, I needed to set the k-mer size to 19. Using hifiasm with the default k-mer size results in massive miscalls of the heterozygous and homozygous peaks.
With the k-mer size set to 19, the first k-mer plot looks nearly identical to the GenomeScope one; however, after the error-correction steps the homozygous and heterozygous peaks are called incorrectly. Is this normal behaviour?
I am running hifiasm with the genome size, purge-max and hom-cov set according to the GenomeScope plots, and additionally I am adjusting the runs to account for the massive repeat content of my samples with the options -D 20 -N 200, as you previously suggested.
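For anyone wanting to reproduce this kind of run, here is a minimal sketch that assembles the command line from the options mentioned above (-k, -D, -N, --hg-size, --hom-cov, --purge-max). The file name, output prefix, thread count and the values passed for genome size, hom-cov and purge-max are hypothetical placeholders, not the values used in this thread; the helper itself is illustrative and only builds the argument list.

```python
import shlex


def build_hifiasm_command(reads, prefix, threads=32, kmer=19,
                          drop_factor=20, max_overlaps=200,
                          hg_size=None, hom_cov=None, purge_max=None):
    """Build a hifiasm argument list; optional settings are added only if given."""
    cmd = [
        "hifiasm",
        "-o", prefix,
        "-t", str(threads),
        "-k", str(kmer),           # k-mer length (19 as described above)
        "-D", str(drop_factor),    # -D 20 as suggested earlier in the thread
        "-N", str(max_overlaps),   # -N 200 as used in these runs
    ]
    if hg_size is not None:
        cmd += ["--hg-size", str(hg_size)]
    if hom_cov is not None:
        cmd += ["--hom-cov", str(hom_cov)]
    if purge_max is not None:
        cmd += ["--purge-max", str(purge_max)]
    cmd += list(reads)
    return cmd


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    command = build_hifiasm_command(
        ["pooled.hifi.fastq.gz"], "pooled_asm",
        hg_size="1g", hom_cov=65, purge_max=100,
    )
    print(shlex.join(command))
    # To actually run it: subprocess.run(command, check=True)
```

Keeping the command as a list rather than a string avoids shell-quoting problems when file names contain spaces.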

Do you have any suggestions for what else can be done to improve assembly contiguity using the pooled samples? I am running hi-flye at the moment to see if I can get better results, since flye tends to collapse different haplotigs. I assume that could improve the assembly if the issue is heterozygosity.

Best and thanks again for the help, André

Hi André @deoliveira86 ,

I have an issue which is very similar to yours. I was wondering if you have a solution?

Best, Jason

What worked the best was assembling the datasets individually then scaffolding the results later (e.g. ragtag). Good luck.

chhylp123 / hifiasm

Optimised parameters for highly repetitive genomes #385