Open deoliveira86 opened 1 year ago
How fragmented it is? If it is too fragmented, there might be some other issues.
Hello @chhylp123, thanks for answering so quickly.
The primary hifiasm assemblies I got contain roughly ~13k contigs and a N50 ~500Kb. When I said the assemblies are fragmented I based on the CLR Canu assemblies I have for the same species (~7k contigs and N50 of ~300kb). I was investigating the causes of the fragmentation and I did some sanity checks, first:
So my questions, now based on my explanations (sorry for the really not-so-scientific initial post), are:
bs: I realised as well using the kmer spectra and smudgeplot that one of my organisms are tetraploid. So, I have no high hopes of getting a really contiguous assembly of this one.
Thanks for your help, André
Hello @chhylp123 ,
I have been looking more in-depth into my logs and it seems that many more problems are present with my run. From the first histogram I can see that hifiasm guessed incorrectly the homozygous and heterozygous peaks:
[M::ha_analyze_count] lowest: count[8] = 11679204 [M::ha_analyze_count] highest: count[17] = 19956681 [M::ha_hist_line] 2: ****> 145302474 [M::ha_hist_line] 3: ****> 60436822 [M::ha_hist_line] 4: ****> 34614646 [M::ha_hist_line] 5: ****> 21814476 [M::ha_hist_line] 6: ** 15486457 [M::ha_hist_line] 7: **** 12634884 [M::ha_hist_line] 8: 11679204 [M::ha_hist_line] 9: * 11818489 [M::ha_hist_line] 10: **** 12795496 [M::ha_hist_line] 11: ** 14046199 [M::ha_hist_line] 12: ** 15561287 [M::ha_hist_line] 13: *** 16984008 [M::ha_hist_line] 14: * 18191044 [M::ha_hist_line] 15: **** 19171775 [M::ha_hist_line] 16: ***** 19810701 [M::ha_hist_line] 17: **** 19956681 [M::ha_hist_line] 18: ** 19563846 [M::ha_hist_line] 19: ** 18759511 [M::ha_hist_line] 20: *** 17679955 [M::ha_hist_line] 21: ** 16426665 [M::ha_hist_line] 22: * 15017106 [M::ha_hist_line] 23: ***** 13682918 [M::ha_hist_line] 24: ** 12462893 [M::ha_hist_line] 25: * 11410230 [M::ha_hist_line] 26: * 10623346 [M::ha_hist_line] 27: ** 10061404 [M::ha_hist_line] 28: * 9698887 [M::ha_hist_line] 29: 9457651 [M::ha_hist_line] 30: 9404224 [M::ha_hist_line] 31: * 9403254 [M::ha_hist_line] 32: **** 9509125 [M::ha_hist_line] 33: **** 9544156 [M::ha_hist_line] 34: **** 9585336 [M::ha_hist_line] 35: **** 9606608 [M::ha_hist_line] 36: *** 9710124 [M::ha_hist_line] 37: 9735350 [M::ha_hist_line] 38: 9683761 [M::ha_hist_line] 39: **** 9638125 [M::ha_hist_line] 40: **** 9510340 [M::ha_hist_line] 41: 9421320 [M::ha_hist_line] 42: 9327132 [M::ha_hist_line] 43: ** 9227396 [M::ha_hist_line] 44: ** 9197687 [M::ha_hist_line] 45: ** 9118461 [M::ha_hist_line] 46: 9032115 [M::ha_hist_line] 47: 8962862 [M::ha_hist_line] 48: * 9044835 [M::ha_hist_line] 49: ** 9089925 [M::ha_hist_line] 50: ** 9147819 [M::ha_hist_line] 51: ** 9197964 [M::ha_hist_line] 52: *** 9283452 [M::ha_hist_line] 53: 9393173 [M::ha_hist_line] 54: 9467882 [M::ha_hist_line] 55: **** 9593509 [M::ha_hist_line] 56: **** 9607272 [M::ha_hist_line] 57: 9729602 [M::ha_hist_line] 58: 9815127 [M::ha_hist_line] 59: * 9815411 [M::ha_hist_line] 60: ** 9891586 [M::ha_hist_line] 61: ** 9961505 [M::ha_hist_line] 62: ** 10012431 [M::ha_hist_line] 63: ** 10041774 [M::ha_hist_line] 64: ** 9987964 [M::ha_hist_line] 65: ** 9995814 [M::ha_hist_line] 66: ** 9915976 [M::ha_hist_line] 67: * 9798204 [M::ha_hist_line] 68: * 9704423 [M::ha_hist_line] 69: ** 9557415 [M::ha_hist_line] 70: * 9457241 [M::ha_hist_line] 71: * 9304365 [M::ha_hist_line] 72: ** 9127224 [M::ha_hist_line] 73: **** 8875247 [M::ha_hist_line] 74: * 8559737 [M::ha_hist_line] 75: * 8264445 [M::ha_hist_line] 76: **** 7977267 [M::ha_hist_line] 77: ** 7678267 [M::ha_hist_line] 78: * 7342879 [M::ha_hist_line] 79: ** 7006039 [M::ha_hist_line] 80: 6635521 [M::ha_hist_line] 81: **** 6286913 [M::ha_hist_line] 82: ** 5937857 [M::ha_hist_line] 83: **** 5546121 [M::ha_hist_line] 84: ** 5198401 [M::ha_hist_line] 85: **** 4872826 [M::ha_hist_line] 86: ** 4534228 [M::ha_hist_line] 87: 4194903 [M::ha_hist_line] 88: **** 3905407 [M::ha_hist_line] 89: ** 3601831 [M::ha_hist_line] 90: ** 3330844 [M::ha_hist_line] 91: 3049526 [M::ha_hist_line] 92: ** 2759729 [M::ha_hist_line] 93: * 2553021 [M::ha_hist_line] 94: ** 2340541 [M::ha_hist_line] 95: * 2145361 [M::ha_hist_line] 96: ** 1958096 [M::ha_hist_line] 97: * 1781076 [M::ha_hist_line] 98: ** 1615652 [M::ha_hist_line] 99: * 1486306 [M::ha_hist_line] 100: * 1349445 [M::ha_hist_line] 101: **** 1233669 [M::ha_hist_line] 102: ** 1126847 [M::ha_hist_line] 103: * 1032527 [M::ha_hist_line] 104: * 952525 [M::ha_hist_line] 105: ** 887802 [M::ha_hist_line] 106: 829874 [M::ha_hist_line] 107: 790556 [M::ha_hist_line] 108: *** 739644 [M::ha_hist_line] 109: 692297 [M::ha_hist_line] 110: 644661 [M::ha_hist_line] 111: 611743 [M::ha_hist_line] 112: 583629 [M::ha_hist_line] 113: 559103 [M::ha_hist_line] 114: 527284 [M::ha_hist_line] 115: 504927 [M::ha_hist_line] 116: 488361 [M::ha_hist_line] 117: 467597 [M::ha_hist_line] 118: 450704 [M::ha_hist_line] 119: 436627 [M::ha_hist_line] 120: 425041 [M::ha_hist_line] 121: 413458 [M::ha_hist_line] 122: 404118 [M::ha_hist_line] 123: 389630 [M::ha_hist_line] 124: 383743 [M::ha_hist_line] 125: 376041 [M::ha_hist_line] 126: 365101 [M::ha_hist_line] 127: 358619 [M::ha_hist_line] 128: 347929 [M::ha_hist_line] 129: 336819 [M::ha_hist_line] 130: 333450 [M::ha_hist_line] 131: 325973 [M::ha_hist_line] 132: 319702 [M::ha_hist_line] 133: 313799 [M::ha_hist_line] 134: 309566 [M::ha_hist_line] 135: 304119 [M::ha_hist_line] 136: 296935 [M::ha_hist_line] 137: 288307 [M::ha_hist_line] 138: 284828 [M::ha_hist_line] 139: 280656 [M::ha_hist_line] 140: 277848 [M::ha_hist_line] 141: 269233 [M::ha_hist_line] 142: 263987 [M::ha_hist_line] 143: 260061 [M::ha_hist_line] 144: 251999 [M::ha_hist_line] 145: 250445 [M::ha_hist_line] 146: 246546 [M::ha_hist_line] 147: 241293 [M::ha_hist_line] 148: 240424 [M::ha_hist_line] 149: 233721 [M::ha_hist_line] 150: 229937 [M::ha_hist_line] 151: 225499 [M::ha_hist_line] 152: 223910 [M::ha_hist_line] 153: 215218 [M::ha_hist_line] 154: 214814 [M::ha_hist_line] 155: 212414 [M::ha_hist_line] 156: 207643 [M::ha_hist_line] 157: 206194 [M::ha_hist_line] 158: 201213 [M::ha_hist_line] 159: 198221 [M::ha_hist_line] 160: 191964 [M::ha_hist_line] 161: 190480 [M::ha_hist_line] 162: 187920 [M::ha_hist_line] 163: 183118 [M::ha_hist_line] 164: 181969 [M::ha_hist_line] 165: 179575 [M::ha_hist_line] 166: 176828 [M::ha_hist_line] 167: 176792 [M::ha_hist_line] 168: 172308 [M::ha_hist_line] 169: 167354 [M::ha_hist_line] 170: 164825 [M::ha_hist_line] 171: 161483 [M::ha_hist_line] 172: 162846 [M::ha_hist_line] 173: 159886 [M::ha_hist_line] 174: 157258 [M::ha_hist_line] 175: 153493 [M::ha_hist_line] 176: 148997 [M::ha_hist_line] 177: 145954 [M::ha_hist_line] 178: 145244 [M::ha_hist_line] 179: 144501 [M::ha_hist_line] 180: 140359 [M::ha_hist_line] 181: 138636 [M::ha_hist_line] 182: 140018 [M::ha_hist_line] 183: 135632 [M::ha_hist_line] 184: 136020 [M::ha_hist_line] 185: 132477 [M::ha_hist_line] 186: 135719 [M::ha_hist_line] 187: 133683 [M::ha_hist_line] 188: 130723 [M::ha_hist_line] 189: 128740 [M::ha_hist_line] 190: 126420 [M::ha_hist_line] 191: 121487 [M::ha_hist_line] 192: 122507 [M::ha_hist_line] 193: 119770 [M::ha_hist_line] 194: 115733 [M::ha_hist_line] 195: 115267 [M::ha_hist_line] 196: 115383 [M::ha_hist_line] 197: 115888 [M::ha_hist_line] 198: 114891 [M::ha_hist_line] 199: 114816 [M::ha_hist_line] 200: 113248 [M::ha_hist_line] 201: 108648 [M::ha_hist_line] 202: 106207 [M::ha_hist_line] 203: 107001 [M::ha_hist_line] 204: 104715 [M::ha_hist_line] 205: 104146 [M::ha_hist_line] 206: * 102514 [M::ha_hist_line] rest: ****> 20508566 [M::ha_analyze_count] left: none [M::ha_analyze_count] right: none [M::ha_ft_gen] peak_hom: 17; peak_het: -1
Which is odd, because my GenomeScope2 plots are pretty clean cut (at least to me) (see below):
From the plot I can clearly see that the homozygous and heterozygous peaks have ~74x and ~30x coverage, respectively. Is there a reason hifiasm is guessing wrongly these values? I can see that this was extensively discussing in the issues #55, #78, #156, however, I still did not find a solution for hifiasm to identify correctly the two peaks, either by forcing it (with --hom-cov) or leaving by the defaults.
Best, André
If it is tetraploid, then peak_hom = 17
might be right since it is roughly equal to the coverage per haplotype. For your sample, increasing -D
to 20 or -N
to 400 might be helpful. But in any way, the assembly should not be such fragmented unless it is a huge genome. Could you please also have a try with ultra-HiFi reads if the coverage is enough? To make sure it is the issue of hifiasm or data, another solution is to run HiCanu (instead of Canu). If HiCanu also does not work well, I guess it might be caused by the data quality.
Hello @chhylp123,
Thanks for the input. I will run HiCanu with the ultra-hifi and play a little bit with hifiasm following your parameter suggestions. I will close this thread and re-opened when I have more idea about what is happening. Just for clarification, this Genomescope plot correspond to the diploid organism (which I reconfirmed with smudgeplot as well).
Thanks again and let's see how the HiFi data will lead me.
Best, André
Hello again,
After months of troubleshooting and tears, I identified with the help of Pb bioinformaticians what was causing the issues with the assemblies and it was the employment of the ultra-low input HiFi protocol for sequencing my target species. The genome is relatively large and full of repeats which due the PCR-based nature of the protocol did not cover the whole breadth of the genome, causing the massive fragmentation and suboptimal results.
We have sequenced now using the low-input protocol five different individuals and the results are much better in terms of completeness and contiguity. However, since each low-input SMRT cell yields around 10-13x coverage, the draft assemblies are a bit fragmented and I would like to pool all the libraries together and obtain a more contiguous genomic reference. My attempts so far with the 2, 3, 4 and the 5 pooled libraries have been a failure. The pooled genome is highly fragmented and the reconstructed genome size is way bigger than expected. I am assuming that this might be the heterozigosity among the individuals, however, genomeScope2 plots show me that the heterozigosity did not change much from a single to pooled invididuals. Please see below:
2 libraries pooled (~23x coverage):
3 libraries pooled (~36x coverage):
4 libraries pooled (~51x coverage):
5 libraries pooled (~65x coverage):
If the genomeScope plots one can see clearly how the increase in coverage improves the detection of the heterozygous peak, and that the heterozygosity levels remain fairly the same. The genome size estimations and repeat content are quite stable too, which suggest me that the plots are trustworthy. I am puzzled why I cannot get better assemblies using the pooled libraries. Things that I noticed running hifiasm is:
Do you have any suggestions what can be done further to improve assembly contiguity using the pooled samples? I am running hi-flye at the moment to see if I can get better results, since flye tends to collapse different haplotigs. I am assuming that could improve the assembly if the issue is heterozygosity.
Best and thanks again for the help, André
Hi André @deoliveira86 ,
I have an issue which is very similar to yours. I was wondering if you have a solution?
Best, Jason
What worked the best was assembling the datasets individually then scaffolding the results later (e.g. ragtag). Good luck.Em 20.05.2024, à(s) 11:01, ishengtsai @.***> escreveu: Hi André @deoliveira86 , I have an issue which is very similar to yours. I was wondering if you have a solution? Best, Jason
—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>
Dear all,
I have been running hifiasm on two very repetitive genomes (>70% repeat content) using relatively deep HiFi libraries (60x-80x). The results are not that great and the genome assemblies are too fragmented (even considering the non-optimal insert size of the PacBio libraries generated - N50 ~ 10Kb). I check the FAQ and it looks like I can play around with some parameters, mainly -D or -N, to improve my assemblies. In the FAQ it is stated:
However, it is not specified what how high should I raise the default values. Considering that I am losing many true high-frequency k-mers due the high repeat content of my genome, and that might be one of the causes of my genome fragmentation, I am thinking to set unrealistic high numbers for -D and -N (e.g., 1000000) to overcome this issue. Is this advisable? Would another set of parameters be also useful to assemble highly repetitive genomes?
Best and I am looking forward for some guidance, André