Closed yywyaoyaowu closed 3 years ago
By purge_dups, do you mean: https://github.com/dfguan/purge_dups ?
Yes, we use this to purge.
I personally think hifiasm's internal purge_dups is more accurate than external purge_dups, though I may be wrong. However, in regions with a high heterozygosity rate, hifiasm itself may not purge enough. So I recommend trying all the combinations: (hifiasm -l0 + external purge_dups), (hifiasm -l1 + external purge_dups) and (hifiasm -l2 + external purge_dups).
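As a rough sketch of the first half of those combinations, the snippet below only builds the three hifiasm command lines, one per purge level. The read file name, output prefix and thread count are placeholders, not values from this thread; `-o`, `-t` and `-l` are hifiasm options. Running external purge_dups on each resulting primary assembly is a separate step and is not shown.

```python
import shlex


def hifiasm_commands(reads, prefix="asm", threads=16, levels=(0, 1, 2)):
    """Build one hifiasm command line per purge level (-l).

    `reads`, `prefix` and `threads` are illustrative placeholders.
    """
    commands = []
    for level in levels:
        # Give every purge level its own output prefix so runs do not collide.
        out_prefix = f"{prefix}.l{level}"
        commands.append(
            ["hifiasm", "-o", out_prefix, "-t", str(threads), "-l", str(level), reads]
        )
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them; inspect before executing.
    for cmd in hifiasm_commands("reads.hifi.fq.gz"):
        print(shlex.join(cmd))
```

Keeping a separate prefix per level makes it easy to compare the primary contigs of each run before and after external purging.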
In my assembly (nearly 2% heterozygosity), the default parameters gave the same genome size as previously estimated (6.3G), with a more fragmented alternative assembly of 5.2G, which I thought was a perfect result. However, I later found that the duplicated BUSCOs (4.1.4) of the primary contigs were a little high (8.4%). So I used purge_dups to purge the primary assembly and got a purged.fa of 5.6G, and the duplicated BUSCOs decreased to 3.4%. The contig N50 rose from 52Mb to 59Mb. The contig number decreased from 2917 to 283.
I don't know whether purge_dups over-purged or hifiasm purged insufficiently. However, the default purge level is already the most aggressive.
I'm wondering if hifiasm is suitable for assembling such a high heterozygous genome. Maybe Falcon-unzip will be better?
The sequencing depth is around 35x.
I'm wondering if hifiasm is suitable for assembling such a high heterozygous genome. Maybe Falcon-unzip will be better?
If Falcon-unzip has not been redesigned for HiFi data, we shouldn't use it. Falcon-unzip may produce a smaller assembly, but it is very likely to collapse segdups/different haplotypes. As for BUSCO, it just gives an overall result, which is not that accurate.
Thanks Haoyu. It's amazing that hifiasm can assemble a huge, complex genome so fast and with such high quality. I'm just a bit worried that it will not purge enough in highly heterozygous regions. I have tried the --high-het parameter but the result is not as good as with the default parameters. And I'm not sure whether to use purge_dups on the primary assembly.
I'm not sure, either. If the heterozygosity rate is even, 2% heterozygosity is not too high for hifiasm's purge_dups. However, we have observed cases where the heterozygosity rate is very uneven, so that some regions have a much higher heterozygosity rate. In this case, even if purge_dups can remove duplicates, it may introduce other problems like misassemblies or collapsed segdups, and the primary assembly is not that representative. The ideal solution is to generate two fully phased haplotypes for diploid samples.
The size of the alternative assembly from hifiasm is really reasonable in my genome, likely thanks to the high heterozygosity. I've also got the HiC data. Is it feasible to generate fully phased genome combining hifiasm output and the HiC data, maybe in the next version of hifiasm with the feature to include the HiC data?
Yeah, we're testing it, and it should be released soon.
That's awesome! Very much looking forward to it !!!
Alternatively, perhaps you may try pstools for using Hi-C on Hifi graph. This tool is tested properly for humans. (I believe Haoyu is aware of this). Please see https://github.com/shilpagarg/DipAsm/issues/16 for guidelines.
Yeah, you can also try Shilpa's pstools. Different tools may perform differently.
Using pstools or the next version of hifiasm, is it necessary to use hi-c scaffolding tools like 3d-dna anymore?
The pstools method does phasing-aware scaffolding, no third party tool required. Additionally, it is also helpful for evaluation of phased sequences using hi-c. For example, in case any assembler takes random walk through the graph, Hi-C based evaluation in pstools could be helpful. Let me know if you have any questions.
@xinghua1001 Hifiasm v0.14 supports Hi-C phased mode, but it is not stable. It is still under development. Please let me know if you have any problems when using it.
Hi @xinghua1001,
Do you have any updates on your assembly? I would be curious to hear. My plant genome was estimated to have ~ 1.5% heterozygosity and 0.85G in size by genomescope. The genome size estimation is reasonable as a related species with a great reference genome has a similar genome size.
With hifiasm, I assembled 1.1G primary contigs with N50 of 4Mb. Alternative contigs are about 0.82G. I tried IPA as an alternative, and this assembler gave 0.92G contigs with N50 of 1.3Mb. Both assemblies have similar BUSCO stats. But the hifiasm assembly has 2% more single copy x duplicated genes. 0.2G differences in the genome assembly seemed a lot to me. I wonder how you eventually handled your data.
@skyungyong Just curious: what's the size of IPA's alternative contigs?
the hifiasm assembly has 2% more single copy x duplicated genes.
What's the BUSCO completeness (single+duplicated)? If it is 56% vs 58%, a 2% difference could just be statistical fluctuation. If it is 96% vs 98%, 2% is a lot. PS: also, what is percent duplicated genes according to BUSCO? If purging is insufficient, you usually see a large %duplicated.
Sorry, I meant that the proportion of complete single-copy genes is similar in both assemblies, but duplicated ones appear more frequently in the hifiasm output. Here are the statistics:
IPA assembly:
final.p_ctg.fasta 880M
final.a_ctg.fasta 713M
Quast for final.p_ctg.fasta:
Num. contigs 1196
Largest contig 6450199
Total length 922015288
GC (%) 34.61
N50 1369395
N75 765551
L50 206
L75 433
|Results from dataset solanales_odb10 |
--------------------------------------------------
|C:96.9%[S:94.4%,D:2.5%],F:0.6%,M:2.5%,n:5950 |
|5765 Complete BUSCOs (C) |
|5619 Complete and single-copy BUSCOs (S) |
|146 Complete and duplicated BUSCOs (D) |
|33 Fragmented BUSCOs (F) |
|152 Missing BUSCOs (M) |
|5950 Total BUSCO groups searched |
--------------------------------------------------
hifiasm assembly:
SH1353.asm.p_ctg.fa 1.1G
SH1353.asm.a_ctg.fa 787M
Quast for SH1353.asm.p_ctg.fa:
Num. contigs 2564
Largest contig 15227863
Total length 1108360991
GC (%) 35.53
N50 4129401
N75 1916824
L50 78
L75 178
|Results from dataset solanales_odb10 |
--------------------------------------------------
|C:97.3%[S:92.8%,D:4.5%],F:0.3%,M:2.4%,n:5950 |
|5790 Complete BUSCOs (C) |
|5524 Complete and single-copy BUSCOs (S) |
|266 Complete and duplicated BUSCOs (D) |
|19 Fragmented BUSCOs (F) |
|141 Missing BUSCOs (M) |
|5950 Total BUSCO groups searched |
--------------------------------------------------
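For comparing the two summaries above, a small parser for the BUSCO one-line score string can be handy. This is only a sketch: the regular expression assumes the `C:..[S:..,D:..],F:..,M:..,n:..` layout shown in this thread, and the function name is made up for illustration.

```python
import re

# Matches the one-line BUSCO score string as it appears in this thread.
BUSCO_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(line):
    """Return the BUSCO percentages (C, S, D, F, M) and group count n as a dict."""
    match = BUSCO_RE.search(line)
    if match is None:
        raise ValueError(f"not a BUSCO score line: {line!r}")
    fields = match.groupdict()
    scores = {key: float(fields[key]) for key in ("C", "S", "D", "F", "M")}
    scores["n"] = int(fields["n"])
    return scores


if __name__ == "__main__":
    ipa = parse_busco("C:96.9%[S:94.4%,D:2.5%],F:0.6%,M:2.5%,n:5950")
    hifiasm = parse_busco("C:97.3%[S:92.8%,D:4.5%],F:0.3%,M:2.4%,n:5950")
    # Completeness is nearly identical; the gap is in the duplicated fraction.
    print("completeness difference:", round(hifiasm["C"] - ipa["C"], 1))
    print("duplicated difference:", round(hifiasm["D"] - ipa["D"], 1))
```

Looking at D separately from C makes it clear that the two assemblies differ mainly in duplicated BUSCOs rather than in overall completeness.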
It seems that hifiasm's dup purging may not be aggressive enough. You may try the standalone purge_dups in this case and see what the result looks like. If you have Hi-C data, it is also worth trying hifiasm's new Hi-C mode.
My plant genome was estimated to have ~ 1.5% heterozygosity and 0.85G in size by genomescope
In our experiments with human, plants and animals, genomescope always gives a smaller genome size. For example, genomescope estimates the haploid genome size of human at about 2.7Gb, while the real size is 3.1Gb.
0.2G differences in the genome assembly seemed a lot to me
The total size of the hifiasm assembly (primary contigs + alternative contigs) is 1.2 times larger than that of IPA. It is hard for an assembler to generate that many additional wrong contigs. In most cases, a larger assembly is better than a smaller one (maybe I'm wrong). You could check what these extra regions are. If they are segdups/repeats, I guess the assembly is right. You can also assemble with HiCanu and use its assembly for double checking.
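The 1.2x figure can be checked from the file sizes posted earlier in the thread. The short calculation below is only a back-of-the-envelope sketch; it assumes that "1.1G" can be treated as roughly 1100M, which is an approximation of a rounded file size.

```python
def total_size_mb(primary_mb, alternate_mb):
    """Sum primary and alternative contig sizes (in megabytes, as posted)."""
    return primary_mb + alternate_mb


# Sizes as reported in the thread; 1.1G is approximated as 1100M (assumption).
hifiasm_total = total_size_mb(1100, 787)
ipa_total = total_size_mb(880, 713)

ratio = hifiasm_total / ipa_total

if __name__ == "__main__":
    print(f"hifiasm total: {hifiasm_total}M, IPA total: {ipa_total}M")
    print(f"ratio: {ratio:.2f}")  # roughly 1.2 when rounded to one decimal
```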
Both assemblies have similar BUSCO stats
For HiFi assemblies, BUSCO is underpowered. It is hard to evaluate the resolution of segdups/repeats with it.
0.2G differences in the genome assembly seemed a lot to me.. But the hifiasm assembly has 2% more single copy x duplicated genes
As Heng said, another reason for the higher number of duplicated genes is that the current hifiasm may not purge sufficiently, so p_ctg is a little larger. I just realized why hifiasm cannot purge enough, and hopefully it can be fixed in the next version.
Thank you for the suggestions! The output size from HiCanu is much closer to the hifiasm assembly. I will also look forward to the update!
Using default parameters and HiFi data, the size I assembled was similar to the estimated size, but the duplication rate was a little high (C:94.7%[S:77.9%,D:16.8%],F:1.4%,M:3.9%,n:5310,E:3.1%). I then ran the external tool purge_dups on the genome, but the assembly came out about 700M smaller. Is that too much purging? Do I need to use purge_dups? Thank you very much!
@zuodabin One solution is to manually check where those duplicated genes are. Some of them may not be real duplicated genes.
Hi, I found that hifiasm purges the assembly when the default -l parameter is used, which gave the same hifiasm.p_ctg.fasta as "-l 2". When I assemble with default parameters and then run purge_dups, it still purges more. But my colleague said he got the ideal purged sequence size when using default parameters followed by purge_dups. So I wonder: does the default parameter really purge? And will it purge twice if purge_dups is run afterwards?