chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

Will it over purge when using purge_dups? #70

Closed yywyaoyaowu closed 3 years ago

yywyaoyaowu commented 3 years ago

Hi, I found that hifiasm purges the assembly when the default -l parameter is used, giving the same hifiasm.p_ctg.fasta as "-l 2". When I assemble with the default parameters and then run 'purge_dups', it purges even more. But my colleague said he got the ideal purged sequence size when using the default parameters followed by 'purge_dups'. So I wonder: do the default parameters really purge? And will the assembly be purged twice when purge_dups is run afterwards?

chhylp123 commented 3 years ago

By purge_dups, do you mean: https://github.com/dfguan/purge_dups ?

yywyaoyaowu commented 3 years ago

Yes, we use this to purge.

chhylp123 commented 3 years ago

I personally think hifiasm's internal purge_dups is more accurate than the external purge_dups, though I may be wrong. However, for some regions with a high heterozygosity rate, hifiasm alone may not be able to purge enough. So I recommend trying all possible combinations: (hifiasm -l0 + external purge_dups), (hifiasm -l1 + external purge_dups) and (hifiasm -l2 + external purge_dups).
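For anyone wanting to script the three combinations, here is a minimal sketch that only builds the hifiasm command lines, one per purge level. The read file name, output prefix and thread count are placeholders, not values from this thread, and the external purge_dups step is left out on purpose: it should follow that tool's own README.

```python
import shlex


def hifiasm_commands(reads, prefix="asm", threads=32, levels=(0, 1, 2)):
    """Build one hifiasm command line per purge level (-l 0, -l 1, -l 2)."""
    commands = []
    for level in levels:
        # A separate output prefix per level keeps the three runs apart.
        out_prefix = f"{prefix}.l{level}"
        commands.append([
            "hifiasm",
            "-o", out_prefix,
            "-t", str(threads),
            "-l", str(level),
            reads,
        ])
    return commands


# "reads.fq.gz" is a hypothetical input file name.
for cmd in hifiasm_commands("reads.fq.gz"):
    print(shlex.join(cmd))
```

Each printed line can be run as is, or passed to `subprocess.run(cmd, check=True)`. The primary contigs of every run would then go through the external purge_dups separately so the results can be compared.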

xinghua1001 commented 3 years ago

In my assembly (nearly 2% heterozygosity), the default parameters gave the same genome size as previously estimated (6.3G), with a more fragmented alternative assembly of 5.2G, which I thought was a perfect result. However, I later found that the proportion of duplicated BUSCOs (4.1.4) in the primary contigs is a little high (8.4%). So I used purge_dups to purge the primary assembly and got a purged.fa of 5.6G, and the duplicated BUSCOs decreased to 3.4%. The contig N50 rose from 52Mb to 59Mb. The contig number decreased from 2917 to 283.

I don't know whether purge_dups over-purges or hifiasm purges insufficiently. However, the default purge level is already the most aggressive.

I'm wondering if hifiasm is suitable for assembling such a highly heterozygous genome. Maybe Falcon-unzip would be better?

The sequencing depth is around 35x.

chhylp123 commented 3 years ago

> I'm wondering if hifiasm is suitable for assembling such a high heterozygous genome. Maybe Falcon-unzip will be better?

Unless Falcon-unzip has been redesigned for HiFi data, we shouldn't use it. Falcon-unzip may produce a smaller assembly, but it is very likely to collapse segdups or different haplotypes. As for BUSCO, it only gives an overall summary, which is not that accurate.

xinghua1001 commented 3 years ago

Thanks Haoyu. It's amazing that hifiasm can assemble huge, complex genomes so fast and with such high quality. I'm just a bit worried that it will not purge enough in highly heterozygous regions. I have tried the --high-het parameter, but the result is not as good as with the default parameters. And I'm not sure whether to use purge_dups on the primary assembly.

chhylp123 commented 3 years ago

I'm not sure, either. If the heterozygosity rate is even across the genome, 2% heterozygosity is not too high for hifiasm's purge_dups. However, we have observed that in some cases the heterozygosity rate is very uneven, so some regions may have a much higher rate. In this case, even if purge_dups can remove the duplicates, it may introduce other problems such as misassemblies or collapsed segdups, and the primary assembly is then not very representative. The ideal solution is to generate two fully phased haplotypes for diploid samples.

xinghua1001 commented 3 years ago

The size of the alternative assembly from hifiasm is really reasonable for my genome, likely thanks to the high heterozygosity. I also have Hi-C data. Is it feasible to generate a fully phased genome by combining the hifiasm output with the Hi-C data, maybe in the next version of hifiasm if it adds support for Hi-C data?

chhylp123 commented 3 years ago

Yeah, we're testing it, and it should be released soon.

xinghua1001 commented 3 years ago

That's awesome! Very much looking forward to it !!!

shilpagarg commented 3 years ago

Alternatively, you may try pstools, which uses Hi-C on the HiFi graph. The tool has been properly tested on human data. (I believe Haoyu is aware of this.) Please see https://github.com/shilpagarg/DipAsm/issues/16 for guidelines.

chhylp123 commented 3 years ago

Yeah, you can also try Shilpa's pstools. Different tools may perform differently.

xinghua1001 commented 3 years ago

With pstools or the next version of hifiasm, is it still necessary to use Hi-C scaffolding tools like 3d-dna?

shilpagarg commented 3 years ago

The pstools method does phasing-aware scaffolding, so no third-party tool is required. It is also helpful for evaluating phased sequences using Hi-C. For example, if an assembler takes a random walk through the graph, the Hi-C based evaluation in pstools could be helpful. Let me know if you have any questions.

chhylp123 commented 3 years ago

@xinghua1001 Hifiasm v0.14 supports Hi-C phased mode, but it is not stable. It is still under development. Please let me know if you have any problems when using it.

skyungyong commented 3 years ago

Hi @xinghua1001,

Do you have any updates on your assembly? I would be curious to hear. My plant genome was estimated to have ~ 1.5% heterozygosity and 0.85G in size by genomescope. The genome size estimation is reasonable as a related species with a great reference genome has a similar genome size.

With hifiasm, I assembled 1.1G primary contigs with N50 of 4Mb. Alternative contigs are about 0.82G. I tried IPA as an alternative, and this assembler gave 0.92G contigs with N50 of 1.3Mb. Both assemblies have similar BUSCO stats. But the hifiasm assembly has 2% more single copy x duplicated genes. 0.2G differences in the genome assembly seemed a lot to me. I wonder how you eventually handled your data.

chhylp123 commented 3 years ago

@skyungyong Just curious: what's the size of IPA's alternative contigs?

lh3 commented 3 years ago

> the hifiasm assembly has 2% more single copy x duplicated genes.

What's the BUSCO completeness (single+duplicated)? If it is 56% vs 58%, a 2% difference could just be statistical fluctuation. If it is 96% vs 98%, 2% is a lot. PS: also, what is percent duplicated genes according to BUSCO? If purging is insufficient, you usually see a large %duplicated.

skyungyong commented 3 years ago

Sorry, I meant that the proportion of the complete single copy genes is similar in both assemblies, but the duplicated ones appear more frequently in the hifiasm output. Here are the statistics:

IPA assembly:

    final.p_ctg.fasta   880M
    final.a_ctg.fasta   713M

Quast for final.p_ctg.fasta

    Num. contigs     1196
    Largest contig   6450199
    Total length     922015288
    GC (%)           34.61
    N50              1369395
    N75              765551
    L50              206
    L75              433

BUSCO for final.p_ctg.fasta

    |Results from dataset solanales_odb10             |
    --------------------------------------------------
    |C:96.9%[S:94.4%,D:2.5%],F:0.6%,M:2.5%,n:5950     |
    |5765   Complete BUSCOs (C)                       |
    |5619   Complete and single-copy BUSCOs (S)       |
    |146    Complete and duplicated BUSCOs (D)        |
    |33     Fragmented BUSCOs (F)                     |
    |152    Missing BUSCOs (M)                        |
    |5950   Total BUSCO groups searched               |
    --------------------------------------------------

hifiasm assembly:

    SH1353.asm.p_ctg.fa   1.1G
    SH1353.asm.a_ctg.fa   787M

Quast for SH1353.asm.p_ctg.fa

    Num. contigs     2564
    Largest contig   15227863
    Total length     1108360991
    GC (%)           35.53
    N50              4129401
    N75              1916824
    L50              78
    L75              178

BUSCO for SH1353.asm.p_ctg.fa

    |Results from dataset solanales_odb10             |
    --------------------------------------------------
    |C:97.3%[S:92.8%,D:4.5%],F:0.3%,M:2.4%,n:5950     |
    |5790   Complete BUSCOs (C)                       |
    |5524   Complete and single-copy BUSCOs (S)       |
    |266    Complete and duplicated BUSCOs (D)        |
    |19     Fragmented BUSCOs (F)                     |
    |141    Missing BUSCOs (M)                        |
    |5950   Total BUSCO groups searched               |
    --------------------------------------------------
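The two BUSCO summaries above can be compared programmatically. The sketch below is only an illustration: it parses the one-line BUSCO notation with a regular expression written for the format shown in this thread, and the function name `parse_busco` is made up for the example, not part of BUSCO.

```python
import re

# Matches the one-line summary, e.g. C:96.9%[S:94.4%,D:2.5%],F:0.6%,M:2.5%,n:5950
BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(summary):
    """Return the BUSCO percentages as floats and n as an int."""
    match = BUSCO_LINE.search(summary)
    if match is None:
        raise ValueError("no BUSCO summary line found")
    stats = {key: float(val) for key, val in match.groupdict().items() if key != "n"}
    stats["n"] = int(match.group("n"))
    return stats


# Summary lines copied from the two reports above.
ipa = parse_busco("C:96.9%[S:94.4%,D:2.5%],F:0.6%,M:2.5%,n:5950")
hifiasm = parse_busco("C:97.3%[S:92.8%,D:4.5%],F:0.3%,M:2.4%,n:5950")

extra_dup = round(hifiasm["D"] - ipa["D"], 1)
print(f"duplicated BUSCOs: IPA {ipa['D']}%, hifiasm {hifiasm['D']}%, difference {extra_dup} points")
```

In absolute counts the reports above give 266 versus 146 duplicated BUSCOs, so the 2-point gap corresponds to 120 genes out of 5950.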

lh3 commented 3 years ago

It seems that hifiasm's dup purging may not be aggressive enough. You may try the standalone purge_dups in this case and see what the result looks like. If you have Hi-C data, it is also worth trying hifiasm's new Hi-C mode.

chhylp123 commented 3 years ago

> My plant genome was estimated to have ~ 1.5% heterozygosity and 0.85G in size by genomescope

In our experiments with human, plants and animals, genomescope always gives a smaller genome size. For example, genomescope estimates the human haploid genome size at about 2.7Gb, while the real size is 3.1Gb.

> 0.2G differences in the genome assembly seemed a lot to me

The total size of the hifiasm assembly (primary contigs + alternative contigs) is about 1.2 times that of IPA's. It is hard for an assembler to generate that many additional wrong contigs. In most cases, a larger assembly is better than a smaller one (maybe I'm wrong). You could check what these extra regions are. If they are segdups/repeats, I guess the assembly is right. You can also assemble with HiCanu and use its assembly to double-check.
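The 1.2 figure can be reproduced from the sizes posted earlier in the thread. This is just a quick check that treats the reported 1.1G as 1100M, which is an approximation since the listed sizes are rounded.

```python
# Sizes in M as posted above; 1.1G is approximated as 1100M.
ipa_total = 880 + 713          # IPA primary + alternative contigs
hifiasm_total = 1100 + 787     # hifiasm primary + alternative contigs

ratio = hifiasm_total / ipa_total
print(f"hifiasm / IPA total size: {ratio:.2f}")
```

With these rounded inputs the ratio comes out just under 1.2.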

> Both assemblies have similar BUSCO stats

For HiFi assemblies, BUSCO is underpowered. It can hardly evaluate how well segdups/repeats are resolved.

> 0.2G differences in the genome assembly seemed a lot to me ... But the hifiasm assembly has 2% more single copy x duplicated genes

As Heng said, another reason for the higher number of duplicated genes is that the current hifiasm may not be able to purge sufficiently, so p_ctg is a little larger. I just realized why hifiasm cannot purge enough, and hopefully it will be fixed in the next version.

skyungyong commented 3 years ago

Thank you for the suggestions! The output size from HiCanu is much closer to the hifiasm assembly. I will also look forward to the update!

zuodabin commented 5 months ago

Using default parameters and HiFi data, the assembly size I got was similar to the estimated size, but the duplication rate was a little high (C:94.7%[S:77.9%,D:16.8%],F:1.4%,M:3.9%,n:5310,E:3.1%). I used the external tool purge_dups to purge the assembly, but the result was about 700M smaller. Is that too much purging? Do I need to use purge_dups? Thank you very much!

chhylp123 commented 4 months ago

@zuodabin One solution is to manually check where those duplicated genes are. Some of them may not be real duplicated genes.
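One possible way to start that manual check is to list where each duplicated BUSCO sits. The sketch below assumes the tab-separated layout of BUSCO's full_table.tsv (Busco id, Status, Sequence, Gene Start, Gene End, ...); the exact columns differ between BUSCO versions, so check the header of your own file first. The function name and the example rows are made up for illustration.

```python
from collections import defaultdict


def duplicated_buscos(lines):
    """Map each Duplicated BUSCO id to the (contig, start, end) of its copies."""
    hits = defaultdict(list)
    for line in lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # Missing entries have fewer columns, so require at least five.
        if len(fields) >= 5 and fields[1] == "Duplicated":
            busco_id, _, contig, start, end = fields[:5]
            hits[busco_id].append((contig, int(start), int(end)))
    return dict(hits)


# Made-up rows in the assumed column order; not data from this thread.
example = [
    "# Busco id\tStatus\tSequence\tGene Start\tGene End\tStrand\tScore\tLength",
    "1at0\tComplete\tctgA\t100\t900\t+\t500.0\t800",
    "2at0\tDuplicated\tctgA\t2000\t2600\t+\t410.2\t600",
    "2at0\tDuplicated\tctgB\t150\t750\t-\t409.8\t600",
    "3at0\tMissing",
]

for busco_id, copies in duplicated_buscos(example).items():
    print(busco_id, copies)
```

For a real run, pass an open file handle instead of the example list. Copies that land on two different contigs are candidates for retained haplotigs, while copies close together on one contig are more likely to be genuine duplications, though either case still needs a manual look.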