chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License
524 stars 86 forks source link

Strange BUSCO duplicate in hifiasm -l3 only vs hifiasm -l1+purge_dups #464

Open Nicholas-Kron opened 1 year ago

Nicholas-Kron commented 1 year ago

I am currently assembling a teleost genome and I was comparing using hifiasm alone as advocated by the developer vs hifiasm+purge_dups as recommended by the Vertebrate Genome Project. The comparison was between a hifiasm assembly with -l 3 and -s 0.45 (primary_asm_s45) vs hifiasm with -l 1and then purge_dups with -a 80 (hifiasm_asm_a80.purged). Both assemblies used estimates from genomescope2 for purging parameters (ie --hg-size and --purge-maxfor the pure hifiasm assembly). Generally speaking, they are very comparable.

The merqury kmer profiles looks almost identical for the primary assemblies of each and the the kmer completeness is very close (90 vs 89.5).

The pure hifiasm assembly looking a bit nicer in terms of length and contiguity in the quest report (replicated at the bottom).

BUSCO is very similar too, the two assemblies differ in only 14/3640 BUSCOs. I have read the developer's feelings on BUSCO assessment, namely that it is underpowered and it is had to distinguish setdups and other real duplications. Comparing the BUSCO reports of the two assemblies, again very similar ( busco_overlap_s45_a80.pdf ). This is probably splitting hairs at this point, but the duplication results were a bit strange to me.

Some I expected, namely that the purge hifiasm has complete single copy BUSCOs for two fragmentary and one missing in the purge_dups assembly. The pure hifiasm does have a higher number of duplicate BUSCOs, 45 vs 40 in the purge_dups. interestingly, 8 of the duplicates are marked complete in the purged assembly, while only 3 duplicates in the purged assembly are marked as complete in the pure hifiasm assembly. Overall very similar. The weird part is when I looked at the duplicates themselves. The shared duplicates between the assemblies are identical, both in terms of size and the number of contigs they appear on. The duplicates unique to each assembly only appear on two and rarely 3 contigs, except for one. BUSCO zinc finger protein Gfi-1b (83129at7898) appears 13 times across 12 different contigs, 5 times in forward and 8 on reverse, in the pure hifiasm assembly. The contigs it appears on range from 1Mb to 68Mb in size. It is marked as complete single copy in the purge_dups assembly.

I am not really sure how to interpret this. It looks like an artifact to me. I chose the -s 0.45 for more stringent purging than default parameters which also had this duplicate, but should I opt for an even lower value to prune out this strange duplicate? The estimated heterozygosity is between 0.87% and 0.91%, so nothing crazy. Based on various quality metrics the pure hifiasm assembly seems to be better but this weird duplication gives me pause. I would appreciate any advice you may have on this.

Assembly primary_asm_s45 primary_asm_a80.purged
# contigs (>= 0 bp) 624 490
# contigs (>= 1000 bp) 624 490
# contigs (>= 5000 bp) 624 490
# contigs (>= 10000 bp) 624 490
# contigs (>= 25000 bp) 595 485
# contigs (>= 50000 bp) 494 464
Total length (>= 0 bp) 2265825488 2152282050
Total length (>= 1000 bp) 2265825488 2152282050
Total length (>= 5000 bp) 2265825488 2152282050
Total length (>= 10000 bp) 2265825488 2152282050
Total length (>= 25000 bp) 2265223997 2152176196
Total length (>= 50000 bp) 2261654608 2151377480
# contigs 624 490
Largest contig 68564173 53083750
Total length 2265825488 2152282050
GC (%) 42.01 41.99
N50 24641502 16888119
N90 2123137 2228262
auN 26067506.4 19295668.8
L50 30 38
L90 156 167
# N's per 100 kbp 0 0
chhylp123 commented 1 year ago

I feel like busco results might be not such accurate. In most cases, duplicated genes come from the corresponding homologous regions which haven't been fully purged. Probably you can check how do the corresponding nodes of these duplicated gene looks like in the assembly graph. I guess you should have a sense if these genes come from the homologous regions.

Nicholas-Kron commented 1 year ago

Taking a look at the GFA in Bandage, the specific contigs do not share edges (no bubbles or forks). None of the contigs have unusual GC content or read average read depth either. I am currently running a mummer alignment to gauge the level of sequence level similarity.

Interestingly, scaffolding the purge_dups assembly with 5 rounds of ntLink using the hifireads takes this BUSCO from being single copy complete to having 5 duplicates, 2 on the positive and 3 on the negative strands across 4 scaffolds. So using hifireads to gap fill and scaffold seems to reintroduce the duplicates. Lowering the s parameter to 0.35 in hifiasm does not lower the number of repeats.

chhylp123 commented 1 year ago

Thanks... So it is hard to say