dfguan / purge_dups

haplotypic duplication identification tool
MIT License

Good looking cutoffs, but almost no actual purging or change in BUSCO #129

Open igwill opened 1 year ago

igwill commented 1 year ago

Hello,

I am testing purge_dups to clean up a fungal genome assembly that has high BUSCO duplication levels (>65%). After producing purged.fa, I ran some QC checks: the reduction in assembly size was relatively small and the BUSCO scores did not change. However, the cutoffs generated by purge_dups seemed quite sensible to me; if I were to set them manually, I would choose the same values. I'm interested in producing a representative primary assembly, so I have only used purge_dups "Steps" 1-3 (not the haplotype Step 4 in the docs). Any ideas?

1. Initial jellyfish + genomescope on the PacBio HiFi reads. The profile seems to show considerable heterozygosity:
[image: OHumb_genomescope_plot]

2. Initial Hifiasm assembly stats (standard built-in purging allowed; trying to force more aggressive purging with -s 0.3 or -s 0.15 actually made it worse):

Mbp    Avg.Coverage  Contigs  N50 (Mbp)  largest contig  BUSCO-C  BUSCO-S  BUSCO-D
95.3   41            45       11.1       17.5            97.1     31.9     65.2

3a. purge_dups hist plot and cutoffs:

[M::calcuts] Find 2 peaks
[M::calcuts] Merge local peaks and valleys: 2 peaks remain
[M::calcuts] Remove peaks and valleys less than 5: 2 peaks remain
[M::calcuts] Use top 3 frequent read depth
[M::calcuts] Found a valley in the middle of the peaks, use two-peak mode
5 14 30 31 57 132

[image: steps1-3_hist]

3b. dups.bed contents:

ptg000037l 1 22275 JUNK
ptg000015l 1 152862 HIGHCOV
ptg000023c 1 33723 HIGHCOV ## my note: this is probably a mitochondrial contig ("c" for circular) and should be added back to the assembly later
ptg000024l 1 11411 HIGHCOV
ptg000029l 1 9179 JUNK
ptg000038l 1 8151 JUNK
ptg000012l 0 15371 HAPLOTIG ptg000001l
ptg000018l 0 18216 HAPLOTIG ptg000001l
ptg000017l 0 25604 HAPLOTIG ptg000001l
ptg000031l 0 19747 HAPLOTIG ptg000008l
ptg000028l 0 25015 HAPLOTIG ptg000001l
ptg000027l 0 26156 HAPLOTIG ptg000001l
ptg000030l 0 17533 HAPLOTIG ptg000001l
ptg000041l 0 16097 HAPLOTIG ptg000001l
ptg000022l 0 19241 HAPLOTIG ptg000001l
ptg000039l 0 17700 HAPLOTIG ptg000001l
ptg000045l 0 18957 HAPLOTIG ptg000008l
ptg000044l 0 16238 HAPLOTIG ptg000001l
ptg000043l 0 15884 HAPLOTIG ptg000001l
ptg000019l 0 22260 REPEAT ptg000008l
ptg000021l 0 17577 HAPLOTIG ptg000035l
ptg000026l 0 14896 HAPLOTIG ptg000001l
ptg000034l 0 14029 HAPLOTIG ptg000001l
ptg000042l 0 13008 HAPLOTIG ptg000001l
ptg000033l 0 12455 HAPLOTIG ptg000008l
ptg000032l 0 14893 HAPLOTIG ptg000001l
ptg000036l 0 10395 HAPLOTIG ptg000001l
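To put the dups.bed listing in perspective, here is a minimal Python sketch that tallies how many intervals and how many bases fall under each label (JUNK, HIGHCOV, HAPLOTIG, REPEAT). The function name `summarize_dups` and the `example` list are hypothetical, introduced only for illustration. The sketch assumes whitespace-separated columns (contig, start, end, type, optional partner contig) and treats each interval as half-open, so its length is end - start. That coordinate convention is an assumption, and the totals can be off by one base per interval if it is wrong.

```python
from collections import defaultdict


def summarize_dups(lines):
    """Return {type: (n_intervals, total_bp)} for dups.bed-style lines.

    Assumes columns: contig, start, end, type[, partner contig].
    Interval length is taken as end - start (half-open; an assumption).
    """
    summary = defaultdict(lambda: [0, 0])
    for line in lines:
        # Drop any trailing "##" note and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        start, end, kind = int(fields[1]), int(fields[2]), fields[3]
        summary[kind][0] += 1
        summary[kind][1] += end - start
    return {kind: tuple(counts) for kind, counts in summary.items()}


# A few entries copied from the listing above; in practice, pass open("dups.bed").
example = [
    "ptg000037l 1 22275 JUNK",
    "ptg000015l 1 152862 HIGHCOV",
    "ptg000023c 1 33723 HIGHCOV ## my note: probably mitochondrial",
    "ptg000012l 0 15371 HAPLOTIG ptg000001l",
    "ptg000019l 0 22260 REPEAT ptg000008l",
]

for kind, (n, bp) in sorted(summarize_dups(example).items()):
    print(f"{kind}\t{n} intervals\t{bp} bp")
```

Run over the full file, the per-label totals can be compared directly with the assembly size (95.3 Mbp here) to see how much sequence each category accounts for.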

4. QC check after purging. It removed 27 of the 45 contigs (over half), but reduced the assembly size by only 0.6 Mbp and did not change the BUSCO scores:

Mbp    Avg.Coverage  Contigs  N50 (Mbp)  largest contig  BUSCO-C  BUSCO-S  BUSCO-D
94.7   39            18       11.1       17.5            97.1     31.9     65.2

5. I regenerated PB.stat on the output purged.fa and saw essentially no change. Only a tiny blip around 2x coverage has been removed compared with the histogram above, which was based on the initial assembly:
[image: post_purge_hist]

Any recommendations on cleaning this up? Thanks!