I am testing out purge_dups to clean up a fungal genome assembly that has high BUSCO duplication levels (>65%). After producing purged.fa, I ran some QC tests: the reduction in assembly size was relatively small and the BUSCO scores did not change. However, the cutoffs generated by purge_dups seemed quite sensible to me; if I were to set them manually, I would choose the same values.
I'm interested in producing a representative primary assembly, so I have only used purge_dups "Steps" 1-3 (not the haplotype Step 4 in the docs).
Any ideas?
1. Initial jellyfish + genomescope on the PacBio HiFi reads. Seems to show quite some heterozygosity:
2. Initial hifiasm assembly stats (standard built-in purging allowed; trying to force more aggressive purging with -s 0.3 or -s 0.15 actually made it worse):
| Size (Mbp) | Avg. coverage | Contigs | N50 (Mbp) | Largest contig (Mbp) | BUSCO-C (%) | BUSCO-S (%) | BUSCO-D (%) |
|---|---|---|---|---|---|---|---|
| 95.3 | 41 | 45 | 11.1 | 17.5 | 97.1 | 31.9 | 65.2 |
3a. purge_dups hist plot and cutoffs:
[M::calcuts] Find 2 peaks
[M::calcuts] Merge local peaks and valleys: 2 peaks remain
[M::calcuts] Remove peaks and valleys less than 5: 2 peaks remain
[M::calcuts] Use top 3 frequent read depth
[M::calcuts] Found a valley in the middle of the peaks, use two-peak mode
5 14 30 31 57 132
4. QC check after purging. purge_dups removed 27 contigs (over half) but reduced the assembly size by only 0.6 Mbp and did not change the BUSCO scores:
| Size (Mbp) | Avg. coverage | Contigs | N50 (Mbp) | Largest contig (Mbp) | BUSCO-C (%) | BUSCO-S (%) | BUSCO-D (%) |
|---|---|---|---|---|---|---|---|
| 94.7 | 39 | 18 | 11.1 | 17.5 | 97.1 | 31.9 | 65.2 |
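A quick arithmetic check on the two stats tables makes the problem easier to see. This is only a sketch using the numbers reported above; the variable and key names are my own and not part of any tool.

```python
# Stats copied from the before/after tables in this post.
before = {"mbp": 95.3, "contigs": 45, "busco_s": 31.9, "busco_d": 65.2}
after = {"mbp": 94.7, "contigs": 18, "busco_s": 31.9, "busco_d": 65.2}

removed_contigs = before["contigs"] - after["contigs"]
removed_mbp = round(before["mbp"] - after["mbp"], 1)

# Average size of a removed contig, in kbp.
mean_removed_kbp = removed_mbp * 1000 / removed_contigs

# Share of the original assembly that was removed, in percent.
removed_pct = 100 * removed_mbp / before["mbp"]

print(f"contigs removed: {removed_contigs}")
print(f"sequence removed: {removed_mbp} Mbp ({removed_pct:.2f}% of assembly)")
print(f"mean removed contig: {mean_removed_kbp:.1f} kbp")
print(f"BUSCO-D change: {after['busco_d'] - before['busco_d']:.1f} points")
```

The removed contigs average only about 22 kbp each, so nearly all of the assembly (and presumably the duplicated BUSCOs) sits in the 18 contigs that were kept.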
5. Reran PB.stat on the output purged.fa and saw essentially no change: only a tiny blip around 2x coverage has been removed compared to the histogram above, which was based on the initial assembly:
3b. dups.bed contents:
ptg000037l 1 22275 JUNK
ptg000015l 1 152862 HIGHCOV
ptg000023c 1 33723 HIGHCOV ## my note: this is probably a mitochondrial contig ("c" for circular) and should be added back to the assembly later
ptg000024l 1 11411 HIGHCOV
ptg000029l 1 9179 JUNK
ptg000038l 1 8151 JUNK
ptg000012l 0 15371 HAPLOTIG ptg000001l
ptg000018l 0 18216 HAPLOTIG ptg000001l
ptg000017l 0 25604 HAPLOTIG ptg000001l
ptg000031l 0 19747 HAPLOTIG ptg000008l
ptg000028l 0 25015 HAPLOTIG ptg000001l
ptg000027l 0 26156 HAPLOTIG ptg000001l
ptg000030l 0 17533 HAPLOTIG ptg000001l
ptg000041l 0 16097 HAPLOTIG ptg000001l
ptg000022l 0 19241 HAPLOTIG ptg000001l
ptg000039l 0 17700 HAPLOTIG ptg000001l
ptg000045l 0 18957 HAPLOTIG ptg000008l
ptg000044l 0 16238 HAPLOTIG ptg000001l
ptg000043l 0 15884 HAPLOTIG ptg000001l
ptg000019l 0 22260 REPEAT ptg000008l
ptg000021l 0 17577 HAPLOTIG ptg000035l
ptg000026l 0 14896 HAPLOTIG ptg000001l
ptg000034l 0 14029 HAPLOTIG ptg000001l
ptg000042l 0 13008 HAPLOTIG ptg000001l
ptg000033l 0 12455 HAPLOTIG ptg000008l
ptg000032l 0 14893 HAPLOTIG ptg000001l
ptg000036l 0 10395 HAPLOTIG ptg000001l
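To see how much sequence each category accounts for, the dups.bed records can be tallied with a short script. This is a minimal sketch: `summarize_dups` is a hypothetical helper of my own (not part of purge_dups), the records are pasted in from the listing above, and the length calculation assumes BED-style half-open intervals.

```python
from collections import defaultdict

# Records copied from the dups.bed listing in this post (my note omitted).
DUPS_BED = """\
ptg000037l 1 22275 JUNK
ptg000015l 1 152862 HIGHCOV
ptg000023c 1 33723 HIGHCOV
ptg000024l 1 11411 HIGHCOV
ptg000029l 1 9179 JUNK
ptg000038l 1 8151 JUNK
ptg000012l 0 15371 HAPLOTIG ptg000001l
ptg000018l 0 18216 HAPLOTIG ptg000001l
ptg000017l 0 25604 HAPLOTIG ptg000001l
ptg000031l 0 19747 HAPLOTIG ptg000008l
ptg000028l 0 25015 HAPLOTIG ptg000001l
ptg000027l 0 26156 HAPLOTIG ptg000001l
ptg000030l 0 17533 HAPLOTIG ptg000001l
ptg000041l 0 16097 HAPLOTIG ptg000001l
ptg000022l 0 19241 HAPLOTIG ptg000001l
ptg000039l 0 17700 HAPLOTIG ptg000001l
ptg000045l 0 18957 HAPLOTIG ptg000008l
ptg000044l 0 16238 HAPLOTIG ptg000001l
ptg000043l 0 15884 HAPLOTIG ptg000001l
ptg000019l 0 22260 REPEAT ptg000008l
ptg000021l 0 17577 HAPLOTIG ptg000035l
ptg000026l 0 14896 HAPLOTIG ptg000001l
ptg000034l 0 14029 HAPLOTIG ptg000001l
ptg000042l 0 13008 HAPLOTIG ptg000001l
ptg000033l 0 12455 HAPLOTIG ptg000008l
ptg000032l 0 14893 HAPLOTIG ptg000001l
ptg000036l 0 10395 HAPLOTIG ptg000001l
"""


def summarize_dups(bed_text):
    """Return {type: (record_count, total_bp)} for dups.bed-style text."""
    summary = defaultdict(lambda: [0, 0])
    for line in bed_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop any trailing note
        if not line:
            continue
        _name, start, end, kind = line.split()[:4]
        summary[kind][0] += 1
        # Assumption: half-open BED intervals, so length = end - start.
        summary[kind][1] += int(end) - int(start)
    return {kind: tuple(values) for kind, values in summary.items()}


summary = summarize_dups(DUPS_BED)
total_records = sum(n for n, _ in summary.values())
total_bp = sum(bp for _, bp in summary.values())

for kind, (n, bp) in sorted(summary.items()):
    print(f"{kind:9s} {n:3d} records {bp:8d} bp")
print(f"total     {total_records:3d} records {total_bp:8d} bp")
```

The 27 records add up to roughly 0.6 Mbp, which matches the size reduction seen in step 4; none of the flagged intervals is longer than about 153 kbp.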
Any recommendations on cleaning this up? Thanks!