Open Brent-Saylor-Canopy opened 3 years ago
Yes, I am afraid this is due to the repetitive sequences. Actually I have never tried purge_dups on highly repetitive plant genomes. I think you can try purge_haplotigs (https://bitbucket.org/mroachawri/purge_haplotigs/src/master/): you can mask your genome first and then also supply the repeat annotation file to purge_haplotigs. Best, Dengfeng.
Hi Dengfeng,
I am using purge_dups to purge a plant genome assembled from ~37x coverage of PacBio HiFi reads. Cytological estimates for this species put the genome size at ~850 Mb, while k-mer analysis puts it at ~600 Mb with heterozygosity around 1.7%.
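For context on where the ~600 Mb k-mer estimate comes from: GenomeScope-style estimates divide the total number of k-mers by the depth of the homozygous (2x) coverage peak. A minimal sketch, with illustrative numbers only (the total k-mer count and peak depth below are assumptions, not values from this assembly):

```python
def estimate_genome_size(total_kmers, homozygous_peak_depth):
    """Rough genome-size estimate: total k-mers divided by the
    depth of the homozygous coverage peak in the k-mer spectrum."""
    return total_kmers // homozygous_peak_depth

# Hypothetical inputs: ~37x HiFi reads with the homozygous peak near 34x.
size = estimate_genome_size(total_kmers=20_400_000_000,
                            homozygous_peak_depth=34)
print(size)  # ~600 Mb, in line with the k-mer estimate above
```

In a highly heterozygous genome the spectrum is bimodal, and mistaking the heterozygous (1x) peak for the homozygous one roughly doubles the estimate, which is one reason cytological and k-mer sizes can disagree.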
Using the run_purge_dups.py script on our 1.5 Gb Canu assembly, I get the following graph and cutoffs: ![cutoff_histogram](https://user-images.githubusercontent.com/69979021/112183472-a905f680-8bd4-11eb-99cb-a00b634d0e8c.png)
This puts most of the sequence (~1 Gb) in the haplotigs file and only the remaining ~500 Mb in the primary contigs. I also tried setting the cutoffs manually with the command
`calcuts -l 7 -m 25 -u 230 -d 2`
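For anyone reading along: the three cutoffs correspond to a low (junk) depth, the haploid/diploid transition, and a high (repeat) depth in the PB.stat read-depth histogram. A rough heuristic sketch of how one might pick them manually from the two coverage peaks (this is not calcuts' actual algorithm, and the peak depths below are hypothetical):

```python
def manual_cutoffs(het_peak, dip_peak):
    """Heuristic cutoffs from the heterozygous (1x) and diploid (2x)
    depth peaks of a read-depth histogram. Not calcuts' exact logic."""
    low = max(2, het_peak // 4)        # below this, treat as junk
    mid = (het_peak + dip_peak) // 2   # haploid/diploid transition (valley)
    high = dip_peak * 3                # above this, likely collapsed repeat
    return low, mid, high

# Hypothetical peaks for ~37x HiFi: het peak ~17x, diploid peak ~34x.
print(manual_cutoffs(17, 34))  # (4, 25, 102)
```

Note the transition value matters most: contigs whose average depth falls below `-m` are candidates for purging, so a transition set too high will sweep genuine primary contigs into the haplotigs file.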
Do you have any idea why this might be overpurging? Is this a repetitive-content problem? Jellyfish/GenomeScope did predict that ~300 Mb of the genome is repetitive.
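To make the overpurging failure mode concrete: purge_dups combines depth with self-alignment evidence, but depth drives the initial labeling, so a coarse caricature of the depth-based step (names and thresholds here are illustrative, not the tool's internals) shows how repeat-skewed coverage can misclassify contigs:

```python
def classify_by_depth(avg_depth, low, mid, high):
    """Caricature of depth-based contig labeling. The real purge_dups
    also requires alignment overlap before calling something a haplotig."""
    if avg_depth < low:
        return "JUNK"
    if avg_depth < mid:
        return "HAPLOTIG"  # depth consistent with one haplotype copy
    if avg_depth > high:
        return "REPEAT"    # depth consistent with a collapsed repeat
    return "PRIMARY"

# With the manual cutoffs from the question (-l 7 -m 25 -u 230):
print(classify_by_depth(17, 7, 25, 230))   # HAPLOTIG
print(classify_by_depth(34, 7, 25, 230))   # PRIMARY
```

If repeats inflate or distort the depth histogram, many genuinely primary contigs can land below the transition and align to repeat copies elsewhere, which is consistent with the suggestion above to mask repeats before purging.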