dfguan / purge_dups

haplotypic duplication identification tool
MIT License

Overpurging plant genome #80

Open Brent-Saylor-Canopy opened 3 years ago

Brent-Saylor-Canopy commented 3 years ago

Hi Dengfeng,

I am using purge_dups to purge a plant genome assembled from ~37x coverage of PacBio HiFi reads. Cytological estimates for this species put the genome size at 850 Mb, while k-mer analysis puts it at ~600 Mb with heterozygosity around 1.7%.

Using the run_purge_dups.py script on our 1.5 Gb Canu assembly, I get the following graph and cutoffs. [image: cutoff_histogram]

This results in most of the sequence (~1 Gb) ending up in the haplotigs file and the remaining ~500 Mb in the primary contigs. I also tried setting the cutoffs manually with the command calcuts -l 7 -m 25 -u 230 -d 2
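To put these numbers side by side, here is a minimal sketch of the size arithmetic. It only uses the approximate figures quoted in this post (1.5 Gb assembly, ~500 Mb primary, 600 to 850 Mb expected genome size); the function name and the dictionary keys are hypothetical and not part of purge_dups.

```python
def purge_summary(assembly_mb, primary_mb, expected_low_mb, expected_high_mb):
    """Compare a purged primary assembly against an expected haploid size range.

    All sizes are in Mb. This is plain arithmetic on reported totals,
    not a reimplementation of anything purge_dups does internally.
    """
    purged_mb = assembly_mb - primary_mb
    return {
        "purged_mb": purged_mb,
        "purged_fraction": purged_mb / assembly_mb,
        # How far the primary assembly falls below each size estimate.
        "shortfall_vs_low_mb": max(0, expected_low_mb - primary_mb),
        "shortfall_vs_high_mb": max(0, expected_high_mb - primary_mb),
        # Flag overpurging when the primary is smaller than the lowest estimate.
        "overpurged": primary_mb < expected_low_mb,
    }


# Approximate figures from this post: 1.5 Gb Canu assembly, ~500 Mb primary,
# k-mer estimate ~600 Mb, cytological estimate 850 Mb.
summary = purge_summary(1500, 500, 600, 850)
for key, value in summary.items():
    print(f"{key}: {value}")
```

With these inputs the primary assembly falls about 100 Mb short of the k-mer estimate and about 350 Mb short of the cytological estimate, which is why the result looks overpurged rather than just fully purged.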

Do you have any idea why this might be overpurging? Is this a repetitive content problem? Jellyfish/GenomeScope did predict that ~300 Mb of the genome is repetitive.

dfguan commented 3 years ago

Yes, I am afraid this is due to the repetitive sequences. Actually, I have never tried purge_dups on highly repetitive plant genomes. I think you can try purge_haplotigs (https://bitbucket.org/mroachawri/purge_haplotigs/src/master/): mask your genome first, then also pass the repeat annotation file to purge_haplotigs.

Best, Dengfeng.