Overpurging plant genome

dfguan / purge_dups

haplotypic duplication identification tool

MIT License

Hi Dengfeng,

I am using purge_dups to purge a plant genome assembled from ~37x coverage of PacBio HiFi reads. Cytological estimates for this species put the genome size at 850 Mb, while k-mer analysis puts it at ~600 Mb with heterozygosity around 1.7%.

Using the run_purge_dups.py script on our 1.5 Gb Canu assembly, I get the following graph and cutoffs.

[Image: cutoff_histogram]

This results in most of the sequence (~1 Gb) ending up in the haplotigs file and the remaining ~500 Mb in the primary contigs. I also tried setting the cutoffs manually with the command calcuts -l 7 -m 25 -u 230 -d 2
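As a rough sanity check on the sizes above, here is a minimal Python sketch of the arithmetic. It assumes that an ideally purged primary set should be about one haploid genome in size, which is my assumption rather than something purge_dups reports; the function name purge_summary is made up for illustration, and all inputs are the approximate figures quoted in this post.

```python
# All sizes are in megabases (Mb) and are the approximate figures from this post.
ASSEMBLY_MB = 1500   # Canu assembly
PRIMARY_MB = 500     # kept as primary contigs after purging
HAPLOID_ESTIMATES_MB = {"k-mer": 600, "cytological": 850}


def purge_summary(assembly_mb, primary_mb, haploid_mb):
    """Compare a purge result against one haploid genome size estimate.

    Assumption: the ideal primary set is roughly one haploid genome,
    so everything beyond that size would be expected to be purged.
    """
    return {
        # How many haploid genome equivalents the raw assembly contains.
        "assembly_over_haploid": assembly_mb / haploid_mb,
        # Sequence expected to move to the haplotigs file.
        "expected_purged_mb": assembly_mb - haploid_mb,
        # How far the primary set falls below the haploid estimate.
        "primary_shortfall_mb": haploid_mb - primary_mb,
    }


for name, haploid_mb in HAPLOID_ESTIMATES_MB.items():
    s = purge_summary(ASSEMBLY_MB, PRIMARY_MB, haploid_mb)
    print(
        f"{name}: assembly is {s['assembly_over_haploid']:.2f}x the haploid size, "
        f"expected ~{s['expected_purged_mb']} Mb purged, "
        f"primary set is {s['primary_shortfall_mb']} Mb short"
    )
```

Under that assumption, the ~500 Mb primary set is about 100 Mb short of the k-mer estimate and about 350 Mb short of the cytological one, which is why I think this is overpurging.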

Do you have any idea why this might be overpurging? Is this a repetitive content problem? Jellyfish/GenomeScope did predict that ~300 Mb of the genome is repetitive.

Overpurging plant genome #80