dfguan / purge_dups

haplotypic duplication identification tool
MIT License

Detailed explanation of the purge_dups arguments #150

Open melop opened 2 weeks ago

melop commented 2 weeks ago

purge_dups can be tuned by setting various parameters, and I found a big difference in purging between settings. For example, with the default parameters I get 2.2G of purged sequences, while with some other parameters I get 2.6G. There is a big reduction in BUSCO with the 2.2G version, so I am looking for a better parameter set. However, these parameters don't seem to be well documented. Would it be possible to give a more detailed explanation of what each parameter means? In particular, what are the valid ranges of some parameters (the minimum alignment score, for example)?

-f INT minimum fraction of haploid/diploid/bad/repetitive bases in a sequence [.8]

-a INT minimum alignment score [70]

-b INT minimum max match score [200]

-2 BOOL 2 rounds chaining [FALSE]

-m INT minimum matching bases for chaining [500]

-M INT maximum gap size for chaining [20K]

-G INT maximum gap size for 2nd round chaining [50K]

-l INT minimum chaining score for a match [10K]

-E INT maximum extension for contig ends [15K]
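
Until the ranges are documented, one way to compare settings is to run a small grid over the options listed above and check BUSCO for each result. The following is a minimal sketch that only builds the command lines and does not run anything. The helper name `build_commands` and the input file names (`cutoffs`, `PB.base.cov`, `asm.split.self.paf.gz`) are placeholders, and the `-T`/`-c` invocation shape is an assumption that is not part of the option list above, so check it against the `purge_dups` help output before use.

```python
from itertools import product

# Defaults copied from the option list above ("K" suffixes expanded).
# Used here only to reject flags that are not in that list.
DEFAULTS = {"-f": 0.8, "-a": 70, "-b": 200, "-m": 500,
            "-M": 20_000, "-G": 50_000, "-l": 10_000, "-E": 15_000}


def build_commands(grid, paf="asm.split.self.paf.gz",
                   cutoffs="cutoffs", cov="PB.base.cov"):
    """Return one argument list per combination of the values in `grid`.

    `grid` maps a flag from the option list to the values to try.
    The file names are placeholders, and the `-T`/`-c` arguments are an
    assumed invocation shape, not something stated in the option list.
    """
    unknown = set(grid) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"flags not in the option list: {sorted(unknown)}")

    flags = sorted(grid)
    commands = []
    # Cartesian product: every combination of the candidate values.
    for values in product(*(grid[flag] for flag in flags)):
        cmd = ["purge_dups", "-T", cutoffs, "-c", cov]
        for flag, value in zip(flags, values):
            cmd += [flag, str(value)]
        cmd.append(paf)
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Example: vary the minimum alignment score and minimum chaining score.
    for cmd in build_commands({"-a": [60, 70, 80], "-l": [5_000, 10_000]}):
        print(" ".join(cmd))
```

Each printed line can be redirected to its own output file, so the purged size and BUSCO score can be tabulated per parameter combination. The candidate values in the example are arbitrary and are not recommended settings.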