dfguan / purge_dups

haplotypic duplication identification tool
MIT License

Highly heterozygous species #76

Open wuxingbo1986 opened 3 years ago

wuxingbo1986 commented 3 years ago

Hi,

purge_dups was only able to separate a very small set of my contigs. Do you have any suggestions for improvement?

I have PacBio sequencing reads and used the default parameters.

Thanks.

dfguan commented 3 years ago

Hello, I guess the cutoffs may not be appropriate; you should adjust them based on the read-depth histogram of your data. Dengfeng.
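For example, assuming the standard purge_dups workflow where pbcstat has already produced PB.stat and PB.base.cov, you could plot the histogram, check where the automatic cutoffs land, and override them roughly like this (the -l/-m/-u values and file names below are only placeholders for illustration):

```sh
# automatic cutoffs from the read-depth histogram
calcuts PB.stat > cutoffs 2> calcults.log

# plot the histogram with the cutoffs overlaid to see where they fall
# (check scripts/hist_plot.py --help for the exact options in your version)
python3 scripts/hist_plot.py -c cutoffs PB.stat PB.cov.png

# if the automatic values miss the peaks, set them by hand and re-run purging
calcuts -l 5 -m 30 -u 120 PB.stat > cutoffs_manual 2> calcults_manual.log
purge_dups -2 -T cutoffs_manual -c PB.base.cov asm.split.self.paf.gz > dups.bed 2> purge_dups.log
```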

wuxingbo1986 commented 3 years ago

Hi Dengfeng,

Thanks for your reply. What cutoffs would you recommend based on the histogram below?

[image: PB cov histogram]

gitcruz commented 3 years ago

Hi Dengfeng,

I have similar doubts, but for two relatively homozygous mammalian genomes sequenced at 35x and 51x ONT coverage, respectively.

So far I have used the assemblies purged with the default cutoffs and got good scaffolding results. However, I went through some of the examples in issue #14, and my understanding is that the 4th cutoff (2n?) should fall between the heterozygous peak (1n) and the main peak (2n, the homozygous peak and mean?).

In the first case, the 4th cutoff is at 27x, below what I would consider the homozygous peak (mean = 35x in the calcuts log): cutoffs = 5 13 21 27 43 81

[image: Po_PB cov histogram]

If my interpretation is right (2n = mean coverage = homozygous peak), then I would leave it as is, stick to the selected cutoffs and not reset them manually. Is this right?

The second case corresponds to an extremely "homozygous" genome, where the mean coverage was estimated to be 50x (as expected). However, the fourth cutoff (diploid, middle coverage) is higher than this, 69x: [M::calcuts] mean: 50, peak: 46, mean larger than peak, treat as diploid assembly; cutoffs: 5 35 57 69 115 207

[image: Lp_PB cov histogram]

Here, I think the wiser thing to do would be to re-set the cutoffs with -m 35 (leaving -l 5 and -u 207) and purge again. What do you think?
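Something along these lines is what I have in mind (a sketch only; file names are placeholders from my run directory, so adjust them to your own layout):

```sh
# re-generate the cutoffs keeping -l/-u but forcing the transition to 35x
calcuts -l 5 -m 35 -u 207 PB.stat > cutoffs_m35 2> calcults_m35.log

# re-run purging with the manual cutoffs and extract the purged assembly
purge_dups -2 -T cutoffs_m35 -c PB.base.cov asm.split.self.paf.gz > dups_m35.bed 2> purge_dups_m35.log
get_seqs -e dups_m35.bed asm.fa
```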

Recapitulating:
1. Is it OK to keep the case 1 results?
2. Should I rerun case 2 with -m 35-38?
3. Should the 4th (middle) cutoff be 0.75 of the mean coverage, or right in the valley between the heterozygous and homozygous peaks?
4. Is there any way to automate this procedure without having to inspect the coverage histogram? This seems to be a problem for automating purge_dups or including it in a pipeline, although I compared several cases and 0.75 multiplied by the mean seems to do the job (see the sketch below).
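For point 4, this is roughly the kind of automation I mean, assuming PB.stat is the two-column (depth, count) histogram written by pbcstat; please double-check the format of your own file before relying on it:

```sh
# estimate the mean depth from the histogram and propose -m as ~0.75 x mean
MEAN=$(awk '{sum += $1 * $2; n += $2} END {printf "%.0f", sum / n}' PB.stat)
MID=$(awk -v m="$MEAN" 'BEGIN {printf "%.0f", 0.75 * m}')
echo "mean depth: ${MEAN}x, proposed -m: ${MID}x"

# regenerate the cutoffs with the proposed transition value
calcuts -m "$MID" PB.stat > cutoffs_auto 2> calcults_auto.log
```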

Thanks in advance, Fernando

charlesfeigin commented 3 years ago

Bump for Fernando's question. I think I'm in a very similar case to his, but I don't really understand how to use the histogram to alter my cutoffs (or what each cutoff represents on the histogram).

I tried adjusting -m in a similar manner to Fernando's case 2 (i.e. moving it from a value around the middle of the right-hand slope of my single peak towards the left), but this purged even less than the automatic cutoffs, and I still have essentially the same high rate of BUSCO duplicates (~10% against mammalia).

gitcruz commented 3 years ago

Hi Charles,

The generic explanation of the cutoffs' meanings can be found here:

https://github.com/dfguan/purge_dups/issues/14#issuecomment-547208728

Best, Fernando