dfguan / purge_dups

haplotypic duplication identification tool
MIT License

Set up cutoffs manually #14

Open Pikayy opened 4 years ago

Pikayy commented 4 years ago

Hi, I have been trying to manually select the cutoffs by running "hist_plot.py". As I understand it, we are supposed to select cutoff values from the count vs. read-depth image and manually edit the "cutoffs" file, which contains one line with 6 numbers. Could you explain the meaning of each of the 6 numbers, and which one I should look at to validate the cutoff results?

Many thanks!

dfguan commented 4 years ago

Hi Xuyue, the cutoffs can usually be selected automatically, but the automatic selection is sometimes wrong. You can validate them visually with "python3 hist_plot.py -c cutoff_file PB.stat PB.cov.png". As for the six numbers: the first is the low coverage cutoff, and any contig with an average coverage below it is a junk contig; the fourth is the diploid (middle) coverage; the last is the high coverage cutoff for repeats, and contigs with an average coverage above it are usually repeats. Dengfeng.
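For reference, here is a tiny Python sketch of reading those six fields. Only the 1st (low), 4th (middle/diploid), and 6th (high/repeat) fields are explained in this thread, so the 2nd, 3rd, and 5th are left as unlabeled placeholders rather than guessed at:

```python
# Illustrative sketch: label the six numbers in a purge_dups "cutoffs" file.
# Only fields 1, 4, and 6 are documented in this thread; "field2", "field3",
# and "field5" are hypothetical placeholder names, not purge_dups terminology.

def read_cutoffs(line):
    """Parse one line of a cutoffs file into named fields."""
    vals = [int(x) for x in line.split()]
    assert len(vals) == 6, "expected six whitespace-separated integers"
    labels = ["low (junk below this)", "field2", "field3",
              "middle (diploid peak)", "field5", "high (repeat above this)"]
    return dict(zip(labels, vals))

print(read_cutoffs("5 45 75 90 150 270"))
```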

Pikayy commented 4 years ago

Thanks Dengfeng! Now it makes sense to me. :)

yangxiaofeill commented 4 years ago

Hi Dengfeng, the automatic cutoffs for my data are 5 45 75 90 150 270. The calcults.log is:

[M::calcuts] Find 1 peaks
[M::calcuts] mean: 61, peak: 60, mean larger than peak, treat as diploid assembly
[W::calcuts] mean is not significantly different with peak, please recheck the cutoffs

That means I should set the cutoffs manually. I also plotted the coverage histogram with the hist_plot.py script, and it looks like this: PB cov

From this, I don't know how to reset the cutoffs. Could you help me or give me some suggestions?

Best Xiaofei

dfguan commented 4 years ago

Hi Xiaofei,

You could set the cutoffs with "calcuts -m 45 -u 270 > cutoffs". Your plot looks pretty clean. Why do you want to purge it? What are the BUSCO scores?

Best,

Dengfeng.

yangxiaofeill commented 4 years ago

Hi Dengfeng,

Ours is a newly assembled genome, so I wanted to check whether it contains duplications. Following your suggestion, I reset the cutoffs and confirmed it is a clean assembly. The BUSCO score is 95%.

Thanks.

Best Xiaofei

joeplanta commented 4 years ago

Hi Dengfeng,

I am checking out your program to deduplicate haplotigs in my draft assembly. My original cutoffs are: 5 45 73 90 148 270, and I manually corrected them with calcuts -m 73 -u 270 > cutoffs 2> calcults.log to generate purged.fa. Are these the correct parameters to use? I have also attached the histogram from the automatically generated cutoffs.

PB cov

With this parameter I am able to bring the contig numbers in my draft assembly from ~1100 to ~600.

Thank you!

-Jose

dfguan commented 4 years ago

Hi Jose,

Thanks for trying purge_dups. I would use 84 for the middle read depth and 240 for the high, i.e. the command "calcuts -m 84 -u 240". You could also increase the low read depth for junk contigs, say to 20; the command would then be "calcuts -l 20 -m 84 -u 240 > cutoffs". Please use the latest release of purge_dups (v1.0.1).

Any questions, please feel free to ask.

Cheers,

Dengfeng.

joeplanta commented 4 years ago

Hi Dengfeng,

Thanks a lot. I will try your suggestions. I will try purge_dups with our other assemblies as well.

-Jose

bio15anu commented 4 years ago

Hi Dengfeng. We have an assembly of a homozygous diploid where we didn't expect to see duplication, yet the assembly yielded a genome size larger than expected. We thought this could be due to sequencing errors leading to duplicate contigs, so we had a look with BUSCO and purge_dups and obtained the plot below.

image

We believe the late peak is due to chloroplast DNA. Do you think the cutoffs here are appropriate? I'm not sure I understand exactly how we should optimise them manually. What values would you suggest in this case?

dfguan commented 4 years ago

Hi Adam, the cutoffs set here are not appropriate if it is a homozygous genome. I would guess the coverage is about 20X of the genome size? If so, I would choose 3 as the low coverage, 10~13 as the heterozygous coverage, and 60 as the high coverage; you can set them with "calcuts -l 3 -m 11 -u 60 PB.stat > cutoffs". But please note that the chloroplast DNA will be moved into the hap.fa file (the haplotig bin).

Any questions, please feel free to ask.

Best,

Dengfeng.

matryoskina commented 4 years ago

Hi @dfguan Can I also ask the same question as a follow-up? After aligning Nanopore reads against the genome and calculating stats and cutoffs, I get this message:

[M::calcuts] Find 1 peaks
[M::calcuts] mean: 45, peak: 37, mean larger than peak, treat as diploid assembly
[W::calcuts] mean is not significantly different with peak, please recheck the cutoffs

Cutoffs are: 5 28 46 55 92 165

From what I read in the previous messages, it seems I need to use bin/calcuts to set the cutoffs manually. But which values should I use? Thanks a lot for your help. PB cov

dfguan commented 4 years ago

Hi @matryoskina , based on your plot, you could try 6 for low, 25 for middle and 220 for high, use calcuts -l 6 -m 25 -u 220 PB.stat > cutoffs to set. Best, Dengfeng.

matryoskina commented 4 years ago

Hi Dengfeng. Thank you a lot for your help! I think I understand how to set the -l and -m parameters for the manual cutoffs, though I'm not sure about -u. I would also like to know whether it is possible to validate the results. I thought of mapping the raw reads against purged.fa, recalculating the cutoffs, and producing the histogram, which, if the process went fine, should look like a homozygous profile. Does that make sense? What I got is a two-peak curve. Also, the duplicated BUSCOs decreased (from 26.4% to 15.6%), but the N50 did not increase significantly, and the genome size is still bigger than expected (~500 Mb more). Best, Alessia

dfguan commented 4 years ago

Hi Alessia, -u is the cutoff for high-coverage contigs: if 80% of a contig's bases have coverage above -u, it is moved into the haplotig bin. There are three ways to validate the results. One is what you described: recalculate the coverage histogram and hope you observe a smooth Gaussian distribution. The other two are mentioned in the README FAQ; take a look here (https://github.com/dfguan/purge_dups#faq).

Best,

Dengfeng.
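The first validation idea above (recompute the read-depth histogram after purging and look for a single smooth peak) can be sketched in a few lines of Python. This is a toy illustration with made-up histograms, not purge_dups code:

```python
# Sketch: a purged assembly's depth histogram should be unimodal (one peak),
# while a duplicated assembly shows the half-depth haplotig peak as well.
# Histograms here are toy lists of counts indexed by read depth.

def count_peaks(hist, min_count=5):
    """Count local maxima with at least min_count reads in a depth histogram."""
    return sum(1 for i in range(1, len(hist) - 1)
               if hist[i] > hist[i - 1] and hist[i] > hist[i + 1]
               and hist[i] >= min_count)

before = [0, 3, 9, 15, 8, 4, 6, 11, 18, 7, 2]   # two peaks: duplication remains
after_ = [0, 1, 3, 6, 10, 15, 18, 14, 9, 4, 1]  # one peak: looks clean

print(count_peaks(before), count_peaks(after_))  # → 2 1
```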

xiekunwhy commented 4 years ago

Hi, @dfguan I got the assembly from SOAPdenovo + DBG2OLC, but the BUSCO duplication is very high (C:84.7%[S:56.1%,D:28.6%],F:3.5%,M:11.8%,n:1440) or (C:85.2%[S:55.3%,D:29.9%],F:8.5%,M:6.3%,n:425). I tried to use purge_dups to remove the redundant sequences, but I got a strange coverage histogram (below) and I don't know how to choose suitable cutoffs. Can you give some suggestions?

hist

Best, Kun

dfguan commented 4 years ago

Hi Kun,

What's the average read depth? Do you have an estimate? From your plot, I might suggest "calcuts -l 10 -m 60 -u 240", but it would be better to know the sequencing depth first.

Best,

Dengfeng.

xiekunwhy commented 4 years ago

Hi Dengfeng,

The average read depth of this assembly is ~74X. I will try "calcuts -l 10 -m 60 -u 240".

Best, Kun

lyrk50 commented 4 years ago

Hi Dengfeng, I tried purge_dups on my ONT assemblies. The automatic cutoffs for my data are 5 25 39 40 84 186. The calcults.log is:

[M::calcuts] Find 2 peaks
[M::calcuts] Merge local peaks and valleys: 2 peaks remain
[M::calcuts] Remove peaks and valleys less than 5: 2 peaks remain
[M::calcuts] Use top 3 frequent read depth
[M::calcuts] Found a valley in the middle of the peaks, use two-peak mode

image

I'm a little confused about how to use two-peak mode. Could you give me some advice?

dfguan commented 4 years ago

Hi, if the calcuts program finds two peaks in your coverage (read depth) histogram and a valley between them, it activates two-peak mode, which uses the valley as the cutoff for heterozygous sequences. Please let me know if you still have questions. Best, Dengfeng.
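The peak-and-valley idea can be illustrated with a short Python sketch. This is not calcuts's actual implementation, just a minimal version of the logic described above, run on a toy histogram:

```python
# Illustrative sketch of two-peak mode (not calcuts source code): find two
# peaks in a read-depth histogram and the valley between them; the valley
# depth separates haplotigs (left peak) from primary contigs (right peak).

def find_valley(hist):
    """Return (peak1, valley, peak2) depths for a bimodal histogram."""
    # Local maxima: strictly higher than both neighbours.
    peaks = [i for i in range(1, len(hist) - 1)
             if hist[i] > hist[i - 1] and hist[i] > hist[i + 1]]
    assert len(peaks) >= 2, "need two peaks for two-peak mode"
    p1, p2 = peaks[0], peaks[-1]
    # Valley: the depth with the lowest count between the two peaks.
    valley = min(range(p1, p2 + 1), key=lambda i: hist[i])
    return p1, valley, p2

# Toy histogram: counts indexed by read depth, peaks at depths 3 and 8.
hist = [0, 2, 8, 12, 7, 3, 5, 9, 14, 6, 1]
print(find_valley(hist))  # → (3, 5, 8)
```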

lyrk50 commented 4 years ago

> Hi, if calcuts program finds two peaks in your coverage (read depth) histogram, and finds a valley in the middle of the peaks, it will activate the two-peak mode, which uses the valley as the cutoff for heterozygous sequences. Please let me know if you still have questions. Best, Dengfeng.

Hi Dengfeng, thank you for your reply. I now understand the two-peak mode. I have problems with my other two assemblies and need to calculate the cutoffs manually. All my assemblies were built from the same ONT sequencing data, and the average read depth of each assembly is ~80X. The first automatic cutoffs are 5 47 77 93 155 279.

[M::calcuts] Find 1 peaks
[M::calcuts] mean: 64, peak: 62, mean larger than peak, treat as diploid assembly
[W::calcuts] mean is not significantly different with peak, please recheck the cutoffs

flye

The second automatic cutoffs are 5 46 76 91 152 273

[M::calcuts] Find 1 peaks
[M::calcuts] mean: 65, peak: 61, mean larger than peak, treat as diploid assembly
[W::calcuts] mean is not significantly different with peak, please recheck the cutoffs

wtdbg2 I don't know how to choose suitable cutoffs. Could you give some suggestions?

zhaotao1987 commented 3 years ago

Hi Dengfeng,

Could you please help check this one? I would like to hear your insights ;) Firstly, I think the default upper limit is a bit too high, right? How about lowering 528 to 320? I'm not sure. Secondly, does the high peak at low coverage mean my assembly has some errors that should be fixed and corrected? Thanks a lot!

PB cov

aramaiah commented 3 years ago

Hi Dengfeng, I used a PB assembly with the default cutoffs "5 57 95 115 191 345". The calcults.log shows:

[M::calcuts] Find 1 peaks
[M::calcuts] mean: 144, peak: 153, mean not larger than peak, treat as haploid assembly
[W::calcuts] mean is not significantly different with peak, please recheck the cutoffs

I am not sure how to choose suitable cutoffs. Could you give some suggestions, and briefly explain how you choose them? Thanks. PB cov

elissaralam commented 3 years ago

Hello Dengfeng,

I have a similar question to those posed above. The locally generated cutoff file contains: 5 11 23 24 40 96

I am not sure whether I should specify the cutoffs manually, although several papers that used purge_dups on the species I work with did so. I am not sure which lower, middle, and upper coverage cutoffs to choose and would really appreciate your help. Below is my histogram generated by hist_plot.py.

pb histo

Thanks very much, Elissar

dfguan commented 3 years ago

Hi Elissar, those cutoffs seem fine to me. Usually the valley (about 23 here) between the two high peaks is chosen to differentiate haplotigs from primary contigs. The valley near 0 (about 5 in your plot), although not seen very often, is chosen to mark the junk contigs. Best, Dengfeng.

gaurav-agavekar commented 3 years ago

Hello, thanks for purge_dups.

I'm trying to run purge_dups on an assembly obtained with HiCanu. I mapped CCS reads using the asm20 preset in minimap2 (is it correct that I also have to use asm20 for generating the self-self alignment?).

I got "mean is not significantly different with peak, please recheck the cutoffs" while calculating the cutoffs. I'm attaching my histogram from hist_plot.py; can you please advise what cutoffs I could use? I'm not sure which part of the distribution to use as a valley. Is it OK if I use -m 55?

My assembly has very low heterozygosity and pretty high repeat content (30%).

Thanks!

PB cov

dlavrov commented 3 years ago

Hi Xuyue,

What does the second number of the cutoffs mean? I got a negative number, but I don't see any negative numbers in the posts here. My cutoffs are: 5 -16 144 145 333 717. I'm using Sequel II CLR-mode raw sequences and a Canu genome assembly.

Thank you very much for the program and for all your help here! Best wishes, Dennis

PB

yangxiaofeill commented 3 years ago

Hi Dengfeng, I tried to run purge_dups on another of my assemblies. The automatic cutoffs are 5 10 36 37 59 144, and the plot looks like

notUsedToMerge fa pb stat2

The calculate.log shows

[M::calcuts] Find 2 peaks
[M::calcuts] Merge local peaks and valleys: 2 peaks remain
[M::calcuts] Remove peaks and valleys less than 5: 2 peaks remain
[M::calcuts] Use top 3 frequent read depth
[M::calcuts] Found a valley in the middle of the peaks, use two-peak mode

Then I ran purge_dups as purge_dups -2 -T cutoffs -c PB.base.cov $REF.split.self.paf.gz > dups.bed 2> purge_dups.log, but dups.bed was empty.

I then tried to set the cutoffs manually with calcuts -l 5 -m 37 -u 144 > cutoffs 2> calcults.log based on the histogram plot, and got cutoffs of 5 36 36 37 37 144. However, after purge_dups -2 -T cutoffs -c PB.base.cov $REF.split.self.paf.gz > dups.bed 2> purge_dups.log, dups.bed was still empty.

Could you please help me figure out why, and give me suggestions for setting the cutoffs?

Thanks Xiaofei

Astahlke commented 2 years ago

Hello! Any suggestions for cutoffs with this assembly? We suspect the sample is triploid, and the estimated genome size ranges between ~2.4-3.6 Gb depending on the lineage. Here's the histogram of CCS reads aligned against the primary pseudo-haplotype from hifiasm 0.16.1. We're trying to get down to a 1n haploid representation.

primary_hist

Using these cutoffs, purge_dups yielded an assembly with fewer duplicate Arachnida BUSCOs, but certainly still lots retained: it went from C:95.2%[S:29.0%,D:66.2%],F:2.0%,M:2.8% on the primary hifiasm contigs to C:94.1%[S:73.1%,D:21.0%],F:2.2%,M:3.7% on the purged primary. In terms of total length, we went from 5170.252 to 3560.207 Mb.

Any advice on improving the cutoffs to get this down to a 1n assembly? Anything else we should generate to get a better handle on what's happening here? Thank you!

17863952296 commented 1 year ago

Hello Dengfeng,

I have a similar question to those posed above. The locally generated cutoff file contains: 5 -14 32 33 55 132

I am not sure whether I should specify the cutoffs manually, although several papers that used purge_dups on the species I work with did so. I am not sure which lower, middle, and upper coverage cutoffs to choose and would really appreciate your help. Below is my histogram generated by hist_plot.py. PB cov Using these cutoffs, purge_dups yielded an assembly with fewer duplicate BUSCOs, but still lots retained: it went from C:95.0%[S:30.0%,D:65.0%],F:2.0%,M:3.0% on the primary hifiasm contigs to C:94.0%[S:65.0%,D:29.0%],F:3.0%,M:3.0% on the purged primary. In terms of total length, we went from 209.696 Mb to 184.428 Mb.

Any advice on improving the cutoffs to get this down to a 1n assembly? Anything else we should generate to get a better handle on what's happening here? Thank you!

kishor2019 commented 3 months ago

Hi! I am fairly new to genome studies. I am running purge_dups and setting the cutoffs manually from the histogram. Based on the image below, I have set the low, mid, and high cutoffs to 7, 37, and 110. Did I pick the points correctly? I would appreciate any help. Thanks!
