[Open] Pikayy opened this issue 4 years ago
Hi Xuyue, the cutoffs can usually be selected automatically, but the automatic selection sometimes picks the wrong values. You can validate them visually with hist_plot.py, using the command "python3 hist_plot.py -c cutoff_file PB.stat PB.cov.png". As for the six numbers: the first is the low coverage; any contig with an average coverage below this number is a junk contig. The fourth number is the diploid (middle) coverage, and the last number is the high coverage for repeats; contigs with an average coverage above this number are usually repeats. Dengfeng.
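To make that description concrete, here is a minimal Python sketch of the classification rule. It is not part of purge_dups; the contig coverages are hypothetical, and the treatment of the middle numbers is a simplification of the rule described above:

```python
# Sketch of how the six cutoff numbers are interpreted, per the answer above:
# contigs below the first number are junk, contigs above the last number are
# repeats. Using the fourth (diploid/middle) number to split haploid-depth
# from diploid-depth contigs is a simplification for illustration only.

def classify_contig(avg_cov, cutoffs):
    """Classify a contig by its average read depth against the six cutoffs."""
    low = cutoffs[0]      # first number: junk threshold
    diploid = cutoffs[3]  # fourth number: diploid (middle) coverage
    high = cutoffs[5]     # last number: repeat threshold
    if avg_cov < low:
        return "junk"
    if avg_cov > high:
        return "repeat"
    if avg_cov < diploid:
        return "haploid-depth (candidate haplotig)"
    return "diploid-depth (keep as primary)"

cutoffs = [5, 45, 75, 90, 150, 270]  # example values from this thread
for cov in (3, 60, 120, 400):
    print(cov, "->", classify_contig(cov, cutoffs))
```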
Thanks Dengfeng! Now it makes sense to me. :)
Hi Dengfeng,
The automatically selected cutoffs for my data are 5 45 75 90 150 270
The calcults.log shows:
[M::calcuts] Find 1 peaks
[M::calcuts] mean: 61, peak: 60, mean larger than peak, treat as diploid assembly
[W::calcuts] mean is not significantly different with peak, please recheck the cutoffs
That means I should manually set the cutoffs.
I also plotted the coverage histogram with the hist_plot.py script, and it looks like
From this, I don't know how to reset the cutoffs. Could you help me or give me some suggestions?
Best Xiaofei
Hi Xiaofei,
You could set the cutoffs with "calcuts -m 45 -u 270 > cutoffs". Your plot seems pretty clean. Why do you want to purge it? What are the BUSCO scores?
Best,
Dengfeng.
Hi Dengfeng,
Ours is a newly assembled genome, so I wanted to check whether there are any duplicates. Following your suggestions, I reset the cutoffs and confirmed it is a clean assembly. The BUSCO score is 95%.
Thanks.
Best Xiaofei
Hi Dengfeng,
I am checking out your program to deduplicate haplotigs in my draft assembly. My original cutoffs are: 5 45 73 90 148 270
and I manually corrected the cutoffs with calcuts -m 73 -u 270 > cutoffs 2> calcults.log to generate purged.fa. Is this the correct parameter to use? I have also attached the histogram from the automatically generated cutoffs.
With these parameters I was able to bring the contig number in my draft assembly down from ~1100 to ~600.
Thank you!
-Jose
Hi Jose,
Thanks for trying purge_dups. I would use 84 for the middle read depth and 240 for the high one, i.e. the command "calcuts -m 84 -u 240". You could also increase the low read-depth cutoff for junk contigs, say to 20; the command would then be "calcuts -l20 -m84 -u240 > cutoffs". Please use the latest release of purge_dups (v1.0.1).
Any questions, please feel free to ask.
Cheers,
Dengfeng.
Hi Dengfeng,
Thanks a lot. I will try your suggestions. I will try purge_dups with our other assemblies as well.
-Jose
Hi Dengfeng. We have an assembly of a homozygous diploid where we didn't expect to see duplication, yet the assembly yielded a genome size larger than expected. We thought this could be due to sequencing errors leading to duplicate contigs, so we had a look with BUSCO and purge_dups and obtained the plot below.
We believe the late peak is due to chloroplast DNA. Do you think the cutoffs here are appropriate? I'm not sure I understand exactly how we should optimise them manually. What values would you suggest in this case?
Hi Adam, the cutoffs set here are not appropriate if it is a homozygous genome. I would guess the coverage is 20X of the genome size? If so, I would choose 3 as the low coverage, 10~13 as the heterozygous coverage and 60 as the high coverage; you can use "calcuts -l 3 -m 11 -u 60 PB.stat > cutoffs" to set them. But please note that the chloroplast DNA will be moved into the hap.fa file (the haplotig bin).
Any questions, please feel free to ask.
Best,
Dengfeng.
Hi @dfguan Can I also ask the same question as a follow-up? After aligning nanopore reads against the genome and calculating the stats and cutoffs, I get this message:
[M::calcuts] Find 1 peaks
[M::calcuts] mean: 45, peak: 37, mean larger than peak, treat as diploid assembly
[W::calcuts] mean is not significantly different with peak, please recheck the cutoffs
Cutoffs are: 5 28 46 55 92 165
From what I read in the previous messages, it seems to me that I need to use bin/calcuts to manually calculate the cutoffs. But which values should I use?
Thanks a lot for your help
Hi @matryoskina , based on your plot, you could try 6 for low, 25 for middle and 220 for high: use "calcuts -l 6 -m 25 -u 220 PB.stat > cutoffs" to set them.
Best, Dengfeng.
Hi Dengfeng. Thank you a lot for your help! I think I understand how to set the -l and -m parameters for the manual cutoff, though I'm not sure about -u. I would also like to know whether it's possible to validate the results. My idea was to map the raw reads against purged.fa, recalculate the cutoffs, and produce the histogram, which, if the process went fine, should look like a homozygous profile. Does that make sense? Because what I got is a two-peak curve. Also, the duplicated BUSCOs decreased (from 26.4% to 15.6%), but the N50 did not increase significantly, and the genome size is still bigger than expected (~500 Mb more). Best, Alessia
Hi Alessia, -u sets the high-coverage cutoff: if 80% of a contig's bases have coverage above -u, the contig is moved into the haplotig bin. There are three ways to validate the results. One is what you described: recalculate the coverage histogram and hope to observe a smooth Gaussian distribution. The other two ways are mentioned in the README FAQ; take a look here (https://github.com/dfguan/purge_dups#faq).
Best,
Dengfeng.
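As a rough way to script the first validation check, here is a small Python sketch that compares the mean and the peak (mode) of a depth histogram, mirroring the "[M::calcuts] mean: ..., peak: ..." log lines quoted in this thread. The histogram here is a synthetic (depth, count) list, not a parsed PB.stat; adapt the loading step to your own files:

```python
# Sketch: compare the mean and the peak (mode) of a read-depth histogram,
# as in the calcuts log lines quoted above. Input is a synthetic
# (depth, count) list; this is not the calcuts implementation.

def mean_and_peak(pairs):
    """pairs: list of (depth, count). Returns (mean_depth, peak_depth)."""
    total = sum(count for _, count in pairs)
    mean = sum(depth * count for depth, count in pairs) / total
    peak = max(pairs, key=lambda dc: dc[1])[0]
    return mean, peak

# Hypothetical clean, single-peak histogram centred on depth 60:
hist = [(d, max(0, 1000 - 30 * abs(d - 60))) for d in range(1, 121)]
mean, peak = mean_and_peak(hist)
print(f"mean: {mean:.0f}, peak: {peak}")
# After a successful purge, mean and peak should be close and the
# histogram should show a single smooth peak.
```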
Hi, @dfguan I got the assembly from soapdenovo+dbg2olc, but the BUSCO duplication is very high (C:84.7%[S:56.1%,D:28.6%],F:3.5%,M:11.8%,n:1440) or (C:85.2%[S:55.3%,D:29.9%],F:8.5%,M:6.3%,n:425). I tried to use purge_dups to remove the redundant sequences, but I got a strange coverage histogram (below) and don't know how to choose suitable cutoffs. Can you give me some suggestions?
Best, Kun
Hi Kun,
What's the average read depth? Do you have an estimate? From your plot, I might suggest "calcuts -l 10 -m 60 -h 240", but it's better to know the sequencing depth first.
Best,
Dengfeng.
Hi Dengfeng,
The average read depth of this assembly is ~74X. I will try "calcuts -l 10 -m 60 -h 240".
Best, Kun
Hi Dengfeng,
I tried purge_dups with my ONT assemblies.
The automatically selected cutoffs for my data are 5 25 39 40 84 186.
The calcults.log shows:
[M::calcuts] Find 2 peaks
[M::calcuts] Merge local peaks and valleys: 2 peaks remain
[M::calcuts] Remove peaks and valleys less than 5: 2 peaks remain
[M::calcuts] Use top 3 frequent read depth
[M::calcuts] Found a valley in the middle of the peaks, use two-peak mode
I'm a little confused about how to use two-peak mode. Could you give me some advice?
Hi, if the calcuts program finds two peaks in your coverage (read-depth) histogram and a valley between them, it activates two-peak mode, which uses the valley as the cutoff for heterozygous sequences. Please let me know if you still have questions. Best, Dengfeng.
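For readers who want to see the two-peak idea in code, here is a toy Python sketch that finds the two tallest peaks of a depth histogram and the valley between them. This is not the calcuts implementation, and the bimodal histogram is synthetic:

```python
# Toy illustration of two-peak mode as described above: locate the two
# tallest local maxima of a depth histogram and the lowest bin (valley)
# between them. NOT the calcuts algorithm, only a sketch of the concept.

def find_valley(counts):
    """counts[d] = number of bases at depth d. Returns (peak1, valley, peak2)."""
    # local maxima (ignoring the first and last bins)
    peaks = [d for d in range(1, len(counts) - 1)
             if counts[d] >= counts[d - 1] and counts[d] > counts[d + 1]]
    # keep the two tallest peaks, in increasing depth order
    p1, p2 = sorted(sorted(peaks, key=lambda d: counts[d])[-2:])
    valley = min(range(p1, p2 + 1), key=lambda d: counts[d])
    return p1, valley, p2

# Synthetic bimodal histogram: haplotig peak near 25, diploid peak near 50.
counts = [0] * 80
for d in range(80):
    counts[d] = max(0, 300 - 20 * abs(d - 25)) + max(0, 500 - 25 * abs(d - 50))

p1, valley, p2 = find_valley(counts)
print(p1, valley, p2)  # the valley is what two-peak mode uses as the het cutoff
```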
Hi Dengfeng, thank you for your reply; I now understand the two-peak mode. I have some problems with my other two assemblies, where I need to set the cutoffs manually. All my assemblies were built from the same ONT sequencing data, and the average read depth of each assembly is ~80X.
The first automatically selected cutoffs are 5 47 77 93 155 279:
[M::calcuts] Find 1 peaks
[M::calcuts] mean: 64, peak: 62, mean larger than peak, treat as diploid assembly
[W::calcuts] mean is not significantly different with peak, please recheck the cutoffs
The second automatically selected cutoffs are 5 46 76 91 152 273:
[M::calcuts] Find 1 peaks
[M::calcuts] mean: 65, peak: 61, mean larger than peak, treat as diploid assembly
[W::calcuts] mean is not significantly different with peak, please recheck the cutoffs
I don't know how to choose suitable cutoffs; could you give me some suggestions?
Hi Dengfeng,
Could you please help check this one? I would like to hear your insights ;) First, I think the default upper limit is a bit too high, right? How about lowering 528 to 320? I'm not sure. Second, does the high peak at low coverage mean my assembly has some errors that should be fixed? Thanks a lot!
Hi Dengfeng, I used a PB assembly with the default cutoffs "5 57 95 115 191 345". The calcults.log file shows:
[M::calcuts] Find 1 peaks
[M::calcuts] mean: 144, peak: 153, mean not larger than peak, treat as haploid assembly
[W::calcuts] mean is not significantly different with peak, please recheck the cutoffs
I am not sure how to choose suitable cutoffs. Could you give some suggestions, and briefly explain how you choose them? Thanks.
Hello Dengfeng,
I have a similar question to those posed above. The locally generated cutoff file contains: 5 11 23 24 40 96
I am not sure whether I should manually specify the cutoffs, although several papers that use purge_dups on the species I work with did specify them manually. I am not sure which lower, middle, and upper coverage cutoffs to choose and would really appreciate your help. Below is my histogram generated with hist_plot.py
Thanks very much, Elissar
Hi Elissar, those cutoffs seem fine to me. Usually the valley (about 23 here) between the two high peaks is chosen to differentiate haplotigs from primary contigs. The valley near 0 (about 5 in your plot), although not seen very often, is chosen to mark the junk contigs. Best, Dengfeng.
Hello, thanks for purge_dups.
I'm trying to run purge_dups on an assembly obtained with HiCanu. I mapped the CCS reads using the asm20 preset in minimap2 (is it correct that I should also use asm20 for the self-self alignment?).
I got "mean is not significantly different with peak, please recheck the cutoffs" while calculating the cutoffs. I'm attaching my histogram from hist_plot.py; can you please advise which cutoffs I could use? I'm not sure which part of the distribution to use as the valley. Is it OK if I use -m 55?
My assembly has very low heterozygosity and pretty high repeat content (30%).
Thanks!
Hi Xuyue,
What does the second number of the cutoffs mean? I got a negative number, but I don't see any negative numbers among the posts here. My cutoffs are: 5 -16 144 145 333 717. I'm using Sequel II CLR raw sequences and a Canu genome assembly.
Thank you very much for the program and for all your help here! Best wishes, Dennis
Hi Dengfeng,
I tried purge_dups on another assembly. The automatically selected cutoffs are 5 10 36 37 59 144, and the plot looks like
The calculate.log shows
[M::calcuts] Find 2 peaks
[M::calcuts] Merge local peaks and valleys: 2 peaks remain
[M::calcuts] Remove peaks and valleys less than 5: 2 peaks remain
[M::calcuts] Use top 3 frequent read depth
[M::calcuts] Found a valley in the middle of the peaks, use two-peak mode
Then I tried to run purge_dups as purge_dups -2 -T cutoffs -c PB.base.cov $REF.split.self.paf.gz > dups.bed 2> purge_dups.log; however, the dups.bed is empty.
I tried to set the cutoffs manually with calcuts -l 5 -m 37 -u 144 > cutoffs 2> calcults.log based on the histogram plot, and got cutoffs of 5 36 36 37 37 144.
However, after purge_dups -2 -T cutoffs -c PB.base.cov $REF.split.self.paf.gz > dups.bed 2> purge_dups.log, the dups.bed is still empty.
Could you please help me figure out why, and give me suggestions for setting the cutoffs?
Thanks Xiaofei
Hello! Any suggestions for cutoffs with this assembly? We suspect the sample is triploid, and the estimated genome size ranges between ~2.4-3.6 Gb depending on the lineage. Here's the histogram of CCS reads aligned against the primary pseudo-haplotype from hifiasm 0.16.1. We're trying to get down to a 1n haploid representation.
Using these cutoffs, purge_dups yielded an assembly with fewer duplicated arachnid BUSCOs, but certainly still lots retained: it went from C:95.2%[S:29.0%,D:66.2%],F:2.0%,M:2.8% on the primary hifiasm contigs to C:94.1%[S:73.1%,D:21.0%],F:2.2%,M:3.7% on the purged primary. In terms of total length, we went from 5170.252 Mb to 3560.207 Mb.
Any advice to improve cutoffs on this to get it down to 1n assembly? Anything else we should generate to get a better handle on what's happening here? Thank you!
Hello Dengfeng,
I have a similar question to those posed above. The locally generated cutoff file contains: 5 -14 32 33 55 132
I am not sure whether I should manually specify the cutoffs, although several papers that use purge_dups on the species I work with did specify them manually. I am not sure which lower, middle, and upper coverage cutoffs to choose and would really appreciate your help.
Below is my histogram generated from hist_plot.py
Using these cutoffs, purge_dups yielded an assembly with fewer duplicated arachnid BUSCOs, but certainly still lots retained: it went from C:95.0%[S:30.0%,D:65.0%],F:2.0%,M:3.0% on the primary hifiasm contigs to C:94.0%[S:65.0%,D:29.0%],F:3.0%,M:3.0% on the purged primary. In terms of total length, we went from 209.696 Mb to 184.428 Mb.
Any advice to improve cutoffs on this to get it down to 1n assembly? Anything else we should generate to get a better handle on what's happening here? Thank you!
Hi! I am a bit less oriented toward genome study. I am running purge_dups and setting the cutoffs manually from the histogram. Based on the following image, I have set the low, mid, and high cutoffs as 7, 37, and 110. Did I pick the points correctly? I would appreciate any help. Thanks!
Hi, I have been trying to manually select the cutoffs by running hist_plot.py. As I understand it, we are supposed to select the cutoff values from the count vs. read-depth image and manually change the "cutoffs" file. The "cutoffs" file contains one line with 6 numbers. Could you explain the meaning of each of the 6 numbers, and which one I should look at to validate the cutoff results?
Many thanks!