NBISweden / Earth-Biogenome-Project-pilot

Assembly and Annotation workflows for analysing data in the Earth Biogenome Project pilot project.
https://www.earthbiogenome.org/
GNU General Public License v3.0
10 stars 8 forks source link

New Module: PurgeDups #30

Closed mahesh-panchal closed 10 months ago

mahesh-panchal commented 2 years ago

Which tool should be included?

https://github.com/dfguan/purge_dups

How is it used?

for i in $pb_list
do
    minimap2 -x map-hifi $pri_asm $i | gzip -c - > $i.paf.gz
done
bin/pbcstat *.paf.gz (produces PB.base.cov and PB.stat files)
bin/calcuts PB.stat > cutoffs 2>calcults.log
bin/split_fa $pri_asm > $pri_asm.split
minimap2 -xasm5 -DP $pri_asm.split $pri_asm.split | gzip -c - > $pri_asm.split.self.paf.gz
bin/purge_dups -2 -T cutoffs -c PB.base.cov $pri_asm.split.self.paf.gz > dups.bed 2> purge_dups.log
bin/get_seqs -e dups.bed $pri_asm 
cat hap.fa $hap_asm

Which workflow should it be included in?

Post assembly curation

mahesh-panchal commented 1 year ago

Ksenia from Sanger, made some important comments on the pull request: https://github.com/NBISweden/Earth-Biogenome-Project-pilot/pull/34.

mahesh-panchal commented 1 year ago

Note that the GeneScopeFK model.txt can be parsed for the haploid k-mer peak.

Formula: y_transform ~ x^transform_exp * length * predict2_0(r1, k, d, 
    kmercov, bias, x)

Parameters:
         Estimate Std. Error t value Pr(>|t|)    
d       1.388e-02  4.864e-04   28.52   <2e-16 ***
r1      2.795e-03  4.404e-05   63.47   <2e-16 ***
kmercov 8.506e+00  4.977e-03 1708.92   <2e-16 ***
bias    7.423e-01  5.827e-03  127.38   <2e-16 ***
length  2.120e+09  1.700e+06 1246.70   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4296000 on 987 degrees of freedom

Number of iterations to convergence: 5 
Achieved convergence tolerance: 1.49e-08

In this case the value is 8.506e+00 from kmercov 8.506e+00 4.977e-03 1708.92 <2e-16 *** This is rounded to 8.51 in the GeneScopeFK image. This can be used in purge_dups somehow to correct the expected coverage since in long read datasets, the k-mer coverage is equivalent to the read coverage for long reads.

mahesh-panchal commented 1 year ago
printf "%.2f\n" $(grep "^kmercov" gsfk_model.txt | cut -d" " -f2 )

Will extract k-mer coverage and round.