Closed mahesh-panchal closed 10 months ago
Ksenia from Sanger, made some important comments on the pull request: https://github.com/NBISweden/Earth-Biogenome-Project-pilot/pull/34.
Note that the GeneScopeFK model.txt can be parsed for the haploid k-mer peak.
Formula: y_transform ~ x^transform_exp * length * predict2_0(r1, k, d,
kmercov, bias, x)
Parameters:
Estimate Std. Error t value Pr(>|t|)
d 1.388e-02 4.864e-04 28.52 <2e-16 ***
r1 2.795e-03 4.404e-05 63.47 <2e-16 ***
kmercov 8.506e+00 4.977e-03 1708.92 <2e-16 ***
bias 7.423e-01 5.827e-03 127.38 <2e-16 ***
length 2.120e+09 1.700e+06 1246.70 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4296000 on 987 degrees of freedom
Number of iterations to convergence: 5
Achieved convergence tolerance: 1.49e-08
In this case the value is 8.506e+00
from kmercov 8.506e+00 4.977e-03 1708.92 <2e-16 ***
This is rounded to 8.51
in the GeneScopeFK image.
This can be used in purge_dups somehow to correct the expected coverage since in long read datasets, the k-mer coverage is equivalent to the read coverage for long reads.
printf "%.2f\n" $(grep "^kmercov" gsfk_model.txt | cut -d" " -f2 )
Will extract k-mer coverage and round.
Which tool should be included?
https://github.com/dfguan/purge_dups
How is it used?
Which workflow should it be included in?
Post assembly curation