gbouras13 / pharokka

fast phage annotation program
MIT License

External tools Fine parameter tuning #299

Closed iferres closed 9 months ago

iferres commented 1 year ago

Description

Hi again @gbouras13, this is not a bug report, just some suggestions. I have been trying out and studying some of the external software pharokka uses, and I think some parameters could be adjusted to improve feature detection in phages.

For instance, an analysis of jumbophage CRISPR-Cas systems reports spacers of 14-20 nt in length (see here), whereas minced defaults to a minimum of 26. The same paper reports repeats of 26 nt in length; minced defaults to a minimum of 23, which is fine in this case. Lastly, minced reports only CRISPRs with at least 3 spacers, which I would relax to 2, since there are very compact CRISPR-Cas systems in these phages.

minced -h

MinCED, a program to find CRISPRs in shotgun DNA sequences or full genomes

Usage:    minced [options] file.fa [outputFile]

Options:  -searchWL  Length of search window used to discover CRISPRs (range: 6-9). Default: 8
          -minNR     Minimum number of repeats a CRISPR must contain. Default: 3
          -minRL     Minimum length of the CRISPR repeats. Default: 23
          -maxRL     Maximum length of the CRISPR repeats. Default: 47
          -minSL     Minimum length of the CRISPR spacers. Default: 26
...
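
As a rough illustration of the relaxed settings suggested above, the following Python sketch assembles a minced command line using only the flags shown in the help text (-minNR, -minRL, -minSL). The function name, file names, and the choice of 14 as the spacer minimum (the lower end of the 14-20 nt range mentioned above) are assumptions for illustration, not something pharokka does.

```python
import shlex


def build_minced_command(fasta, output, min_repeats=2, min_repeat_len=23, min_spacer_len=14):
    """Return an argv list for minced with relaxed thresholds (hypothetical helper).

    Follows the usage line from the help text: minced [options] file.fa [outputFile]
    """
    return [
        "minced",
        "-minNR", str(min_repeats),      # relaxed from the default of 3
        "-minRL", str(min_repeat_len),   # default of 23 kept as-is
        "-minSL", str(min_spacer_len),   # relaxed from the default of 26
        fasta,
        output,
    ]


# File names are placeholders.
cmd = build_minced_command("phage.fasta", "phage_minced.txt")
print(shlex.join(cmd))
```

If minced is installed, the same list could be passed to subprocess.run(cmd, check=True) to actually run it.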

For mash-based identification against the inphared index, I see that pharokka uses a mash distance threshold of 0.1 to report matches. I actually think that default is very good, but I can think of use cases where setting this parameter manually could be useful.

Pharokka may benefit from implementing this for some of the other external tools it uses.

I'm not sure what the best implementation for this feature is (and maybe it is already possible?), and I'm not enough of a Python guru to implement it myself and make a pull request in the short term, although I have forked the repo and may try in the future.

Thanks again for implementing pharokka!

gbouras13 commented 1 year ago

Hi @iferres ,

These are great points. I really appreciate the effort you have put into this and explaining it.

I actually think this shouldn't be too hard to implement.

For Minced I'd create some arguments like:

For mash I'd just add the distance in as a param --mash_distance.
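
A minimal sketch of how such a parameter might be wired up, assuming the value is treated as an upper bound on the distance of reported hits and that the input is the tab-separated output of mash dist (reference, query, distance, p-value, shared hashes). The helper names and example rows are made up for illustration; this is not pharokka's actual code.

```python
import argparse


def parse_args(argv=None):
    """Parse a user-settable distance threshold (default taken from the discussion above)."""
    parser = argparse.ArgumentParser(description="Filter mash dist hits (illustrative sketch)")
    parser.add_argument("--mash_distance", type=float, default=0.1,
                        help="maximum mash distance for a hit to be reported")
    return parser.parse_args(argv)


def filter_mash_hits(lines, max_distance):
    """Keep hits at or below max_distance, sorted from closest to farthest."""
    hits = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip malformed or empty lines
        distance = float(fields[2])
        if distance <= max_distance:
            hits.append((fields[0], fields[1], distance))
    return sorted(hits, key=lambda hit: hit[2])


# Made-up rows in mash dist column order.
example = [
    "phageA\tquery\t0.05\t0\t400/1000",
    "phageB\tquery\t0.15\t0\t120/1000",
    "phageC\tquery\t0.09\t0\t250/1000",
]
args = parse_args(["--mash_distance", "0.1"])
best = filter_mash_hits(example, args.mash_distance)
print(best)
```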

If you have other recommendations (trnascan, Aragorn) please let me know and I will consider it.

To be clear, you can't do this now; I'll add them in the next update of Pharokka :)

George

iferres commented 1 year ago

Thank you George, that parameter implementation looks very good to me. For now those are the only ones that caught my attention, but I'm working on these topics, so I may reach out to you again in the future.

Bests!

gbouras13 commented 9 months ago

Hi @iferres ,

I've implemented this in v1.6. If you are keen, it is on the dev branch; otherwise it should be out next week :)

George

iferres commented 9 months ago

Genius, thank you very much! Next week I will be working with fresh new phages, I will surely try the dev/new version and report back. Bests!