Alise,
Glad that PHANOTATE is working well for you. Thanks for the script, it can come in handy for situations where PHANOTATE has already been run using the default 'tabular' output file type.
Determining a cutoff for the scores is tricky, since the typical score varies with GC content: low-GC genomes give stronger (more negative) scores than high-GC genomes.
Shown below are the probability of encountering a stop codon in the lambda phage genome (average GC content) and in a mycobacterium phage genome (high GC content), along with the resulting score for an (unweighted) 30-codon ORF:
> # lambda phage
> pstop = 0.047
> -1/(1-pstop)**30
[1] -4.238508
> # mycobacterium phage
> pstop = 0.028
> -1/(1-pstop)**30
[1] -2.344294
This means a single fixed cutoff does not work well across genomes, although anything scoring above -1 can usually be discarded.
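For anyone who wants to try other stop-codon probabilities or ORF lengths, the same calculation can be wrapped in a small Python function. This is only a sketch of the unweighted formula quoted above; the function name `unweighted_orf_score` is made up for illustration and is not part of PHANOTATE.

```python
def unweighted_orf_score(pstop, codons=30):
    """Unweighted ORF score from the thread: -1 / (1 - pstop)**codons.

    pstop  -- probability of encountering a stop codon in the genome
    codons -- ORF length in codons (30 in the examples above)
    """
    return -1.0 / (1.0 - pstop) ** codons


# The two genomes discussed above.
lambda_score = unweighted_orf_score(0.047)  # lambda phage, average GC
myco_score = unweighted_orf_score(0.028)    # mycobacterium phage, high GC

print(f"lambda phage:        {lambda_score:.6f}")
print(f"mycobacterium phage: {myco_score:.6f}")
```

The two printed values should match the R output above, and they show why the same 30-codon ORF lands at quite different scores depending on the genome.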
I have used a dynamic cutoff based on the log of the above 30-codon calculation, with mixed success.
I also tested clustering methods (like k-means) to partition the scores. However, I have begun working on version 2.0, which will incorporate a better method for creating a "training set" for the gene profile. This will shift the scores around, so I will wait until afterwards to resume cutoff testing.
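To give a rough idea of what partitioning the scores with k-means could look like, here is a minimal one-dimensional, two-cluster sketch in plain Python. It is not the code used in the tests mentioned above: the function name `two_means_1d` and the example scores are hypothetical, and real score distributions may need a transformation before clustering is useful.

```python
def two_means_1d(values, iterations=100):
    """Split 1-D values into two clusters with plain k-means (k = 2).

    Returns (centers, groups); index 0 is the cluster that started at the
    minimum value, i.e. the strongest (most negative) scores.
    """
    centers = [min(values), max(values)]  # start from the two extremes

    def assign(current):
        groups = ([], [])
        for v in values:
            nearest = 0 if abs(v - current[0]) <= abs(v - current[1]) else 1
            groups[nearest].append(v)
        return groups

    for _ in range(iterations):
        groups = assign(centers)
        # Recompute each center as the mean of its group (keep it if empty).
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return centers, assign(centers)


# Hypothetical scores, for illustration only (not real PHANOTATE output).
scores = [-0.2, -0.5, -0.9, -1.1, -35.0, -40.0, -52.0]
centers, (strong, weak) = two_means_1d(scores)
print("strong cluster:", sorted(strong))
print("weak cluster:  ", sorted(weak))
```

The appeal of this approach is that the split adapts to each genome's score distribution instead of relying on one fixed number.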
~Katelyn~
Hi!
Thanks for your answer! Yes, I see, that makes sense. Indeed, a fixed cutoff seems difficult to achieve, especially when working with assembled metagenomes.
Good luck with version 2.0!
Alise
Hi!
Thank you for Phanotate, the tool is really useful and easy to use. I've been using it for my own work, and I was wondering what a safe score cutoff would be for selecting the ORFs predicted by Phanotate?
Also, I've written a short script to parse the Phanotate output and write it out in FASTA format. Feel free to use it however you want or need.
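For readers without the script at hand, the general idea can be sketched as follows. This is not the script mentioned above. It assumes the tabular output is tab-separated with the columns START, STOP, FRAME, CONTIG, SCORE, that header lines start with `#`, and that a FRAME beginning with `-` marks the reverse strand; please check these assumptions against your own output. The names `orfs_to_fasta` and `max_score` are made up for illustration.

```python
# Assumed (unconfirmed) column layout: START, STOP, FRAME, CONTIG, SCORE.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_to_fasta(table_text, contigs, max_score=None):
    """Convert tabular ORF calls into FASTA-formatted text.

    table_text -- content of the tabular output file
    contigs    -- dict mapping contig name to its nucleotide sequence
    max_score  -- optional cutoff; ORFs scoring above it are skipped
    """
    records = []
    for line in table_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        start, stop, frame, contig, score = line.strip().split("\t")[:5]
        score = float(score)
        if max_score is not None and score > max_score:
            continue
        # Coordinates are treated as 1-based and inclusive, in either order.
        low, high = sorted((int(start), int(stop)))
        seq = contigs[contig][low - 1:high]
        if frame.startswith("-"):
            seq = reverse_complement(seq)
        records.append(f">{contig}_{start}_{stop} score={score}\n{seq}")
    return "\n".join(records)


# Tiny made-up example: one contig, one forward ORF, one weak reverse ORF.
contigs = {"c1": "AAATGCCCTAAGG"}
table = "#START\tSTOP\tFRAME\tCONTIG\tSCORE\n3\t11\t+\tc1\t-5.0\n11\t3\t-\tc1\t-0.5\n"
fasta = orfs_to_fasta(table, contigs, max_score=-1.0)
print(fasta)
```

Passing `max_score` applies a simple fixed cutoff while writing the sequences, which makes it easy to compare different cutoff values on the same run.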
Thank you so much!
Cheers