deprekate / PHANOTATE

PHANOTATE: a tool to annotate phage genomes.
GNU General Public License v3.0
69 stars 8 forks source link

Cutoff for phanotate score #12

Closed aponsero closed 4 years ago

aponsero commented 4 years ago

Hi!

Thank you for Phanotate, the tool is really useful and easy to use. I've been using it for my own work, and I was wondering what is a safe cutoff for selecting the ORFs predicted by Phanotate?

Also, I've written a short script to parse the Phanotate output and spit out fasta-format. Please use it as you want/need.

Thank you so much !

Cheers

#!/usr/bin/python
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
import argparse
import os
import sys

def get_args():
    """Get command-line arguments"""

    parser = argparse.ArgumentParser(
        description='Parse results phanotate',
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)

    parser.add_argument('-i',
                        '--pinput',
                        help='Phanotate output file',
                        metavar='PINPUT',
                        type=str,
                        default="")

    parser.add_argument('-f',
                        '--finput',
                        help='Sequence fasta file',
                        metavar='FINPUT',
                        type=str,
                        default="")

    parser.add_argument('-p',
                        '--protein',
                        help='Protein output file name',
                        metavar='PROT',
                        type=str,
                        default="")

    parser.add_argument('-g',
                        '--gene',
                        help='Gene output file name',
                        metavar='GENE',
                        type=str,
                        default="")

    return parser.parse_args()

def main():
    args=get_args()
    fasta_file=args.finput
    phanotate_file=args.pinput
    protein_file=args.protein
    gene_file=args.gene

    record_dict = SeqIO.to_dict(SeqIO.parse(fasta_file, "fasta"))

    phanotate_CDS=[]
    phanotate_proteins=[]
    i=0

    # read the phanotate output file as a tsv
    with open(phanotate_file) as f:
        lines = f.read().split('\n')[:-1]
        for c, line in enumerate(lines):
            if not line.startswith('#'): #skip the comment lines from the file
                i=i+1
                data = line.split();
                curr_record=record_dict[data[3]]

                # get start and end of the CDS
                if int(data[0]) < int(data[1]):
                    start=int(data[0])-1
                    end=int(data[1])-1
                else:
                    start=int(data[1])
                    end=int(data[0])

                curr_CDS = curr_record.seq[start:end]

                # get reverse complement if strand == -
                if data[2] == "-":
                    curr_CDS=curr_CDS.reverse_complement()
                    print("reverse complement")
                else:
                    print(curr_CDS)

                #get a name for the CDS and create a record
                CDS_id=data[3]+"_CDS_"+str(i)
                new_record = SeqRecord(curr_CDS,CDS_id, '', '')
                new_record_prot = SeqRecord(curr_CDS.translate(),CDS_id, '', '')            
                # add record in the output list
                phanotate_CDS.append(new_record)
                phanotate_proteins.append(new_record_prot)

                # print verifications
                print(new_record.id)
                print(str(start)+"--"+str(end))

    # write output file
    SeqIO.write(phanotate_CDS, gene_file, "fasta")

    # write protein output file
    SeqIO.write(phanotate_proteins, protein_file, "fasta")

if __name__ == '__main__':
    main()
deprekate commented 4 years ago

Alise,

Glad that PHANOTATE is working well for you. Thanks for the script, it can come in handy for situations where PHANOTATE has already been run using the default 'tabular' output file type.

Determining a cutoff for the scores is tricky, since the score means vary depending on the GC content (low GC have higher scores than high GC genomes).

Shown below is the probability of encountering a stop codon in the lambda phage genome (average GC content) and a mycobacterium phage (high GC content); and the score for an (unweighted) 30 codon ORF

> # lamba phage
> pstop = 0.047
> -1/(1-pstop)**30
[1] -4.238508
> # mycobacterium phage
> pstop = 0.028
> -1/(1-pstop)**30
[1] -2.344294

This means a fixed cutoff cannot be used very well, although anything above -1 can usually be discarded.

I have used a dynamic score based on the above 30 codon calculations log value, with mixed success.

I also tested out using clustering methods (like kmeans) to partition the scores as well. However, I have began working on version 2.0, which will incorporate a better method for creating a "training set" for the gene profile. This will move around the scores, so I will wait until after to resume cutoff testing.

~Katelyn~

aponsero commented 4 years ago

Hi!

Thanks for your answer! Yes, I see, that makes sense. Indeed a fixed cutoff seem difficult to achieve, especially when working with assembled metagenomes.

good luck with version 2.0!

Alise