Protein sequence from gene call

snayfach commented 5 years ago

What's the best way to determine whether an ORF is protein coding, and which protein sequence it encodes?

carolzhou commented 5 years ago

I will answer for Kate. PHANOTATE returns the coordinates of protein-coding genes. To generate the amino-acid sequences automatically, you can use multiPhATE, a phage annotation pipeline that incorporates PHANOTATE. The parameter translate_only='true' will do gene calling and translation.

https://github.com/carolzhou/multiPhATE.git. See README.

Kate: Feel free to correct.

-Carol Zhou

snayfach commented 5 years ago

Can you briefly describe how the translation works? Does it apply the standard genetic code to the PHANOTATE ORFs?

carolzhou commented 5 years ago

multiPhATE's default genetic code for translation is "bacterial". It uses transeq from EMBOSS. Yes, it translates the PHANOTATE ORFs, or you can select another gene finder.
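For readers who want to run the translation step by hand, here is a minimal sketch of calling transeq from Python with the bacterial code (translation table 11). It is not taken from multiPhATE; the function names and file names are hypothetical, and it assumes EMBOSS is installed and that the input FASTA holds one in-frame ORF per record.

```python
import shutil
import subprocess


def transeq_command(nucleotide_fasta, protein_fasta, table=11):
    """Build the argument list for EMBOSS transeq (table 11 = bacterial code)."""
    return [
        "transeq",
        "-sequence", nucleotide_fasta,  # input nucleotide FASTA
        "-outseq", protein_fasta,       # output protein FASTA
        "-table", str(table),           # genetic code table number
    ]


def run_transeq(nucleotide_fasta, protein_fasta, table=11):
    """Run transeq, failing early if the executable is not on PATH."""
    if shutil.which("transeq") is None:
        raise RuntimeError("transeq (EMBOSS) was not found on PATH")
    subprocess.run(transeq_command(nucleotide_fasta, protein_fasta, table), check=True)
```

Calling run_transeq("orfs.fna", "orfs.faa") would then write the translated ORFs to orfs.faa; both file names are placeholders.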

snayfach commented 5 years ago

Thanks for your responses!

deprekate commented 5 years ago

You can output the protein-coding genes as nucleotide sequences using the -f fasta flag. Then you can use any tool to translate them into amino acids.

In the future I will add the ability to output them as amino acids. In fact, a bit of code from a subsequent PHANOTATE version made it into this release: https://github.com/deprekate/PHANOTATE/blob/7ebb90a5f324a62e885471b171454d87af22c50e/lib/orfs.py#L102 (an easy way to translate into amino acids in under 10 lines of Python code).
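As an illustration of how short such a translation can be, here is a minimal, dependency-free sketch. It is not the code at the linked line; the function names are hypothetical, and it assumes each FASTA record is already in the coding orientation and in frame. The codon-to-amino-acid mapping is the standard one, which the bacterial code (translation table 11) shares; table 11 differs only in which codons may act as starts.

```python
from itertools import product

# Codon table in NCBI order (bases T, C, A, G; third position varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Alternative start codons of translation table 11 (ATG already maps to M).
ALT_STARTS = {"GTG", "TTG", "CTG", "ATT", "ATC", "ATA"}


def translate(orf, is_start=True):
    """Translate an in-frame ORF; unknown codons become 'X', stops become '*'."""
    orf = orf.upper().replace("U", "T")
    protein = [CODON_TABLE.get(orf[i:i + 3], "X") for i in range(0, len(orf) - 2, 3)]
    # An alternative start codon is still read as methionine at initiation.
    if is_start and protein and orf[:3] in ALT_STARTS:
        protein[0] = "M"
    return "".join(protein)


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

To use it, open the FASTA file written with -f fasta, pass the file handle to read_fasta, and call translate on each sequence.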

snayfach commented 5 years ago

Thanks, I think that will be a useful addition.

Unfortunately, the tool is too slow for my needs, so I will be sticking with Prodigal for the time being. If you find a way to increase the speed, please let me know!

deprekate commented 5 years ago

Yep, since it is written in Python it is quite slow. I may need to rewrite certain parts of the code in C (like I did with fastpathz).

To speed things up considerably, you can run it with pypy instead of python, which is what I do: pypy phanotate.py INFILE.FNA

snayfach commented 5 years ago

Interesting. How much of a speedup do you get using pypy?

deprekate commented 5 years ago

It went from an hour to a minute. I just ran it on the T4 sequence file in the tests folder, which is 170 kb (so a relatively long genome), and it took 59 seconds on my old HP laptop.

snayfach commented 5 years ago

Can you add a note to the README that running the tool with pypy makes it faster? That information will be useful to others.

deprekate commented 5 years ago

Sure thing. I actually meant to, but I guess I didn't push the hint. It's quite amazing how much of a speedup comes from JIT-compiling Python code : )

deprekate / PHANOTATE

Protein sequence from gene call #7