PATRIC3 / patric3_website

Legacy PATRIC Website (JBoss Portal Version)
MIT License

Weird protein annotated by RASTtk #376

Open ARWattam opened 9 years ago

ARWattam commented 9 years ago

I annotated a version of Brucella BO2 with RASTtk in PATRIC and it called a very strange gene. Its starting sequence is G. This shouldn't be happening.

[Screenshot: 2015-08-05 at 12:25:47 PM]

mshukla1 commented 9 years ago

Duplicate / Similar to: #246

mshukla1 commented 9 years ago

This gene is called near the contig boundary and is probably partial/truncated. This issue is similar to the one you reported before. Gary is here this week, so we will try to discuss it with him and see if we can reach a resolution.

[Screenshot: 2015-08-05 at 11:39:42 AM]

ARWattam commented 9 years ago

We all know that it shouldn't be called at all, right? It doesn't matter if it's at a contig boundary.

I hope you appreciate that I am riding in a car and reporting bugs at the same time!

olsonanl commented 9 years ago

This was my earlier email on the topic.

Rebecca came across an issue with the Prodigal gene calls on her collaborator's Brucella genome.

Prodigal created a gene call at the start of the contig that did not begin with a start codon. It looks like, by default, Prodigal inserts a start before the start of the contig:

Each node in the dynamic programming matrix is either a start codon (ATG, GTG, or TTG only: the program does not consider nonstandard starts such as ATA, ATT, or CTG) or a valid stop codon (specified by the translation table code). In addition, start and stop nodes are added in each frame at the edges of the sequence to handle cases where genes run off the edge of contigs, a common occurrence in draft and metagenomic sequence data.

This is the call we got with the default Prodigal settings (as used in the Prodigal.pm code used by RASTtk). It correctly notes it as a partial gene:

unitig_0|quiver_1 # 2 # 229 # 1 # ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.504
ARLKRCGMPQDRIENAFNAAHLHGTGFLKRRFISVTERMVYEAIADYLQLPYTEEILLRV
FLFPVKISVLPISGR*

Prodigal has an option to disallow this:

-c: Closed ends. Do not allow genes to run off edges.

When I run using that, I get a different gene at the start:

unitig_0|quiver_1 # 23 # 229 # 1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.498
MPQDRIENAFNAAHLHGTGFLKRRFISVTERMVYEAIADYLQLPYTEEILLRVFLFPVKI
SVLPISGR*
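
For anyone who wants to filter these calls downstream, here is a minimal Python sketch of reading the edge flags out of the two Prodigal headers above. The function name `parse_prodigal_header` is hypothetical (it is not part of Prodigal or RASTtk), and it assumes the `partial=` field follows Prodigal's documented convention: one digit per edge, left then right, with `1` meaning the gene is unfinished at that edge.

```python
def parse_prodigal_header(header: str) -> dict:
    """Parse a Prodigal protein FASTA header (hypothetical helper, not a Prodigal API)."""
    # Header layout: <id> # <start> # <end> # <strand> # key=value;key=value;...
    parts = [p.strip() for p in header.lstrip(">").split(" # ")]
    seq_id, start, end, strand, attrs = parts
    fields = dict(kv.split("=", 1) for kv in attrs.split(";") if kv)
    partial = fields.get("partial", "00")
    return {
        "id": seq_id,
        "start": int(start),
        "end": int(end),
        "strand": int(strand),
        "start_type": fields.get("start_type"),
        # First digit is the left (low-coordinate) edge, second is the right edge.
        "left_incomplete": partial[0] == "1",
        "right_incomplete": partial[1] == "1",
    }


default_call = parse_prodigal_header(
    "unitig_0|quiver_1 # 2 # 229 # 1 # ID=1_1;partial=10;start_type=Edge;"
    "rbs_motif=None;rbs_spacer=None;gc_cont=0.504"
)
closed_call = parse_prodigal_header(
    "unitig_0|quiver_1 # 23 # 229 # 1 # ID=1_1;partial=00;start_type=ATG;"
    "rbs_motif=None;rbs_spacer=None;gc_cont=0.498"
)
print(default_call["left_incomplete"], closed_call["left_incomplete"])
```

On the forward strand the left edge is the 5' end, so the default call is the one flagged as missing its start.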

Glimmer called it starting at 23:

fig|2.2.peg.1 unitig_0|quiver_23_229

GeneMark called it starting at 2:

unitig_0|quiver GeneMark.hmm CDS 2 229 -304.904274 + 0 gene_id=1, length=228, gene_score=-304.904274, rbs_score=-0.013333, rbs_spacer=-1, stop_enforced=N, start_codon=0, logodd=8.496079

What isn't yet clear is (a) which call is more 'correct' (Rebecca wants to look at an MSA to get a feel for this) and (b) how we should handle this in RASTtk.

Thoughts?

olsonanl commented 9 years ago

Gary then noted:

In the GenomeTypeObject, we have a place to record the assertion that a feature is incomplete at an end (or both). The SEED has a function that tries to guess this dynamically. My hope would be that we call the features, mark them as incomplete, and then let the user filter what they want to see. This does imply not putting an M as the start of a protein sequence that is believed to be truncated.
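
A rough Python sketch of the last point, assuming bacterial translation table 11 and a forward-strand, in-frame nucleotide sequence. The function `translate_feature` and its `five_prime_complete` argument are made up for illustration; they are not the GenomeTypeObject or SEED API. The idea is only that the first residue is rewritten to M when the feature is believed to have a real start.

```python
BASES = "TCAG"
# Standard amino acid assignments in NCBI codon order (TTT, TTC, TTA, TTG, TCT, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)
START_CODONS = {"ATG", "GTG", "TTG"}  # the starts Prodigal considers


def translate_feature(dna: str, five_prime_complete: bool) -> str:
    """Translate an in-frame sequence; force M only for a complete 5' end."""
    dna = dna.upper()
    codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
    protein = [CODON_TABLE.get(codon, "X") for codon in codons]
    if five_prime_complete and codons and codons[0] in START_CODONS:
        protein[0] = "M"  # initiator codon is read as methionine
    return "".join(protein)


print(translate_feature("GTGGAATTT", five_prime_complete=True))
print(translate_feature("GTGGAATTT", five_prime_complete=False))
```

With this approach a truncated feature keeps whatever residue its first codon encodes, and the incomplete flag is what tells the user why the protein does not start with M.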

mshukla1 commented 9 years ago

For a bit of clarification, below are the details for the flhb gene as annotated by RAST (public genome) and as annotated by RASTtk (Rebecca's private version).

RAST picked up a canonical start at position 18 and called it as a full gene. RASTtk picked a start at position 3, which gives the odd starting residue.

patric_id,accession,start,aa_sequence
fig|693750.9.peg.942,NZ_ADFA01000055,3,GRAGQVEFLKSLAKLLAA
fig|693750.4.peg.946,NZ_ADFA01000055,18,MEFLKSLAKLLAA
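
As a quick sanity check on the two rows above, here is a small Python sketch that lines up the two calls by their start coordinates. It assumes both features are on the forward strand and uses only the sequence prefixes shown in the table; `compare_calls` is a hypothetical helper written for this comment.

```python
def compare_calls(long_call: dict, short_call: dict):
    """Align two calls on the same contig by start coordinate (forward strand assumed)."""
    offset_nt = short_call["start"] - long_call["start"]
    in_frame = offset_nt % 3 == 0
    offset_aa = offset_nt // 3
    # Drop the extra N-terminal residues from the longer call, then compare.
    aligned = long_call["aa"][offset_aa:]
    shared = min(len(aligned), len(short_call["aa"]))
    mismatches = [
        (i, aligned[i], short_call["aa"][i])
        for i in range(shared)
        if aligned[i] != short_call["aa"][i]
    ]
    return in_frame, offset_aa, mismatches


start_3_call = {"start": 3, "aa": "GRAGQVEFLKSLAKLLAA"}
start_18_call = {"start": 18, "aa": "MEFLKSLAKLLAA"}

in_frame, offset_aa, mismatches = compare_calls(start_3_call, start_18_call)
print(in_frame, offset_aa, mismatches)
```

The two calls are in the same frame, five codons apart, and differ only at the first shared residue (V in the longer call, M in the shorter one). That would be consistent with a GTG start codon at position 18, although the nucleotide sequence is not shown in this thread.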

hyoo commented 7 years ago

@mshukla1 I guess this is already addressed. Would you confirm?