ThomasDOtto / ratt

Rapid Annotation Transfer Tool
GNU General Public License v3.0

RATT transfers `order( , )` coordinates and loses a parenthesis #12

Open 0xaf1f opened 1 year ago

0xaf1f commented 1 year ago

The reference annotation (https://www.ncbi.nlm.nih.gov/nuccore/NC_000962.3) contains

FT   gene            3593369..3593852
FT                   /locus_tag="Rv3216"
FT                   /pseudogene="unknown"
FT                   /db_xref="GeneID:888845"
FT   misc_feature    order(3593369..3593437,3593439..3593852)
FT                   /locus_tag="Rv3216"
FT                   /note="acetyltransferase (2.3.1.-), contains GNAT domain
FT                   (GCN5-like N-acetyltransferase. See Vetting et al. 2005),
FT                   probably pseudogene as appears frameshifted due to 1bp
FT                   insertion at position 3593438. Frameshift present in all
FT                   sequenced tubercle bacilli. Start changed since first
FT                   submission, extended by 50aa."
FT                   /pseudogene="unknown"
FT                   /db_xref="PSEUDO:CCP46032.1"

which gets transferred to the input assembly as

FT   gene            complement(116773..117256)
FT                   /locus_tag="Rv3216"
FT                   /note="*pseudogene: unknown"
FT                   /db_xref="GeneID:888845"
FT                   /gene="Rv3216"
FT   misc_feature    complement(order(116773..117256)
FT                   /locus_tag="Rv3216"

and then parsing the annotation file fails because the misc_feature coordinate has an unbalanced parenthesis.
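
To locate every affected feature in a transferred file, a small stand-alone check along these lines could help. It is only a sketch, not part of RATT: the function names `paren_balance` and `find_unbalanced_features` are made up for illustration, and it assumes each location fits on a single `FT` line (locations wrapped across lines are not handled).

```python
import re

# A line that starts a feature: "FT", three spaces, the feature key, then the
# location. Qualifier and continuation lines have a space in column 6 instead
# of a key, so they do not match.
FEATURE_LINE = re.compile(r"^FT {3}(\S+)\s+(\S.*)$")


def paren_balance(location: str) -> int:
    """Return 0 if parentheses are balanced, otherwise a non-zero value.

    A positive result counts unclosed "("; a negative result means a ")"
    appeared before its matching "(".
    """
    depth = 0
    for ch in location:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return depth
    return depth


def find_unbalanced_features(embl_text: str):
    """Yield (line_number, feature_key, location) for unbalanced locations."""
    for number, line in enumerate(embl_text.splitlines(), start=1):
        match = FEATURE_LINE.match(line)
        if match:
            location = match.group(2).rstrip()
            if paren_balance(location) != 0:
                yield number, match.group(1), location


sample = """\
FT   gene            complement(116773..117256)
FT                   /locus_tag="Rv3216"
FT   misc_feature    complement(order(116773..117256)
FT                   /locus_tag="Rv3216"
"""

for hit in find_unbalanced_features(sample):
    print(hit)
# -> (3, 'misc_feature', 'complement(order(116773..117256)')
```

Run on the transferred output above, this flags only the `misc_feature` line; the `gene` line is balanced.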

ThomasDOtto commented 1 year ago

Hi,

The issue is the order tag. I think in the past I had a regular expression to replace it. Let me have a look at your fix.

Best, Thomas
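
For illustration, one way such a replacement could look is a preprocessing pass over the reference EMBL file that rewrites `order(` to `join(` in feature locations before the transfer. This is a guess at the idea, not the regular expression RATT had: `order_to_join` is a hypothetical helper, and it assumes RATT handles `join(...)` locations correctly. Note that in the INSDC feature table `join` and `order` are not equivalent (`order` implies nothing about whether the pieces are joined), so the rewrite changes the stated meaning of the feature.

```python
import re

# Feature-key lines only: "FT", three spaces, key, padding, then the location.
# Qualifier lines (e.g. /note="...") do not match, so their text is untouched.
FEATURE_LINE = re.compile(r"^(FT {3}\S+\s+)(\S.*)$")
ORDER_OPERATOR = re.compile(r"\border\(")


def order_to_join(embl_text: str) -> str:
    """Rewrite order(...) to join(...) in feature locations (hypothetical helper)."""
    out = []
    for line in embl_text.splitlines(keepends=True):
        match = FEATURE_LINE.match(line)
        if match:
            location = ORDER_OPERATOR.sub("join(", match.group(2))
            # line[match.end():] preserves the original line ending.
            line = match.group(1) + location + line[match.end():]
        out.append(line)
    return "".join(out)


reference = (
    "FT   misc_feature    order(3593369..3593437,3593439..3593852)\n"
    'FT                   /locus_tag="Rv3216"\n'
)
print(order_to_join(reference), end="")
```

Restricting the substitution to feature-key lines keeps any literal "order(" inside a `/note` qualifier unchanged.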


haessar commented 1 year ago

@0xaf1f Thomas refers to a fix, are you aware of this?

0xaf1f commented 1 year ago

No, I haven't gotten to it yet since I've been working on my own code. I think RATT would benefit from using BioPerl to read/write EMBL files (it might even take care of #10), but I haven't looked into how disruptive that would be versus updating a regex. I wouldn't suggest waiting for me when your focus is already here.

haessar commented 1 year ago

Using Bio::SeqIO (BioPerl) would allow me to essentially replace main.ratt.pl:300-500 or so with only a few lines of code, if I have it right. Will put it on the to-do list.

ThomasDOtto commented 1 year ago

But it requires installing BioPerl, which was annoying in the past…

Best, Thomas
