ThomasDOtto / ratt

Rapid Annotation Transfer Tool
GNU General Public License v3.0

features transferred with empty/invalid location #18

Open 0xaf1f opened 1 year ago

0xaf1f commented 1 year ago

For the last contig of a 7-contig monkeypox genome assembly, RATT produces a feature like this one at the top of the output:

FT   repeat_region   complement()
FT                   /note="ITR"
FT                   /rpt_type=inverted
FT                   /rpt_type=terminal
FT   gene            267..1580
FT                   /locus_tag="mpox_00004"
FT                   /gene="mpox_00004"
FT   CDS             267..1580
FT                   /locus_tag="mpox_00004"
FT                   /note="Ankyrin (CPXV-017) D1L"
FT                   /codon_start=1
FT                   /product="MPXVgp004"
FT                   /protein_id="URK20443.1"
FT                   /gene="mpox_00004"

The empty location complement() is invalid and causes parser errors when trying to read this EMBL file. Input and output files are attached.
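To spot such features before handing the file to a stricter parser, a small check along these lines may help. It is a minimal sketch, not part of RATT: the function name find_empty_locations is hypothetical, and it assumes the standard EMBL feature-table layout in which the feature key starts in column 6 and qualifier lines are indented further. Any feature whose location contains no digits at all is reported.

```python
import re

# Feature key line: "FT", exactly three spaces, the key (column 6), then the location.
# Qualifier lines are indented further, so they do not match this pattern.
FT_KEY_LINE = re.compile(r"^FT {3}(\S+)\s*(.*)$")


def find_empty_locations(lines):
    """Return (line_number, feature_key, location) for locations without coordinates."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = FT_KEY_LINE.match(line)
        if match is None:
            continue  # qualifier line or a non-FT line
        key, location = match.group(1), match.group(2).strip()
        # A usable location needs at least one coordinate, so no digits means empty.
        if not re.search(r"\d", location):
            hits.append((number, key, location))
    return hits
```

Calling it on the lines of the RATT output file (for example, the result of readlines() on the opened file) returns one tuple per suspicious feature, which can then be removed or fixed by hand.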

Command used (run from within the output directory):

ratt -p out -t Strain ../embls ../contig7.fasta

ratt-invalid-location.tar.gz

haessar commented 9 months ago

Sorry for not getting back to you sooner, @0xaf1f. I had a play around with your files and found that the feature causing the issue was:

FT   repeat_region   1..6439
FT                   /note="ITR"
FT                   /rpt_type=inverted
FT                   /rpt_type=terminal

In fact, removing these 4 lines from embls/mpox..ON563414.3.embl ensures that the "complement()" seen in the original output is no longer there.

The source code that generates that "complement()" is in ratt_correction.pm:2206-2218, where it tries to parse the coordinates from "FT repeat_region complement(-6437..1)" (see the intermediate file output/out.UnicyclerMpox.gnl_C_L_7.embl). I assume this happened during the Transfer step, where the coordinate range 1..6439 fell outside the bounds of the submitted sequence contig7.fasta (length 1667).

As a short-term solution, I'd recommend removing any such problem features from the input before running RATT. Longer term, there is clearly a bug in the code that parses these coordinates (you might have seen the 4 "Use of uninitialized value" errors in the RATT stdout during the Correction phase), but I still need to figure out how to fix it.
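For the short-term workaround, something like the following could strip a known problem feature from the reference EMBL file before running RATT. This is only a sketch under assumptions: drop_feature is a hypothetical helper, not part of RATT, and it matches a feature by its exact key and location text on the key line (so a location that wraps onto continuation lines would not be matched).

```python
import re

# Feature key line: "FT", exactly three spaces, the key (column 6), then the location.
FT_KEY_LINE = re.compile(r"^FT {3}(\S+)\s*(.*)$")


def drop_feature(lines, key, location):
    """Return lines without the feature (key line plus its qualifiers) that matches."""
    kept = []
    skipping = False
    for line in lines:
        match = FT_KEY_LINE.match(line)
        if match:
            # A new feature starts here; skip it only if key and location both match.
            skipping = match.group(1) == key and match.group(2).strip() == location
        elif not line.startswith("FT "):
            # Leaving the feature table ends any skipped feature.
            skipping = False
        if not skipping:
            kept.append(line)
    return kept
```

For the case in this issue, calling drop_feature(lines, "repeat_region", "1..6439") on the lines of the reference file removes the same 4 lines mentioned above and leaves every other feature untouched. Write the result to a copy of the file so the original annotation stays intact.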