NBISweden / EMBLmyGFF3

An efficient way to convert gff3 annotation files into EMBL format ready to submit.
GNU General Public License v3.0

ID not taken as locus tag #50

Closed rob123king closed 4 years ago

rob123king commented 4 years ago

Hi, I'm excited to see something that may make things easier; it's a bit of a nightmare otherwise. I have IDs in my GFF and expected them to be used as the locus tags, but they are not: sequential numbers are used instead. The ID is written to a note, but I think it would be better if the locus_tag simply became the ID, as I think that is its purpose.

I don't have gene names in the GFF, but ideally a tab-delimited file of gene names could be supplied to add them to the resulting EMBL file too, as this is the likely starting point when gene names are available. I would also like to parse the exon number after the ":" and add this in, although I don't think this is essential.

I'm still trying to work out the ENA format requirements for submission. I think I could just have a locus tag as the minimum feature, and that is what I'm working towards. The Webin validation tool complains about overlapping UTR and CDS features of two genes in the same direction. Could a correction step be added that cleaves the UTR and corrects the gene when it detects this? Otherwise I have to work out how to fix this and start again. I know of a script somewhere that will at least do the cleaving of the UTR.

Sorry for the several change requests; otherwise I'll try to make the changes myself when I have time, but that is harder when I don't know the code.

FT   mRNA            join(433449..433533,433946..434073,434612..434836,
FT                   435438..435904)
FT                   /locus_tag="SPEXI_LOCUS1"
FT                   /note="source:maker"
FT                   /note="ID:SPEXI_01T000001"
FT   CDS             join(433449..433533,433946..434073,434612..434836,
FT                   435438..435710)
FT                   /locus_tag="SPEXI_LOCUS1"
FT                   /note="source:maker"
FT                   /note="ID:SPEXI_01T000001:cds"
FT                   /transl_table=1
FT   exon            433449..433533
FT                   /locus_tag="SPEXI_LOCUS1"
FT                   /note="source:maker"
FT                   /note="ID:SPEXI_01T000001:1"
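
The ID notes in the excerpt above carry a suffix after the colon ("cds" for the CDS, "1" for the first exon). As a rough illustration of the exon-number parsing asked for here, the following Python sketch splits such an ID and returns the number when the suffix is numeric. The function names are hypothetical and are not part of EMBLmyGFF3; the sketch assumes IDs follow the `transcript:suffix` pattern seen in this excerpt.

```python
def split_feature_id(feature_id):
    """Split 'transcript:suffix' into (transcript, suffix); suffix is None if absent."""
    transcript, sep, suffix = feature_id.partition(":")
    return transcript, (suffix if sep else None)


def exon_number(feature_id):
    """Return the numeric suffix as an int, or None when there is no numeric suffix."""
    _, suffix = split_feature_id(feature_id)
    if suffix is not None and suffix.isdigit():
        return int(suffix)
    return None


# IDs taken from the notes in the excerpt above
print(split_feature_id("SPEXI_01T000001:cds"))
print(exon_number("SPEXI_01T000001:1"))
```

Non-numeric suffixes such as "cds" deliberately yield no exon number, so only exon features would pick one up.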

Juke34 commented 4 years ago

You can use the attribute of your choice as the locus_tag with the --use_attribute_value_as_locus_tag parameter. But be aware that when you submit the file to ENA, the locus_tag will be overwritten anyway.

As for the overlapping UTR and CDS features, you could fix them automatically using gff3_sp_fix_features_locations_duplicated.pl from AGAT.
Similarly, if you encounter problems with short introns, you can use gff3_sp_flag_short_introns.pl.
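
Before running a fix, it can help to see which genes are involved. The Python sketch below is only a coarse check, not what the AGAT scripts do: it lists pairs of `gene` features that overlap on the same sequence and strand in a GFF3 file, which is a rough proxy for the UTR/CDS overlaps the validator reports. The function name and the example gene IDs are made up for illustration, and attribute parsing is simplified (no URL-decoding).

```python
from collections import defaultdict


def same_strand_gene_overlaps(gff_lines):
    """Return (id_a, id_b) pairs of genes overlapping on the same sequence and strand."""
    genes = defaultdict(list)  # (seqid, strand) -> [(start, end, gene_id)]
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        # Simplified column-9 parsing: key=value pairs separated by ';'
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        genes[(cols[0], cols[6])].append(
            (int(cols[3]), int(cols[4]), attrs.get("ID", "?"))
        )

    overlaps = []
    for features in genes.values():
        features.sort()  # by start coordinate
        for i, (_, end_a, id_a) in enumerate(features):
            for start_b, _, id_b in features[i + 1:]:
                if start_b > end_a:
                    break  # sorted by start, so no later gene can overlap
                overlaps.append((id_a, id_b))
    return overlaps


# Hypothetical example records (tab-separated GFF3 columns)
gff = [
    "chr1\tmaker\tgene\t100\t500\t.\t+\t.\tID=geneA",
    "chr1\tmaker\tgene\t450\t900\t.\t+\t.\tID=geneB",
    "chr1\tmaker\tgene\t480\t700\t.\t-\t.\tID=geneC",
    "chr1\tmaker\tgene\t1000\t1200\t.\t+\t.\tID=geneD",
]
overlaps = same_strand_gene_overlaps(gff)
print(overlaps)
```

In this example only geneA and geneB are reported: geneC overlaps both but lies on the opposite strand, and geneD starts after every other gene ends.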

For the gene names, the easiest approach is to have them in the GFF file prior to conversion; they will then be included in the EMBL file automatically.
To load gene names from a blast output in your GFF file you can use agat_sp_manage_functional_annotation.pl.
If you want to add the gene names afterwards in the EMBL file you will have to code your own script (don't hesitate to share it then, I could include it here in case someone else would like to do the same).
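
For anyone going the post-processing route, here is a minimal Python sketch of what such a script might look like. It is not part of EMBLmyGFF3. It assumes a two-column tab-delimited file (`locus_tag<TAB>gene_name`), assumes EMBL lines are passed without trailing newlines, and inserts a /gene qualifier directly after every matching /locus_tag line. The gene name "abcA" is invented for the example, and the sketch does not check whether a /gene qualifier is already present.

```python
import csv
import io
import re

# Matches a qualifier line such as: FT                   /locus_tag="SPEXI_LOCUS1"
LOCUS_TAG_RE = re.compile(r'^FT\s+/locus_tag="([^"]+)"')


def load_gene_names(tsv_text):
    """Read 'locus_tag<TAB>gene_name' rows into a dict (assumed file layout)."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def add_gene_qualifiers(embl_lines, gene_names):
    """Yield EMBL lines, adding a /gene qualifier after each known /locus_tag."""
    for line in embl_lines:
        yield line
        match = LOCUS_TAG_RE.match(line)
        if match and match.group(1) in gene_names:
            prefix = line[: line.index("/")]  # reuse the line's own indentation
            yield f'{prefix}/gene="{gene_names[match.group(1)]}"'


embl = [
    "FT   exon            433449..433533",
    'FT                   /locus_tag="SPEXI_LOCUS1"',
    'FT                   /note="source:maker"',
]
names = load_gene_names("SPEXI_LOCUS1\tabcA\n")  # hypothetical gene name
result = list(add_gene_qualifiers(embl, names))
print("\n".join(result))
```

Because the qualifier is added wherever the locus tag appears, every feature sharing that tag (mRNA, CDS, exon) receives the same /gene value; restrict the match to particular feature keys if that is not wanted.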

Juke34 commented 4 years ago

We haven't heard back from you for a while, so I guess you found your way. I'm closing the issue, but feel free to re-open it if necessary.