NBISweden / EMBLmyGFF3

An efficient way to convert gff3 annotation files into EMBL format ready to submit.
GNU General Public License v3.0

Features locations are duplicated - consider merging qualifiers #33

Closed Anto007 closed 5 years ago

Anto007 commented 5 years ago

Thanks for this nice tool. I'm running into an issue when validating EMBL files generated with your tool. I'm using webin-cli-1.7.1, and it throws the error below when I try to validate/submit the EMBL files:

ERROR: "tRNA" Features locations are duplicated - consider merging qualifiers.

The command-line I used is this:

EMBLmyGFF3 test/6666666.419437.gff test/6666666.419437.contigs.fa -o test/test_new.embl

Any help in this regard would be highly appreciated

Juke34 commented 5 years ago

Thank you for reporting this problem. I have encountered a similar one. We could definitely improve the way duplicates are detected and removed by EMBLmyGFF3. We are about to release a new implementation of the tool and will look at this problem later on. In the meantime I would suggest removing one of the duplicates manually from the EMBL file or the GFF file. If there are many cases this could be tedious. Otherwise you can try to remove duplicated features from the GFF file using gxf_to_gff3.pl from the GAAS repository (use the -v option to see what has been fixed).
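Before removing duplicates by hand, it can help to list them. The following is a minimal sketch, not part of EMBLmyGFF3 or GAAS: the function name `find_duplicate_locations` is hypothetical, and it assumes a tab-separated GFF3 file in which a "duplicate" means two features sharing sequence ID, type, start, end and strand.

```python
from collections import defaultdict


def find_duplicate_locations(gff_lines):
    """Group GFF3 feature lines that share seqid, type, start, end and strand.

    Returns a dict mapping (seqid, type, start, end, strand) to the list of
    1-based line numbers where that location occurs more than once.
    """
    seen = defaultdict(list)
    for line_number, line in enumerate(gff_lines, start=1):
        line = line.rstrip("\n")
        # Skip blank lines, directives and comments.
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # not a well-formed GFF3 feature line
        seqid, _source, ftype, start, end, _score, strand, _phase, _attrs = columns
        seen[(seqid, ftype, int(start), int(end), strand)].append(line_number)
    return {key: lines for key, lines in seen.items() if len(lines) > 1}


if __name__ == "__main__":
    # Made-up example records, for illustration only.
    example = [
        "##gff-version 3",
        "contig1\tFIG\ttRNA\t100\t175\t.\t+\t.\tID=trna1",
        "contig1\tFIG\ttRNA\t100\t175\t.\t+\t.\tID=trna2",
        "contig1\tFIG\tCDS\t300\t900\t.\t-\t0\tID=cds1",
    ]
    for key, lines in find_duplicate_locations(example).items():
        print(key, "appears on lines", lines)
```

To use it on a real file, pass an open file handle, for instance `find_duplicate_locations(open("annotation.gff"))`, where the file name is a placeholder.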

Anto007 commented 5 years ago

Thank you for your prompt response Juke34- much appreciated!

Juke34 commented 5 years ago

Using gff3_sp_fix_features_locations_duplicated.pl from the GAAS repository before EMBLmyGFF3 should fix those cases for now.

Anto007 commented 5 years ago

Thanks, but neither gxf_to_gff3.pl nor gff3_sp_fix_features_locations_duplicated.pl helps with converting my GFF files (bacterial GFF files from the RAST annotation webserver) to submission-ready EMBL format with EMBLmyGFF3. Whether I run gxf_to_gff3.pl or gff3_sp_fix_features_locations_duplicated.pl on my GFF files, I get the following error when I run EMBLmyGFF3 on the new output GFF file:

Please enter new value: 11
Traceback (most recent call last):
  File "/usr/local/bin/EMBLmyGFF3", line 9, in <module>
    load_entry_point('EMBLmyGFF3==1.2.4', 'console_scripts', 'EMBLmyGFF3')()
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3-1.2.4-py2.7.egg/EMBLmyGFF3/EMBLmyGFF3.py", line 1369, in main
    writer.write_all( outfile )
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3-1.2.4-py2.7.egg/EMBLmyGFF3/EMBLmyGFF3.py", line 1187, in write_all
    out.write( self.FT() ) # FT - feature table data (>=2 per entry)
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3-1.2.4-py2.7.egg/EMBLmyGFF3/EMBLmyGFF3.py", line 737, in FT
    output += str(f)
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3-1.2.4-py2.7.egg/EMBLmyGFF3/modules/feature.py", line 179, in __repr__
    output += feature_l3._feature_as_EMBL(self.no_wrap_qualifier) if feature_l3.type not in feature_l3.remove else ""
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3-1.2.4-py2.7.egg/EMBLmyGFF3/modules/feature.py", line 207, in _feature_as_EMBL
    output += qualifier.embl_format(no_wrap_qualifier)
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3-1.2.4-py2.7.egg/EMBLmyGFF3/modules/qualifier.py", line 146, in embl_format
    output += multiline("FT", string, wrap=59, no_wrap = no_wrap)
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3-1.2.4-py2.7.egg/EMBLmyGFF3/modules/utilities.py", line 69, in multiline
    output,lastLine = _splitStringMultiline(output, data, wrap, splitW, split_char)
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3-1.2.4-py2.7.egg/EMBLmyGFF3/modules/utilities.py", line 126, in _splitStringMultiline
    splitLoc = _splitWordsMax(string,wrap,splitW,split_char)
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3-1.2.4-py2.7.egg/EMBLmyGFF3/modules/utilities.py", line 168, in _splitWordsMax
    newString += " "+words.pop(0)
IndexError: pop from empty list

This is the command line I used: EMBLmyGFF3 NGKPC421_Chromosome.gff NGKPC421_Chromosome.fa -o NGKPC421_Chromosome.embl

Not sure if I'm doing something incorrect here

Juke34 commented 5 years ago

Yes this error does not seem to be related to duplicated features. It's something else.

Anto007 commented 5 years ago

Many thanks for super-quick response.

Output from verbose mode:

12:15:18 DEBUG feature: Qualifier: standard_name - ['Arsenate reductase (EC 1.20.4.1)']
12:15:18 DEBUG feature: Adding value '['Arsenate reductase (EC 1.20.4.1)']' to qualifier 'standard_name'
12:15:18 DEBUG feature: val Arsenate reductase (EC 1.20.4.1) alredy exist (list case)
Traceback (most recent call last):
  File "/usr/local/bin/EMBLmyGFF3", line 9, in <module>
    load_entry_point('EMBLmyGFF3==1.2.4', 'console_scripts', 'EMBLmyGFF3')()
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3-1.2.4-py2.7.egg/EMBLmyGFF3/EMBLmyGFF3.py", line 1369, in main
    writer.write_all( outfile )
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3-1.2.4-py2.7.egg/EMBLmyGFF3/EMBLmyGFF3.py", line 1187, in write_all
    out.write( self.FT() ) # FT - feature table data (>=2 per entry)
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3-1.2.4-py2.7.egg/EMBLmyGFF3/EMBLmyGFF3.py", line 737, in FT
    output += str(f)
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3-1.2.4-py2.7.egg/EMBLmyGFF3/modules/feature.py", line 179, in __repr__
    output += feature_l3._feature_as_EMBL(self.no_wrap_qualifier) if feature_l3.type not in feature_l3.remove else ""
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3-1.2.4-py2.7.egg/EMBLmyGFF3/modules/feature.py", line 207, in _feature_as_EMBL
    output += qualifier.embl_format(no_wrap_qualifier)
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3-1.2.4-py2.7.egg/EMBLmyGFF3/modules/qualifier.py", line 146, in embl_format
    output += multiline("FT", string, wrap=59, no_wrap = no_wrap)
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3-1.2.4-py2.7.egg/EMBLmyGFF3/modules/utilities.py", line 69, in multiline
    output,lastLine = _splitStringMultiline(output, data, wrap, splitW, split_char)
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3-1.2.4-py2.7.egg/EMBLmyGFF3/modules/utilities.py", line 126, in _splitStringMultiline
    splitLoc = _splitWordsMax(string,wrap,splitW,split_char)
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3-1.2.4-py2.7.egg/EMBLmyGFF3/modules/utilities.py", line 168, in _splitWordsMax
    newString += " "+words.pop(0)
IndexError: pop from empty list

This is how the first few lines of my gff file look:

##gff-version 3

Klebsiella_quasipneumoniae_421 FIG gene 269 5255553 . + 2 ID=nbis_NEW-gene-1;Name=Phage protein
Klebsiella_quasipneumoniae_421 FIG mRNA 269 5255553 . + 2 ID=nbis_noL2id-cds-1;Parent=nbis_NEW-gene-1;Name=Phage protein
Klebsiella_quasipneumoniae_421 FIG exon 269 1119 . + . ID=nbis_NEW-exon-1;Parent=nbis_noL2id-cds-1;Name=Phage protein
Klebsiella_quasipneumoniae_421 FIG exon 1378 1494 . + . ID=nbis_NEW-exon-2;Parent=nbis_noL2id-cds-1;Name=Phage protein
Klebsiella_quasipneumoniae_421 FIG exon 1502 2062 . + . ID=nbis_NEW-exon-3;Parent=nbis_noL2id-cds-1;Name=Phage protein
Klebsiella_quasipneumoniae_421 FIG exon 2115 2330 . + . ID=nbis_NEW-exon-4;Parent=nbis_noL2id-cds-1;Name=Phage protein
Klebsiella_quasipneumoniae_421 FIG exon 2575 2694 . + . ID=nbis_NEW-exon-5;Parent=nbis_noL2id-cds-1;Name=Phage protein

Juke34 commented 5 years ago

I will need the whole record (gene, mRNA, exon, CDS ...) that contains 'Name=Arsenate reductase (EC 1.20.4.1)'. It sounds like a simple problem, but I want to be sure to fix it properly.

Anto007 commented 5 years ago

Thanks again; I found 3 records in my gff file and they are pasted here:

Klebsiella_quasipneumoniae_421 FIG CDS 23164 23586 . + 1 ID=fig|6666666.410353.peg.49;Parent=nbis_noL2id-cds-1;Name=Arsenate reductase (EC 1.20.4.1);Ontology_term=KEGG_ENZYME:1.20.4.1

Klebsiella_quasipneumoniae_421 FIG CDS 530960 531319 . - 2 ID=fig|6666666.410353.peg.527;Parent=nbis_noL2id-cds-1;Name=Arsenate reductase (EC 1.20.4.1);Ontology_term=KEGG_ENZYME:1.20.4.1

Klebsiella_quasipneumoniae_421 FIG CDS 5255320 5255553 . + 1 ID=fig|6666666.410353.peg.4975;Parent=nbis_noL2id-cds-1;Name=Arsenate reductase (EC 1.20.4.1);Ontology_term=KEGG_ENZYME:1.20.4.1

Juke34 commented 5 years ago

I cannot reproduce your problem. First, install the latest version: the traceback shows that you are running version 1.2.4, while you should use version 1.2.5.

pip uninstall EMBLmyGFF3
pip install git+https://github.com/NBISweden/EMBLmyGFF3.git

Secondly, I think there is a problem introduced by the use of gxf_to_gff3.pl and gff3_sp_fix_features_locations_duplicated.pl, because the output from RAST is specific. I guess they provide only CDS features and no parent/child feature relationships (please confirm, or copy-paste a piece of the original RAST output). It is for prokaryotes only, so you have one CDS feature per gene (no introns). As a result, our Perl scripts link all CDS features to a single huge gene feature. If you check your current GFF, it contains only one gene feature (or maybe one gene feature per sequence if you have several sequences). So the way to go is

Anto007 commented 5 years ago

Thank you but strangely, there's no --locus feature as you mention for gxf_to_gff3.pl

Juke34 commented 5 years ago

My mistake, the parameter is called '-c' or '--ct'. So use '-c ID'.

Anto007 commented 5 years ago

Can I send you the original gff file from RAST so that you can take a direct look? You can find it here: https://transferxl.com/08m6yZbxSVfvG

Anto007 commented 5 years ago

FYI:

gxf_to_gff3.pl -g NGKPC421_Chromosome.gff -o fixed_NGKPC421_Chromosome.gff -c ID
=>GFF version parser used: 3
4969 warning messages: WARNING gff3 reader level3: No Parent attribute found
336 warning messages: WARNING gff3 reader level2 : No Parent attribute found for GFF3 file parsed
Job done in 5 seconds

Juke34 commented 5 years ago

Yes, it looks like what I thought, and there are plenty of duplicates. So following the steps I gave you should work. However, I found a problem in gff3_sp_fix_features_locations_duplicated.pl: it always checked for a CDS, and you have tRNA and rRNA features that do not have one. It is now fixed, but you have to git pull the GAAS repo.
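A quick way to see whether all CDS features have been attached to a single parent, as described earlier in the thread, is to count children per Parent value. This is only an illustrative sketch: `parse_attributes` and `children_per_parent` are hypothetical helper names, the input is assumed to be tab-separated GFF3, and the example records are the three CDS lines quoted above rewritten with tabs.

```python
from collections import Counter


def parse_attributes(column9):
    """Turn 'ID=x;Parent=y' into a dict; values are kept as raw strings."""
    attributes = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attributes[key] = value
    return attributes


def children_per_parent(gff_lines, child_type="CDS"):
    """Count how many features of child_type point at each Parent ID.

    Features without a Parent attribute are counted under the key None.
    """
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != child_type:
            continue
        parents = parse_attributes(columns[8]).get("Parent")
        if parents is None:
            counts[None] += 1
        else:
            # GFF3 allows several comma-separated parents.
            for parent in parents.split(","):
                counts[parent] += 1
    return counts


if __name__ == "__main__":
    seqid = "Klebsiella_quasipneumoniae_421"
    shared = "Parent=nbis_noL2id-cds-1;Name=Arsenate reductase (EC 1.20.4.1)"
    records = [
        f"{seqid}\tFIG\tCDS\t23164\t23586\t.\t+\t1\tID=fig|6666666.410353.peg.49;{shared}",
        f"{seqid}\tFIG\tCDS\t530960\t531319\t.\t-\t2\tID=fig|6666666.410353.peg.527;{shared}",
        f"{seqid}\tFIG\tCDS\t5255320\t5255553\t.\t+\t1\tID=fig|6666666.410353.peg.4975;{shared}",
    ]
    for parent, count in children_per_parent(records).most_common():
        print(parent, count)
```

In a prokaryotic annotation, a parent with thousands of CDS children would suggest that unrelated genes have been merged under one feature.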

Anto007 commented 5 years ago

Thank you again. Do you mean I just need to delete gff3_sp_fix_features_locations_duplicated.pl on my system and replace it with the new one at GAAS/annotation/Tools/Util/gff/? By the way, why is there another gff3_sp_fix_features_locations_duplicated.pl at GAAS/annotation/Tools/bin/? I should use the one at GAAS/annotation/Tools/Util/gff/, right?

Juke34 commented 5 years ago

The one in bin is just a link to the one in GAAS/annotation/Tools/Util/gff/. No need to delete anything. This is the magic of GitHub: from anywhere in your copy of the repository, type the command git pull and it will automatically update the repo.

Anto007 commented 5 years ago

Sorry, git pull doesn't work on my workstation. I tried deleting the old .pl from GAAS/annotation/Tools/Util/gff/ and replacing it with the new .pl. I ran the .pl again but my problem doesn't go away; I still get the following when running EMBLmyGFF3:

19:36:23 DEBUG qualifier: No rule to format '"text"(single token) but not "<1-5 letters><5-9 digit integer>[.]"'
19:36:23 DEBUG qualifier: No rule to format '<integer; 1=universal table 1;2=non-universal table 2;...'
Traceback (most recent call last):
  File "/usr/local/bin/EMBLmyGFF3", line 9, in <module>
    load_entry_point('EMBLmyGFF3==1.2.5', 'console_scripts', 'EMBLmyGFF3')()
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3/EMBLmyGFF3.py", line 1383, in main
    writer.write_all( outfile )
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3/EMBLmyGFF3.py", line 1199, in write_all
    out.write( self.FT() ) # FT - feature table data (>=2 per entry)
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3/EMBLmyGFF3.py", line 748, in FT
    output += str(f)
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3/modules/feature.py", line 148, in __repr__
    output = self._feature_as_EMBL(self.no_wrap_qualifier) if self.type not in self.remove else ""
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3/modules/feature.py", line 207, in _feature_as_EMBL
    output += qualifier.embl_format(no_wrap_qualifier)
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3/modules/qualifier.py", line 146, in embl_format
    output += multiline("FT", string, wrap=59, no_wrap = no_wrap)
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3/modules/utilities.py", line 69, in multiline
    output,lastLine = _splitStringMultiline(output, data, wrap, splitW, split_char)
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3/modules/utilities.py", line 126, in _splitStringMultiline
    splitLoc = _splitWordsMax(string,wrap,splitW,split_char)
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3/modules/utilities.py", line 168, in _splitWordsMax
    newString += " "+words.pop(0)
IndexError: pop from empty list

Did you get to run exactly what you told me on the file that I shared with you? I'm curious to know if you don't see the same errors that I'm getting here

Juke34 commented 5 years ago

I don't have the FASTA file to try the conversion. You can send it to jacques.dainat@nbis.se if you want.

Juke34 commented 5 years ago

I found a way to reproduce your error in EMBLmyGFF3. I will come back to you when it is fixed.

Anto007 commented 5 years ago

Sorry, I missed that you would need the fasta file too. I've now e-mailed it to you. Many thanks!

Juke34 commented 5 years ago

The "IndexError: pop from empty list" error is fixed in version 1.2.6.

Anto007 commented 5 years ago

Thank you so much for your prompt responses and your hard work. I get no errors now when running version 1.2.6, but unfortunately the generated EMBL file doesn't appear to be ready for ENA submission. I get the errors below when I try to validate the newly generated EMBL file with the latest version of the Webin tool:

java -jar webin-cli-1.8.2.jar -userName Webin-XXXX -password XXXXXX -context sequence -manifest NGKPC421_Chromosome.manifest -inputDir for_validation/ -outputDir validation_output/ -validate

2019-04-21T09:18:01 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 9033 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:01 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 11707 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 15162 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 17899 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 19685 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 25247 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 26881 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 33284 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 33526 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 35888 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 36292 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 42528 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 45325 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 45885 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 48823 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 50979 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 53865 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 54441 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 58048 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 60122 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 61428 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 61722 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 63067 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 64822 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 65574 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 67892 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 71871 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 90279 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 90323 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 91539 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 101572 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 105639 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 113434 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 114004 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 116083 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 117914 of NGKPC421_Chromosome.embl.gz]

Any ideas here?

Juke34 commented 5 years ago

Yes, that error comes from the validator. To avoid it, you must not report the exons in the output (I explain how to do so in the README). They are useless anyway in your case: you don't have UTRs, so CDS = exon.

Anto007 commented 5 years ago

Thanks again. So sorry to bother you again but I'm a bit confused from reading your README. My translation_gff_feature_to_embl_feature.json file now looks like this:

"_comment":{"source description": "The type of the feature (previously called the \"method\"). This is constrained to be either a term from the Sequence Ontology or an SO accession number. The latter al$ "five_prime_UTR": { "target": "5'UTR" }, "three_prime_UTR": { "target": "3'UTR" } } }, "protein_hmm_match": { "target": "standard_name" }, "exon": { "remove": true }, "transcript": { "target": "mRNA" } }

and I get the below error when I run EMBLmyGFF3 final_NGKPC421_Chromosome.gff3 NGKPC421_Chromosome.fa -o NGKPC421_Chromosome.embl -vvv

Traceback (most recent call last):
  File "/usr/local/bin/EMBLmyGFF3", line 9, in <module>
    load_entry_point('EMBLmyGFF3==1.2.6', 'console_scripts', 'EMBLmyGFF3')()
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3/EMBLmyGFF3.py", line 1383, in main
    writer.write_all( outfile )
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3/EMBLmyGFF3.py", line 1199, in write_all
    out.write( self.FT() ) # FT - feature table data (>=2 per entry)
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3/EMBLmyGFF3.py", line 717, in FT
    force_uncomplete_features = self.force_uncomplete_features, uncompressed_log = self.uncompressed_log, no_wrap_qualifier = self.no_wrap_qualifier)
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3/modules/feature.py", line 102, in __init__
    self._load_feature_translations(Feature.DEFAULT_FEATURE_TRANSLATION_FILE)
  File "/usr/local/lib/python2.7/dist-packages/EMBLmyGFF3/modules/feature.py", line 325, in _load_feature_translations
    data = json.load( open("%s/%s" % (local_dir, filename)) )
  File "/usr/lib/python2.7/json/__init__.py", line 291, in load
    **kw)
  File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 367, in decode
    raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 10 column 2 - line 21 column 1 (char 439 - 575)

I think this is a simple problem and I'm most likely doing something silly here. I would like to express my heartfelt gratitude for your amazing patience and kind support in troubleshooting my issues

Juke34 commented 5 years ago

You have corrupted the translation_gff_feature_to_embl_feature.json file. In what you copied and pasted, you have ...The latter al$ instead of ...The latter alternative is distinguished using the syntax SO:000000. In either case, it must be sequence_feature (SO:0000110) or an is_a child of it."},.
The text itself is not important, but the "}, at the end of the line is mandatory for valid JSON. I guess you copied the file from inside a terminal, as the $ at the end suggests.
It also seems you have an extra } after "target": "3'UTR".
The safest approach is to run EMBLmyGFF3 --expose_translations to get the files into the current directory; you can then modify them properly.
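A hand-edited JSON file can be checked before running the conversion. The sketch below uses only Python's standard `json` module; the helper names `check_json_text` and `check_json_file` are hypothetical and are not part of EMBLmyGFF3.

```python
import json


def check_json_text(text):
    """Return None if text is valid JSON, else the parser's error message."""
    try:
        json.loads(text)
    except ValueError as error:  # json.JSONDecodeError is a ValueError subclass
        return str(error)
    return None


def check_json_file(path):
    """Same check for a file on disk."""
    with open(path) as handle:
        return check_json_text(handle.read())


if __name__ == "__main__":
    good = '{"exon": {"remove": true}, "transcript": {"target": "mRNA"}}'
    # One closing brace too many after the UTR entry, similar to the case above.
    bad = '{"three_prime_UTR": {"target": "3\'UTR"}}}, "exon": {"remove": true}}'
    print("good:", check_json_text(good))
    print("bad: ", check_json_text(bad))
```

An "Extra data" message, as in the traceback above, means the parser reached the end of a complete JSON value and then found more text, which is typical of a surplus closing brace.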

Anto007 commented 5 years ago

Many thanks for your response. The text 'The latter alternative is distinguished using the syntax SO:000000. In either case, it must be sequence_feature (SO:0000110) or an is_a child of it."},' is present intact in my JSON file. I noticed there were two extra } after "target": "3'UTR"; sorry, I missed this before; thanks again for pointing it out! I removed the extra braces and everything went well. I also managed to validate the newly generated EMBL file with ENA's webin-cli tool (v1.8.2) and received this message: "INFO : The submission has been validated successfully." I'm so grateful to you for your remarkable patience and prompt assistance in fixing this issue. I wish all bioinformatics tool developers in the world were as supportive as you! I'm happy to add that from now on, your nice tools will be added to our existing pipelines for processing all genome assemblies from all of our University's bioscience labs.

Juke34 commented 5 years ago

You're welcome. Glad to hear you plan to use it broadly at your University.

TGeneralovic commented 4 years ago

Hi Jacques (@Juke34 )

I wonder if you could aid us in a submission with a very similar issue.

We have an annotation generated by the BRAKER2 pipeline for submission to ENA. Initial attempts to validate the flatfile for ENA submission failed with multiple ERROR: "mRNA" Features locations are duplicated - consider merging qualifiers. [ line: 4559 of iArcPla.TrioY.embl.gz, line: 4555 of iArcPla.TrioY.embl.gz] hits (many lines of them). So after seeing this post I processed the annotated GTF using:

agat_sp_gxf_to_gff3.pl -g iArcPla.TrioY.gtf -o iArcPla.TrioY.gff3
agat_sp_fix_features_locations_duplicated.pl -f iArcPla.TrioY.gff3 -o iArcPla.TrioY.fix.gff3

EMBLmyGFF3 \
        iArcPla.TrioY.fix.gff3 genome.fa \
        --topology linear \
        --molecule_type "genomic DNA" \
        --transl_table 1  \
        --species 'Arctia plantaginis' \
        --taxonomy INV \
        --locus_tag APLA \
        --project_id PRJEB36595 \
        --author 'Eugenie C. Yen, Shane A. McCarthy, Juan A. Galarza, Tomas N. Generalovic, Sarah Pelan, Petr Nguyen, Joana I. Meier, Ian A. Warren, Johanna Mappes, Richard Durbin, Chris D. Jiggins' \
        --rt 'A haplotype-resolved, de novo genome assembly for the wood tiger moth (Arctia plantaginis) through trio binning' \
        -k 'wood tiger moth; Arctia plantaginis; Lepidoptera; genome assembly; trio binning; annotation; population genomics' \
        --rl 'bioRxiv' \
        -o iArcPla.TrioY.embl

After validation we still see some duplicated features that are not being filtered out:

ERROR: "exon" Features locations are duplicated - consider merging qualifiers. [ line: 2773530 of iArcPla.TrioY.embl.gz,  line: 2773503 of iArcPla.TrioY.embl.gz]
ERROR: "exon" Features locations are duplicated - consider merging qualifiers. [ line: 6347413 of iArcPla.TrioY.embl.gz,  line: 6347375 of iArcPla.TrioY.embl.gz]
ERROR: "exon" Features locations are duplicated - consider merging qualifiers. [ line: 6347417 of iArcPla.TrioY.embl.gz,  line: 6347379 of iArcPla.TrioY.embl.gz]

Any help with this error would be greatly appreciated.

Regards, Tom

Juke34 commented 4 years ago

It is not recommended to submit exons: in the EMBL format they are already described by the transcript locations. So just remove them:

EMBLmyGFF3 --expose_translations

then modify the following file translation_gff_feature_to_embl_feature.json in order to get

"exon": {
    "remove": true
}

then re-run the conversion, it should be fine now.
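The same edit can be made programmatically, which avoids brace mistakes from manual editing. This is a sketch under one assumption: that the exposed translation_gff_feature_to_embl_feature.json is a single JSON object keyed by GFF feature type, as the fragments quoted in this thread suggest. The function names are hypothetical.

```python
import json


def mark_feature_removed(translations, feature_type="exon"):
    """Set "remove": true for one feature type in an already-loaded mapping."""
    entry = translations.setdefault(feature_type, {})
    entry["remove"] = True
    return translations


def update_translation_file(path, feature_type="exon"):
    """Load the JSON file, flag the feature type for removal, and write it back."""
    with open(path) as handle:
        translations = json.load(handle)
    mark_feature_removed(translations, feature_type)
    with open(path, "w") as handle:
        json.dump(translations, handle, indent=2)
        handle.write("\n")


if __name__ == "__main__":
    # Small in-memory demonstration; the keys mirror those quoted in the thread.
    demo = {"exon": {"target": "exon"}, "transcript": {"target": "mRNA"}}
    print(json.dumps(mark_feature_removed(demo), indent=2))
```

On the real file, this would be called as `update_translation_file("translation_gff_feature_to_embl_feature.json")` from the directory where the exposed JSON files live.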

TGeneralovic commented 4 years ago

Thank you,

So I re-ran EMBLmyGFF3 with the --expose_translations parameter added to get the JSON files, added "remove": true to translation_gff_feature_to_embl_feature.json, and ran it again, but got the same result. Am I misunderstanding the instructions?

Should EMBLmyGFF3 --expose_translations be run as a separate command rather than as a parameter added to the EMBLmyGFF3 command?

Thanks in advance.

Juke34 commented 4 years ago

EMBLmyGFF3 will use the JSON files located in the working folder if there are any; by default there are none. Running EMBLmyGFF3 --expose_translations gives you local copies of these JSON files. So if you modified the JSON file(s) properly and re-run the normal command, EMBLmyGFF3 should use the locally modified JSON file(s). Check your EMBL file. Do you see any exon features remaining? If so, something went wrong (did you re-run in the same folder? Did you remove the local JSON file? Did you save the change?).
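Checking the EMBL file for leftover exon features can be done by counting feature keys. The sketch below assumes the standard EMBL flat-file layout, where a feature line starts with "FT" followed by three spaces and the feature key, while qualifier and continuation lines have spaces in the key columns. The function name `count_feature_keys` is hypothetical.

```python
from collections import Counter


def count_feature_keys(embl_lines):
    """Count feature keys (gene, CDS, exon, ...) in the FT lines of an EMBL file."""
    counts = Counter()
    for line in embl_lines:
        if not line.startswith("FT   "):
            continue
        # Columns 6-20 hold the feature key; they are blank on qualifier lines.
        key_field = line[5:20]
        if key_field[:1].strip():
            counts[key_field.split()[0]] += 1
    return counts


if __name__ == "__main__":
    # Made-up feature table fragment, for illustration only.
    example = [
        "FT   gene            269..1119",
        "FT                   /locus_tag=\"APLA_1\"",
        "FT   CDS             269..1119",
        "FT                   /transl_table=1",
        "FT   exon            269..1119",
        "XX",
    ]
    counts = count_feature_keys(example)
    print(dict(counts))
    print("exon features remaining:", counts["exon"])
```

For a gzip-compressed flatfile, the lines could be read with `gzip.open(path, "rt")` from the standard library instead of `open(path)`.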

TGeneralovic commented 4 years ago

Great! The problem was that I had added the parameter to the EMBLmyGFF3 command instead of running it independently, so the JSON was recreated with the defaults. We now have a validated flatfile. Thank you for the swift responses.

Jeepee8820 commented 4 years ago

Hi Jacques,

I am sorry to re-open this thread but I also have a very similar issue with annotations generated by Prokka v1.14.5.

Initial attempts to validate the flatfile for ENA submission failed with multiple ERROR: "misc_RNA" Features locations are duplicated - consider merging qualifiers. hits (many lines). This apparently also concerns the features "tRNA" and "rRNA".

These features appear in the following order in the generated EMBL flatfile: 1) gene 2) misc_RNA OR tRNA OR rRNA 3) mRNA 4) misc_RNA OR tRNA OR rRNA

Where 2) and 4) are thus duplicated for an unknown reason which seems to be due to EMBLmyGFF3. The gff appears to be correct and has them in the order 1), 3) and 4). Using the agat_sp_fix_features_locations_duplicated.pl does not solve this duplication problem. Modifying the file translation_gff_feature_to_embl_feature.json to set these three features to "remove": true does solve the validation problem but it would be great to maintain these in the annotations.

Do you see any fix to avoid these features to be duplicated so that they are reported in the correct order, i.e. 1), 3) and 4)?

Thanks in advance for your support.

Juke34 commented 4 years ago

Could you provide a sample of the GFF file (top lines with several CDS features) before and after agat_sp_fix_features_locations_duplicated.pl?

Jeepee8820 commented 4 years ago

For sure! This would be a sample of the original GFF file with the first five CDS:

##gff-version 3
##sequence-region gnl|ZW|CFBP2044_1 1 5079002
gnl|ZW|CFBP2044_1   prokka  gene    1   1329    .   +   .   ID=CFBP2044_00010_gene;Name=dnaA;gene=dnaA;locus_tag=CFBP2044_00010
gnl|ZW|CFBP2044_1   prokka  mRNA    1   1329    .   +   .   ID=CFBP2044_00010_mRNA;Name=dnaA;gene=dnaA;locus_tag=CFBP2044_00010
gnl|ZW|CFBP2044_1   Prodigal:002006 CDS 1   1329    .   +   0   ID=CFBP2044_00010;Parent=CFBP2044_00010_gene,CFBP2044_00010_mRNA;Name=dnaA;db_xref=COG:COG0593;gene=dnaA;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P03004;locus_tag=CFBP2044_00010;product=Chromosomal replication initiator protein DnaA;protein_id=gnl|ZW|CFBP2044_00010
gnl|ZW|CFBP2044_1   prokka  gene    1607    2707    .   +   .   ID=CFBP2044_00020_gene;Name=dnaN;gene=dnaN;locus_tag=CFBP2044_00020
gnl|ZW|CFBP2044_1   prokka  mRNA    1607    2707    .   +   .   ID=CFBP2044_00020_mRNA;Name=dnaN;gene=dnaN;locus_tag=CFBP2044_00020
gnl|ZW|CFBP2044_1   Prodigal:002006 CDS 1607    2707    .   +   0   ID=CFBP2044_00020;Parent=CFBP2044_00020_gene,CFBP2044_00020_mRNA;Name=dnaN;db_xref=COG:COG0592;gene=dnaN;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:Q9I7C4;locus_tag=CFBP2044_00020;product=Beta sliding clamp;protein_id=gnl|ZW|CFBP2044_00020
gnl|ZW|CFBP2044_1   prokka  gene    3433    4539    .   +   .   ID=CFBP2044_00030_gene;Name=recF;gene=recF;locus_tag=CFBP2044_00030
gnl|ZW|CFBP2044_1   prokka  mRNA    3433    4539    .   +   .   ID=CFBP2044_00030_mRNA;Name=recF;gene=recF;locus_tag=CFBP2044_00030
gnl|ZW|CFBP2044_1   Prodigal:002006 CDS 3433    4539    .   +   0   ID=CFBP2044_00030;Parent=CFBP2044_00030_gene,CFBP2044_00030_mRNA;Name=recF;db_xref=COG:COG1195;gene=recF;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0A7H0;locus_tag=CFBP2044_00030;product=DNA replication and repair protein RecF;protein_id=gnl|ZW|CFBP2044_00030
gnl|ZW|CFBP2044_1   prokka  gene    4654    7098    .   +   .   ID=CFBP2044_00040_gene;Name=gyrB;gene=gyrB;locus_tag=CFBP2044_00040
gnl|ZW|CFBP2044_1   prokka  mRNA    4654    7098    .   +   .   ID=CFBP2044_00040_mRNA;Name=gyrB;gene=gyrB;locus_tag=CFBP2044_00040
gnl|ZW|CFBP2044_1   Prodigal:002006 CDS 4654    7098    .   +   0   ID=CFBP2044_00040;Parent=CFBP2044_00040_gene,CFBP2044_00040_mRNA;eC_number=5.6.2.2;Name=gyrB;db_xref=COG:COG0187;gene=gyrB;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0A2I3;locus_tag=CFBP2044_00040;product=DNA gyrase subunit B;protein_id=gnl|ZW|CFBP2044_00040
gnl|ZW|CFBP2044_1   prokka  gene    7167    8003    .   +   .   ID=CFBP2044_00050_gene;locus_tag=CFBP2044_00050
gnl|ZW|CFBP2044_1   prokka  mRNA    7167    8003    .   +   .   ID=CFBP2044_00050_mRNA;locus_tag=CFBP2044_00050
gnl|ZW|CFBP2044_1   Prodigal:002006 CDS 7167    8003    .   +   0   ID=CFBP2044_00050;Parent=CFBP2044_00050_gene,CFBP2044_00050_mRNA;inference=ab initio prediction:Prodigal:002006;locus_tag=CFBP2044_00050;product=hypothetical protein;protein_id=gnl|ZW|CFBP2044_00050

And here an example of the first problematic feature:

gnl|ZW|CFBP2044_1   prokka  gene    47283   47358   .   -   .   ID=CFBP2044_00380_gene;locus_tag=CFBP2044_00380
gnl|ZW|CFBP2044_1   prokka  mRNA    47283   47358   .   -   .   ID=CFBP2044_00380_mRNA;locus_tag=CFBP2044_00380
gnl|ZW|CFBP2044_1   Infernal:001001 misc_RNA    47283   47358   68.8    -   .   ID=CFBP2044_00380;Parent=CFBP2044_00380_gene,CFBP2044_00380_mRNA;Note="Xanthomonas sRNA sX9";accession=RF02228;inference=COORDINATES:profile:Infernal:001001;locus_tag=CFBP2044_00380;product=sX9

This would be a sample of the GFF file with the first five CDS after agat_sp_fix_features_locations_duplicated.pl:

##gff-version 3
##sequence-region gnl|ZW|CFBP2044_1 1 5079002
gnl|ZW|CFBP2044_1   prokka  gene    1   1329    .   +   .   ID=nbis-gene-1;Name=dnaA;gene=dnaA;locus_tag=CFBP2044_00010
gnl|ZW|CFBP2044_1   prokka  mRNA    1   1329    .   +   .   ID=CFBP2044_00010_gene;Parent=nbis-gene-1;Name=dnaA;gene=dnaA;locus_tag=CFBP2044_00010
gnl|ZW|CFBP2044_1   Prodigal:002006 exon    1   1329    .   +   .   ID=nbis-exon-7325;Parent=CFBP2044_00010_gene;Name=dnaA;db_xref=COG:COG0593;gene=dnaA;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P03004;locus_tag=CFBP2044_00010;product=Chromosomal replication initiator protein DnaA;protein_id=gnl|ZW|CFBP2044_00010
gnl|ZW|CFBP2044_1   Prodigal:002006 CDS 1   1329    .   +   0   ID=CFBP2044_00010;Parent=CFBP2044_00010_gene;Name=dnaA;db_xref=COG:COG0593;gene=dnaA;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P03004;locus_tag=CFBP2044_00010;product=Chromosomal replication initiator protein DnaA;protein_id=gnl|ZW|CFBP2044_00010
gnl|ZW|CFBP2044_1   prokka  gene    1607    2707    .   +   .   ID=CFBP2044_00020_gene;Name=dnaN;gene=dnaN;locus_tag=CFBP2044_00020
gnl|ZW|CFBP2044_1   prokka  mRNA    1607    2707    .   +   .   ID=CFBP2044_00020_mRNA;Parent=CFBP2044_00020_gene;Name=dnaN;gene=dnaN;locus_tag=CFBP2044_00020
gnl|ZW|CFBP2044_1   prokka  exon    1607    2707    .   +   .   ID=nbis-exon-2;Parent=CFBP2044_00020_mRNA;Name=dnaN;gene=dnaN;locus_tag=CFBP2044_00020
gnl|ZW|CFBP2044_1   Prodigal:002006 CDS 1607    2707    .   +   0   ID=CFBP2044_00020;Parent=CFBP2044_00020_mRNA;Name=dnaN;db_xref=COG:COG0592;gene=dnaN;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:Q9I7C4;locus_tag=CFBP2044_00020;product=Beta sliding clamp;protein_id=gnl|ZW|CFBP2044_00020
gnl|ZW|CFBP2044_1   prokka  gene    3433    4539    .   +   .   ID=nbis-gene-3;Name=recF;gene=recF;locus_tag=CFBP2044_00030
gnl|ZW|CFBP2044_1   prokka  mRNA    3433    4539    .   +   .   ID=CFBP2044_00030_gene;Parent=nbis-gene-3;Name=recF;gene=recF;locus_tag=CFBP2044_00030
gnl|ZW|CFBP2044_1   Prodigal:002006 exon    3433    4539    .   +   .   ID=nbis-exon-6708;Parent=CFBP2044_00030_gene;Name=recF;db_xref=COG:COG1195;gene=recF;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0A7H0;locus_tag=CFBP2044_00030;product=DNA replication and repair protein RecF;protein_id=gnl|ZW|CFBP2044_00030
gnl|ZW|CFBP2044_1   Prodigal:002006 CDS 3433    4539    .   +   0   ID=CFBP2044_00030;Parent=CFBP2044_00030_gene;Name=recF;db_xref=COG:COG1195;gene=recF;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0A7H0;locus_tag=CFBP2044_00030;product=DNA replication and repair protein RecF;protein_id=gnl|ZW|CFBP2044_00030
gnl|ZW|CFBP2044_1   prokka  gene    4654    7098    .   +   .   ID=nbis-gene-4;Name=gyrB;gene=gyrB;locus_tag=CFBP2044_00040
gnl|ZW|CFBP2044_1   prokka  mRNA    4654    7098    .   +   .   ID=CFBP2044_00040_gene;Parent=nbis-gene-4;Name=gyrB;gene=gyrB;locus_tag=CFBP2044_00040
gnl|ZW|CFBP2044_1   Prodigal:002006 exon    4654    7098    .   +   .   ID=nbis-exon-7118;Parent=CFBP2044_00040_gene;Name=gyrB;db_xref=COG:COG0187;eC_number=5.6.2.2;gene=gyrB;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0A2I3;locus_tag=CFBP2044_00040;product=DNA gyrase subunit B;protein_id=gnl|ZW|CFBP2044_00040
gnl|ZW|CFBP2044_1   Prodigal:002006 CDS 4654    7098    .   +   0   ID=CFBP2044_00040;Parent=CFBP2044_00040_gene;Name=gyrB;db_xref=COG:COG0187;eC_number=5.6.2.2;gene=gyrB;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0A2I3;locus_tag=CFBP2044_00040;product=DNA gyrase subunit B;protein_id=gnl|ZW|CFBP2044_00040
gnl|ZW|CFBP2044_1   prokka  gene    7167    8003    .   +   .   ID=nbis-gene-5;locus_tag=CFBP2044_00050
gnl|ZW|CFBP2044_1   prokka  mRNA    7167    8003    .   +   .   ID=CFBP2044_00050_gene;Parent=nbis-gene-5;locus_tag=CFBP2044_00050
gnl|ZW|CFBP2044_1   Prodigal:002006 exon    7167    8003    .   +   .   ID=nbis-exon-7342;Parent=CFBP2044_00050_gene;inference=ab initio prediction:Prodigal:002006;locus_tag=CFBP2044_00050;product=hypothetical protein;protein_id=gnl|ZW|CFBP2044_00050
gnl|ZW|CFBP2044_1   Prodigal:002006 CDS 7167    8003    .   +   0   ID=CFBP2044_00050;Parent=CFBP2044_00050_gene;inference=ab initio prediction:Prodigal:002006;locus_tag=CFBP2044_00050;product=hypothetical protein;protein_id=gnl|ZW|CFBP2044_00050

And here an example of the first problematic feature:

gnl|ZW|CFBP2044_1   prokka  gene    47283   47358   .   -   .   ID=CFBP2044_00380_gene;locus_tag=CFBP2044_00380
gnl|ZW|CFBP2044_1   Infernal:001001 misc_RNA    47283   47358   68.8    -   .   ID=CFBP2044_00380;Parent=CFBP2044_00380_gene,CFBP2044_00380_mRNA;Note="Xanthomonas sRNA sX9";accession=RF02228;inference=COORDINATES:profile:Infernal:001001;locus_tag=CFBP2044_00380;product=sX9
gnl|ZW|CFBP2044_1   prokka  mRNA    47283   47358   .   -   .   ID=CFBP2044_00380_mRNA;Parent=CFBP2044_00380_gene;locus_tag=CFBP2044_00380
gnl|ZW|CFBP2044_1   prokka  exon    47283   47358   .   -   .   ID=nbis-exon-38;Parent=CFBP2044_00380_mRNA;locus_tag=CFBP2044_00380

Thanks in advance

Juke34 commented 4 years ago

OK, the syntax of the two samples looks fine. I was wondering whether you ran Prokka with extra parameters that could have messed up the GFF.

You show 2 problematic features (I would call them records: several features linked to each other). In the second case the problem is quite clear: a misc_RNA and an mRNA with the same location are held by the same gene. Only one of them should be kept. We should check whether it is AGAT that introduces the mRNA. Could you show all features at this location 47283 47358 before running AGAT?
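
To make that check easier on a large file, here is a minimal Python sketch (not part of AGAT or EMBLmyGFF3) that lists locations carrying more than one transcript-level feature. The function name `find_duplicated_locations` and the `IGNORED_TYPES` set are assumptions made for this illustration; it also assumes a tab-separated GFF3 and would flag genuine isoforms that share the same span.

```python
from collections import defaultdict

# Feature types that normally share a span with their transcript and are
# therefore ignored here (an assumption made for this sketch).
IGNORED_TYPES = {"gene", "exon", "CDS"}


def find_duplicated_locations(gff_lines):
    """Return {(seqid, start, end, strand): [types]} for every location that
    carries more than one transcript-level feature."""
    by_location = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a GFF3 feature line
        seqid, ftype, start, end, strand = cols[0], cols[2], cols[3], cols[4], cols[6]
        if ftype in IGNORED_TYPES:
            continue
        by_location[(seqid, int(start), int(end), strand)].append(ftype)
    return {loc: types for loc, types in by_location.items() if len(types) > 1}
```

Calling it on the lines of the GFF before running AGAT (for example `find_duplicated_locations(open("annotation.gff"))`, where the file name is a placeholder) shows whether the mRNA was already there.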

Jeepee8820 commented 4 years ago

Sorry, the first problematic record I provided for the original file had the wrong locus tag. I have now edited my post to show all features at this location. The mRNA is already introduced by Prokka. Thanks

Juke34 commented 4 years ago

Interesting, so the problem comes from Prokka: it should not provide both a misc_RNA and an mRNA for the same location, only one of them. You will have to edit the file manually to remove the duplicated features.
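
If there are many such records, the manual edit could be scripted. The following is only a rough sketch under stated assumptions, not an AGAT or EMBLmyGFF3 feature: the function name `drop_redundant_mrna` and the default list of non-coding RNA types are hypothetical. It drops every mRNA line that has exactly the same location as a non-coding RNA feature, removes that mRNA ID from `Parent` attributes, and drops features (such as exons) left without any parent. Deeper descendants are not followed.

```python
def drop_redundant_mrna(gff_lines,
                        ncrna_types=("misc_RNA", "tRNA", "rRNA", "ncRNA", "tmRNA")):
    """Return the GFF3 lines without mRNA features that duplicate the exact
    location of a non-coding RNA feature."""
    lines = [line.rstrip("\n") for line in gff_lines]

    def parse(line):
        cols = line.split("\t")
        return cols if len(cols) == 9 and not line.startswith("#") else None

    def location(cols):
        return (cols[0], cols[3], cols[4], cols[6])

    # Pass 1: locations occupied by a non-coding RNA feature.
    ncrna_locations = set()
    for line in lines:
        cols = parse(line)
        if cols and cols[2] in ncrna_types:
            ncrna_locations.add(location(cols))

    # Pass 2: mark the mRNA lines sitting on those locations.
    dropped_rows, dropped_ids = set(), set()
    for index, line in enumerate(lines):
        cols = parse(line)
        if cols and cols[2] == "mRNA" and location(cols) in ncrna_locations:
            dropped_rows.add(index)
            for field in cols[8].split(";"):
                if field.startswith("ID="):
                    dropped_ids.add(field[3:])

    # Pass 3: rebuild the file, fixing Parent attributes on the way.
    kept = []
    for index, line in enumerate(lines):
        cols = parse(line)
        if cols is None:
            kept.append(line)  # directives, comments, blank lines
            continue
        if index in dropped_rows:
            continue
        fields = cols[8].split(";")
        orphaned = False
        for j, field in enumerate(fields):
            if field.startswith("Parent="):
                parents = field[len("Parent="):].split(",")
                remaining = [p for p in parents if p not in dropped_ids]
                if not remaining:
                    orphaned = True  # every parent was removed
                fields[j] = "Parent=" + ",".join(remaining)
        if orphaned:
            continue
        cols[8] = ";".join(fields)
        kept.append("\t".join(cols))
    return kept
```

Work on a copy of the GFF and compare the result with the original before converting it again.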

Jeepee8820 commented 4 years ago

OK thanks a lot for the speedy reply. I reported this problem on the Prokka Github https://github.com/tseemann/prokka/issues/506

VDaric commented 6 months ago

I have the same issue. When I try to validate the EMBL file produced by EMBLmyGFF3, I get a lot of those duplicated locations. Sorry for a question that is not directly related to EMBLmyGFF3: I'm desperately searching for the gxf_to_gff3.pl script and can't seem to find it in the GAAS toolkit....

Thank you for your understanding and assistance.

Juke34 commented 6 months ago

Right, it is now called agat_convert_sp_gxf2gxf.pl and is available in AGAT. The other script is called agat_sp_fix_features_locations_duplicated.pl and is available in AGAT too.

VDaric commented 6 months ago

I believe I managed to encounter all the problems described in this thread! :)

I resolved them all thanks to your answers @Juke34. Thanks a lot !

I still have one issue. I have several lines that look like this one:

ERROR: Invalid amino acid "l" in translation. [ line: 1630948 of My_Org_noexons.embl.gz]

I am using webin-cli-6.9.0 to validate the embl file I generated with EMBLmyGFF3.

I am trying to validate an ascidian genome assembly with several scaffolds (assembled with HiFiasm). One of them is mitochondrial (assembled with mitoHiFi). In the EMBL file I have manually changed

/transl_table=1 to /transl_table=13

for all mitochondrial CDS.

For example, I have this in the EMBL file:

FT   gene            130..471
FT                   /locus_tag="CVLEPA_LOCUS11895"
FT                   /note="ID:Cvlepa.mt.BANY2021.S176.g032142"
FT                   /note="source:mitoHiFi"
FT                   /standard_name="ND3"
FT   mRNA            130..471
FT                   /locus_tag="CVLEPA_LOCUS11895"
FT                   /note="ID:Cvlepa.mt.BANY2021.S176.g032142.01.t"
FT                   /note="source:mitoHiFi"
FT   CDS             <130..>471
FT                   /codon_start=2
FT                   /locus_tag="CVLEPA_LOCUS11895"
FT                   /note="ID:Cvlepa.mt.BANY2021.S176.g032142.01.p.cds"
FT                   /note="source:mitoHiFi"
FT                   /standard_name="ND3"
FT                   /transl_table=13
FT                   /translation="length.114"

(transl_table=13 is "The Ascidian Mitochondrial Code".)

I still get that Invalid amino acid "l" message. I am not sure everything I am doing is legitimate.

Any suggestion will be greatly appreciated.

Thanks !

Juke34 commented 6 months ago

/translation="length.114" is wrong: the value is supposed to be an amino acid string, so the validator reads the "l" of "length" as an invalid residue. I guess the problem is already in your GFF file. The translation is not mandatory, so you can remove it when processing with EMBLmyGFF3, or you can populate it with the --translate option of EMBLmyGFF3.
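
To spot such values before submitting, a small Python sketch like the one below can scan the EMBL file. It is an illustration only, not part of EMBLmyGFF3 or webin-cli: the function name `find_bad_translations` is hypothetical, and the accepted alphabet (20 standard residues plus the IUPAC codes B, J, O, U, X, Z, uppercase only) is an assumption about what the validator allows.

```python
# Standard 20 amino acids plus IUPAC ambiguity/special codes.
# Assumption: the validator accepts exactly these uppercase letters.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY" "BJOUXZ")


def find_bad_translations(embl_lines):
    """Return (line_number, value) for every /translation qualifier whose
    value is empty or contains characters outside VALID_RESIDUES."""
    bad = []
    current = None   # text collected for the qualifier being read
    start_line = 0   # line where that qualifier started
    for number, line in enumerate(embl_lines, start=1):
        text = line.rstrip("\n")
        # Feature-table lines start with "FT"; the content follows the key.
        content = text[5:].strip() if text.startswith("FT") else ""
        if current is None:
            if not content.startswith('/translation="'):
                continue
            current = content[len('/translation="'):]
            start_line = number
        else:
            current += content  # continuation line of a wrapped value
        if current.endswith('"'):
            value = current[:-1].replace(" ", "")
            if not value or set(value) - VALID_RESIDUES:
                bad.append((start_line, value))
            current = None
    return bad
```

Each reported line number points to the start of a /translation qualifier that would need to be removed or regenerated.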

VDaric commented 6 months ago

Oh ! Thanks !

I should have read the EMBL file specs :/

Juke34 commented 6 months ago

No worries, the specs are particularly verbose...