Closed Anto007 closed 5 years ago
Thank you for reporting this problem. I have encountered a similar problem. We could definitely improve the way duplicates are detected and removed by EMBLmyGFF3. We are about to release a new implementation of the tool, and we will look at this problem later on. In the meantime I would suggest removing one of the duplicates manually from the EMBL file or the gff file. If there are many cases this could be tedious. Otherwise you can try to remove duplicated features from the gff file using gxf_to_gff3.pl from the GAAS repository (use the -v option to see what has been fixed).
Thank you for your prompt response Juke34- much appreciated!
Using gff3_sp_fix_features_locations_duplicated.pl from the GAAS repository before EMBLmyGFF3 should fix those cases for now.
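To see which features the validator is likely to complain about, one can list the feature locations that occur more than once in the gff file. The following is only a minimal Python sketch of that idea, not the logic of the GAAS script; the function name `find_duplicated_locations` and the example lines are hypothetical, and it assumes tab-separated, 9-column GFF3 lines.

```python
from collections import defaultdict

def find_duplicated_locations(gff_lines):
    """Group GFF3 features by (seqid, type, start, end, strand) and
    return only the groups that contain more than one feature."""
    seen = defaultdict(list)
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        seqid, _source, ftype, start, end, _score, strand, _phase, attrs = cols
        seen[(seqid, ftype, int(start), int(end), strand)].append(attrs)
    return {key: vals for key, vals in seen.items() if len(vals) > 1}

# Hypothetical example lines (not taken from a real annotation).
example = [
    "##gff-version 3",
    "chr1\tsrc\ttRNA\t100\t175\t.\t+\t.\tID=t1",
    "chr1\tsrc\ttRNA\t100\t175\t.\t+\t.\tID=t2",
    "chr1\tsrc\tCDS\t300\t600\t.\t+\t0\tID=c1",
]

for key, attrs in find_duplicated_locations(example).items():
    print(key, attrs)
```

On a real file, pass the open file handle instead of the example list.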
Thanks but neither gxf_to_gff3.pl nor gff3_sp_fix_features_locations_duplicated.pl helps with converting my gff files (bacterial gff file from RAST annotation webserver) to submission-ready embl format on EMBLmyGFF3. Irrespective of whether I do gxf_to_gff3.pl or gff3_sp_fix_features_locations_duplicated.pl on my gff files, I get the following error when I try to run the new output gff file on EMBLmyGFF3:
Please enter new value: 11
Traceback (most recent call last):
File "/usr/local/bin/EMBLmyGFF3", line 9, in
This is the command line I used: EMBLmyGFF3 NGKPC421_Chromosome.gff NGKPC421_Chromosome.fa -o NGKPC421_Chromosome.embl
Not sure if I'm doing something incorrect here
Yes this error does not seem to be related to duplicated features. It's something else.
Could you provide the output with verbose option: EMBLmyGFF3 NGKPC421_Chromosome.gff NGKPC421_Chromosome.fa -o NGKPC421_Chromosome.embl -vvv
Could you provide few lines of your gff3 file too.
Update to version 1.2.5
Many thanks for super-quick response.
Output from verbose mode:
12:15:18 DEBUG feature: Qualifier: standard_name - ['Arsenate reductase (EC 1.20.4.1)']
12:15:18 DEBUG feature: Adding value '['Arsenate reductase (EC 1.20.4.1)']' to qualifier 'standard_name'
12:15:18 DEBUG feature: val Arsenate reductase (EC 1.20.4.1) alredy exist (list case)
Traceback (most recent call last):
File "/usr/local/bin/EMBLmyGFF3", line 9, in
This is how the first few lines of my gff file look:
Klebsiella_quasipneumoniae_421 FIG gene 269 5255553 . + 2 ID=nbis_NEW-gene-1;Name=Phage protein
Klebsiella_quasipneumoniae_421 FIG mRNA 269 5255553 . + 2 ID=nbis_noL2id-cds-1;Parent=nbis_NEW-gene-1;Name=Phage protein
Klebsiella_quasipneumoniae_421 FIG exon 269 1119 . + . ID=nbis_NEW-exon-1;Parent=nbis_noL2id-cds-1;Name=Phage protein
Klebsiella_quasipneumoniae_421 FIG exon 1378 1494 . + . ID=nbis_NEW-exon-2;Parent=nbis_noL2id-cds-1;Name=Phage protein
Klebsiella_quasipneumoniae_421 FIG exon 1502 2062 . + . ID=nbis_NEW-exon-3;Parent=nbis_noL2id-cds-1;Name=Phage protein
Klebsiella_quasipneumoniae_421 FIG exon 2115 2330 . + . ID=nbis_NEW-exon-4;Parent=nbis_noL2id-cds-1;Name=Phage protein
Klebsiella_quasipneumoniae_421 FIG exon 2575 2694 . + . ID=nbis_NEW-exon-5;Parent=nbis_noL2id-cds-1;Name=Phage protein
I will need the whole record (gene, mRNA, exon, CDS ...) that contains 'Name=Arsenate reductase (EC 1.20.4.1)'. It sounds like a simple problem but I want to be sure to fix it properly.
Thanks again; I found 3 records in my gff file and they are pasted here:
Klebsiella_quasipneumoniae_421 FIG CDS 23164 23586 . + 1 ID=fig|6666666.410353.peg.49;Parent=nbis_noL2id-cds-1;Name=Arsenate reductase (EC 1.20.4.1);Ontology_term=KEGG_ENZYME:1.20.4.1
Klebsiella_quasipneumoniae_421 FIG CDS 530960 531319 . - 2 ID=fig|6666666.410353.peg.527;Parent=nbis_noL2id-cds-1;Name=Arsenate reductase (EC 1.20.4.1);Ontology_term=KEGG_ENZYME:1.20.4.1
Klebsiella_quasipneumoniae_421 FIG CDS 5255320 5255553 . + 1 ID=fig|6666666.410353.peg.4975;Parent=nbis_noL2id-cds-1;Name=Arsenate reductase (EC 1.20.4.1);Ontology_term=KEGG_ENZYME:1.20.4.1
I cannot reproduce your problem. First install the latest version: the output shows that you are running version 1.2.4, while you should use version 1.2.5.
pip uninstall EMBLmyGFF3
pip install git+https://github.com/NBISweden/EMBLmyGFF3.git
Secondly, I think there is a problem introduced by the use of gxf_to_gff3.pl and gff3_sp_fix_features_locations_duplicated.pl, because the output from RAST is specific. I guess they provide only CDS and no parent/child feature relationships (please confirm, or copy-paste a piece of the original RAST output). It is a prokaryote annotation, so you have one CDS feature per gene (no introns). Used this way, our perl script links all CDS to a single huge gene feature. If you check your current gff, it contains only one gene feature (or maybe one gene feature per sequence if you have several sequences). So the way to go is:
1) Run gxf_to_gff3.pl but with the option --locus ID
2) Check with gff3_sp_statistics.pl that you have as many genes as CDS.
3) Run gff3_sp_fix_features_locations_duplicated.pl
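The "as many genes as CDS" check can also be done by simply counting feature types in the gff file. This is a rough Python sketch and not a replacement for gff3_sp_statistics.pl; the function name `count_feature_types` and the example lines are hypothetical, and tab-separated 9-column GFF3 lines are assumed.

```python
from collections import Counter

def count_feature_types(gff_lines):
    """Count how many times each feature type (column 3) occurs."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9:
            counts[cols[2]] += 1
    return counts

# Hypothetical example of the broken case: one huge gene holding every CDS.
example = [
    "seq1\tFIG\tgene\t269\t5000\t.\t+\t.\tID=gene-1",
    "seq1\tFIG\tCDS\t269\t1119\t.\t+\t0\tID=cds-1;Parent=gene-1",
    "seq1\tFIG\tCDS\t1378\t1494\t.\t+\t0\tID=cds-2;Parent=gene-1",
    "seq1\tFIG\tCDS\t1502\t2062\t.\t+\t0\tID=cds-3;Parent=gene-1",
]

counts = count_feature_types(example)
if counts["gene"] != counts["CDS"]:
    print("gene/CDS mismatch:", counts["gene"], "gene(s) vs", counts["CDS"], "CDS")
```

For a prokaryotic annotation without introns, a mismatch like this suggests the CDS were grouped under the wrong parent.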
Thank you but strangely, there's no --locus feature as you mention for gxf_to_gff3.pl
My mistake, the parameter is called '-c' or '--ct'. So use '-c ID'.
Can I send you the original gff file from RAST so that you can take a direct look? You can find it here: https://transferxl.com/08m6yZbxSVfvG
FYI: gxf_to_gff3.pl -g NGKPC421_Chromosome.gff -o fixed_NGKPC421_Chromosome.gff -c ID
=> GFF version parser used: 3
4969 warning messages: WARNING gff3 reader level3: No Parent attribute found
336 warning messages: WARNING gff3 reader level2 : No Parent attribute found for
GFF3 file parsed
Job done in 5 seconds
Yes, it looks like what I thought, and there are plenty of duplicates. So following the steps I described should be fine.
However, I found a problem in gff3_sp_fix_features_locations_duplicated.pl: it was always checking CDS, and you have tRNA and rRNA features that do not have CDS. It is now fixed, but you have to git pull the GAAS repo.
Thank you again. Do you mean I just need to delete gff3_sp_fix_features_locations_duplicated.pl on my system and replace it with the new one at GAAS/annotation/Tools/Util/gff/? By the way, why is there yet another gff3_sp_fix_features_locations_duplicated.pl at GAAS/annotation/Tools/bin/? I use the one at GAAS/annotation/Tools/Util/gff/, right?
In the bin it's just a link to the one in GAAS/annotation/Tools/Util/gff/.
No need to delete anything. This is the magic of github. I meant that from anywhere in your copy of the repository you type the command git pull, and it will automatically update the repo.
Sorry git pull doesn't work on my workstation. I tried deleting the old .pl from GAAS/annotation/Tools/Util/gff/ and replacing it with the new .pl Ran again the .pl but my problem doesn't go away; I still get the following while running EMBLmyGFF3
19:36:23 DEBUG qualifier: No rule to format '"text"(single token) but not "<1-5 letters><5-9 digit integer>[.
Did you get to run exactly what you told me on the file that I shared with you? I'm curious to know if you don't see the same errors that I'm getting here
I don't have the fasta to try the conversion. You can send it at jacques.dainat@nbis.se if you want.
I found a way to reproduce your error in EMBLmyGFF3. I will come back to you when it is fixed.
Sorry, I missed that you would need the fasta file too. I've now e-mailed it to you. Many thanks!
IndexError: pop from empty list
error fixed in version 1.2.6.
Thank you so much for your prompt responses and your hard work. I'm getting no errors now upon running version 1.2.6 but the generated embl file unfortunately doesn't appear ready for ENA submission. I get the below error when I try to validate the newly generated embl file on the latest version of webin tool: java -jar webin-cli-1.8.2.jar -userName Webin-XXXX -password XXXXXX -context sequence -manifest NGKPC421_Chromosome.manifest -inputDir for_validation/ -outputDir validation_output/ -validate
2019-04-21T09:18:01 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 9033 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:01 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 11707 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 15162 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 17899 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 19685 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 25247 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 26881 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 33284 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 33526 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 35888 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 36292 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 42528 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 45325 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 45885 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 48823 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 50979 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 53865 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 54441 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 58048 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 60122 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 61428 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 61722 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 63067 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 64822 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 65574 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 67892 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 71871 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 90279 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 90323 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 91539 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 101572 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 105639 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 113434 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 114004 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 116083 of NGKPC421_Chromosome.embl.gz]
2019-04-21T09:18:02 ERROR: Abutting features cannot be adjacent between neighbouring exons. [ line: 117914 of NGKPC421_Chromosome.embl.gz]
Any ideas here?
Yes, it's a problem with the validator. To avoid it, you must not report the exons in the output (the README explains how to do so). They are useless anyway in your case, because each CDS is identical to its exon: you don't have UTRs.
Thanks again. So sorry to bother you again but I'm a bit confused from reading your README. My translation_gff_feature_to_embl_feature.json file now looks like this:
"_comment":{"source description": "The type of the feature (previously called the \"method\"). This is constrained to be either a term from the Sequence Ontology or an SO accession number. The latter al$ "five_prime_UTR": { "target": "5'UTR" }, "three_prime_UTR": { "target": "3'UTR" } } }, "protein_hmm_match": { "target": "standard_name" }, "exon": { "remove": true }, "transcript": { "target": "mRNA" } }
and I get the below error when I run EMBLmyGFF3 final_NGKPC421_Chromosome.gff3 NGKPC421_Chromosome.fa -o NGKPC421_Chromosome.embl -vvv
Traceback (most recent call last):
File "/usr/local/bin/EMBLmyGFF3", line 9, in
I think this is a simple problem and I'm most likely doing something silly here. I would like to express my heartfelt gratitude for your amazing patience and kind support in troubleshooting my issues
You have corrupted the translation_gff_feature_to_embl_feature.json file. Indeed, as we can see in what you have copied and pasted, you have
...The latter al$
instead of
...The latter alternative is distinguished using the syntax SO:000000. In either case, it must be sequence_feature (SO:0000110) or an is_a child of it."},
The text is not important, but the "}, at the end of the line is mandatory for a non-corrupted json format. I guess you copied the file from inside a terminal, as the $ at the end suggests.
It also seems you have an extra } after "target": "3'UTR".
The safest approach is to call EMBLmyGFF3 --expose_translations to get the files into the current directory; then you can modify them properly.
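A quick way to catch this kind of corruption before running the conversion is to let a JSON parser read the edited file. Below is a minimal Python sketch using only the standard `json` module; the helper name `check_json` and the two example strings are hypothetical, with the second one imitating the extra } described above.

```python
import json

def check_json(text):
    """Return None if text parses as JSON, otherwise a short
    description of where the first syntax error was found."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None

# A well-formed fragment, and one with an extra closing brace.
good = '{"exon": {"remove": true}, "transcript": {"target": "mRNA"}}'
bad = '{"three_prime_UTR": {"target": "3\'UTR"}}}, "exon": {"remove": true}}'

for name, text in (("good", good), ("bad", bad)):
    problem = check_json(text)
    print(name, "->", "valid JSON" if problem is None else problem)
```

To check the real file, read translation_gff_feature_to_embl_feature.json from the working directory and pass its content to the function.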
Many thanks for your response. The text "The latter alternative is distinguished using the syntax SO:000000. In either case, it must be sequence_feature (SO:0000110) or an is_a child of it."}, is present intact in my json file. I noticed there were two extra } after "target": "3'UTR"; sorry, I missed this before, and thanks again for pointing it out! I removed the extra braces and everything went well. I also managed to validate the newly generated embl file with ENA's webin-cli tool (v1.8.2) and received this message: "INFO : The submission has been validated successfully." I'm so grateful to you for your remarkable patience and prompt assistance in fixing this issue. I wish all bioinformatics tool developers in the world were as supportive as you! I'm happy to add that from now on your nice tools will be added to our existing pipelines for processing all genome assemblies from all of our University's bioscience labs.
You're welcome. Glad to hear you plan to use it broadly at your University.
Hi Jacques (@Juke34 )
I wonder if you could aid us in a submission with a very similar issue.
We have an annotation generated by the BRAKER2 pipeline for submission to ENA. Initial attempts to validate the flatfile for ENA submission failed with multiple
ERROR: "mRNA" Features locations are duplicated - consider merging qualifiers. [ line: 4559 of iArcPla.TrioY.embl.gz, line: 4555 of iArcPla.TrioY.embl.gz]
hits (many lines of them). So after seeing this post I processed the annotated gtf using:
agat_sp_gxf_to_gff3.pl -g iArcPla.TrioY.gtf -o iArcPla.TrioY.gff3
agat_sp_fix_features_locations_duplicated.pl -f iArcPla.TrioY.gff3 -o iArcPla.TrioY.fix.gff3
EMBLmyGFF3 \
iArcPla.TrioY.fix.gff3 genome.fa \
--topology linear \
--molecule_type "genomic DNA" \
--transl_table 1 \
--species 'Arctia plantaginis' \
--taxonomy INV \
--locus_tag APLA \
--project_id PRJEB36595 \
--author 'Eugenie C. Yen, Shane A. McCarthy, Juan A. Galarza, Tomas N. Generalovic, Sarah Pelan, Petr Nguyen, Joana I. Meier, Ian A. Warren, Johanna Mappes, Richard Durbin, Chris D. Jiggins' \
--rt 'A haplotype-resolved, de novo genome assembly for the wood tiger moth (Arctia plantaginis) through trio binning' \
-k 'wood tiger moth; Arctia plantaginis; Lepidoptera; genome assembly; trio binning; annotation; population genomics' \
--rl 'bioRxiv' \
-o iArcPla.TrioY.embl
After validation we still see some duplicated features that are not being filtered out:
ERROR: "exon" Features locations are duplicated - consider merging qualifiers. [ line: 2773530 of iArcPla.TrioY.embl.gz, line: 2773503 of iArcPla.TrioY.embl.gz]
ERROR: "exon" Features locations are duplicated - consider merging qualifiers. [ line: 6347413 of iArcPla.TrioY.embl.gz, line: 6347375 of iArcPla.TrioY.embl.gz]
ERROR: "exon" Features locations are duplicated - consider merging qualifiers. [ line: 6347417 of iArcPla.TrioY.embl.gz, line: 6347379 of iArcPla.TrioY.embl.gz]
Any help with this error would be greatly appreciated.
Regards, Tom
It is not recommended to submit exons: in the EMBL format they are already described by the location of the transcripts. So just remove them. Run
EMBLmyGFF3 --expose_translations
then modify the file translation_gff_feature_to_embl_feature.json in order to get
"exon": {
"remove": true
}
then re-run the conversion, it should be fine now.
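If you prefer not to edit the json by hand, the same change can be scripted. This is only a sketch in Python: it assumes the feature types are top-level keys of the json object, the helper name `mark_feature_removed` is hypothetical, and the "exon" entry in the example mapping is a placeholder rather than the real default content of the file.

```python
import json

def mark_feature_removed(mapping, feature_type):
    """Return a copy of the mapping in which feature_type is set to
    {"remove": true}, leaving every other entry untouched."""
    updated = dict(mapping)
    updated[feature_type] = {"remove": True}
    return updated

# Placeholder mapping; the "transcript" entry mirrors the one quoted earlier
# in this thread, the "exon" entry is invented for illustration.
mapping = json.loads('{"exon": {"target": "exon"}, "transcript": {"target": "mRNA"}}')

updated = mark_feature_removed(mapping, "exon")
print(json.dumps(updated, indent=2))
```

In practice you would load translation_gff_feature_to_embl_feature.json from the working directory with `json.load`, apply the function, and write the result back with `json.dump` before re-running the conversion in the same folder.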
Thank you,
So I re-ran EMBLmyGFF3 with the added --expose_translations parameter to get the jsons, added "remove": true to translation_gff_feature_to_embl_feature.json, and repeated the run, which gave the same result. Am I misunderstanding the instructions?
Should EMBLmyGFF3 --expose_translations be run as a single command, and not just as an added parameter to the EMBLmyGFF3 call?
Thanks in advance.
By default EMBLmyGFF3 will use the json files located in the working folder, and by default there are none. Running EMBLmyGFF3 --expose_translations copies these json files locally. So if you modified the json file(s) properly and re-run the normal command, EMBLmyGFF3 should use the locally modified json file(s).
Check your embl file. Do you see any exon feature remaining? If yes, something went wrong (did you re-run in the same folder? Did you remove the local json file? Did you save the change?).
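Checking for remaining exon features can be done with a few lines of Python. This is a rough sketch that assumes feature headers in the flatfile look like `FT   exon   269..1119` (the FT tag, the feature key, then the location); the function name `count_embl_features` is hypothetical, and the exon line in the example is invented, while the other lines follow the layout of records quoted in this thread.

```python
def count_embl_features(embl_lines, feature_key):
    """Count feature-table header lines whose key equals feature_key."""
    count = 0
    for line in embl_lines:
        parts = line.split()
        # Header lines have at least three fields: FT, key, location.
        # Qualifier lines such as 'FT /locus_tag="..."' do not match the key.
        if len(parts) >= 3 and parts[0] == "FT" and parts[1] == feature_key:
            count += 1
    return count

example = [
    "FT   gene            130..471",
    'FT                   /locus_tag="CVLEPA_LOCUS11895"',
    "FT   mRNA            130..471",
    "FT   exon            130..471",
    "FT   CDS             <130..>471",
    "FT                   /codon_start=2",
]

remaining = count_embl_features(example, "exon")
print("exon features remaining:", remaining)
```

A wrapped qualifier value whose continuation line happens to start with the word exon would be miscounted, so treat a non-zero result as a hint to inspect the file rather than as proof.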
Great! The problem was that I added the parameter to the EMBLmyGFF3 conversion command instead of running it independently, so the json was recreated with the defaults. We now have a validated flatfile. Thank you for the swift responses.
Hi Jacques,
I am sorry to re-open this thread but I also have a very similar issue with annotations generated by Prokka v1.14.5.
Initial attempts to validate the flatfile for ENA submission failed with multiple
ERROR: "misc_RNA" Features locations are duplicated - consider merging qualifiers.
hits (many lines). This apparently also concerns the features "tRNA" and "rRNA".
These features appear in the following order in the generated EMBL flatfile: 1) gene 2) misc_RNA OR tRNA OR rRNA 3) mRNA 4) misc_RNA OR tRNA OR rRNA
So 2) and 4) are duplicated for an unknown reason, which seems to be due to EMBLmyGFF3. The gff appears to be correct and has them in the order 1), 3) and 4).
Using agat_sp_fix_features_locations_duplicated.pl does not solve this duplication problem.
Modifying the file translation_gff_feature_to_embl_feature.json to set these three features to "remove": true does solve the validation problem, but it would be great to keep them in the annotation.
Do you see any fix that prevents these features from being duplicated, so that they are reported in the correct order, i.e. 1), 3) and 4)?
Thanks in advance for your support.
Could you provide a sample of the GFF file (top lines with several CDS features) before and after agat_sp_fix_features_locations_duplicated.pl?
For sure! This would be a sample of the original GFF file with the first five CDS:
##gff-version 3
##sequence-region gnl|ZW|CFBP2044_1 1 5079002
gnl|ZW|CFBP2044_1 prokka gene 1 1329 . + . ID=CFBP2044_00010_gene;Name=dnaA;gene=dnaA;locus_tag=CFBP2044_00010
gnl|ZW|CFBP2044_1 prokka mRNA 1 1329 . + . ID=CFBP2044_00010_mRNA;Name=dnaA;gene=dnaA;locus_tag=CFBP2044_00010
gnl|ZW|CFBP2044_1 Prodigal:002006 CDS 1 1329 . + 0 ID=CFBP2044_00010;Parent=CFBP2044_00010_gene,CFBP2044_00010_mRNA;Name=dnaA;db_xref=COG:COG0593;gene=dnaA;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P03004;locus_tag=CFBP2044_00010;product=Chromosomal replication initiator protein DnaA;protein_id=gnl|ZW|CFBP2044_00010
gnl|ZW|CFBP2044_1 prokka gene 1607 2707 . + . ID=CFBP2044_00020_gene;Name=dnaN;gene=dnaN;locus_tag=CFBP2044_00020
gnl|ZW|CFBP2044_1 prokka mRNA 1607 2707 . + . ID=CFBP2044_00020_mRNA;Name=dnaN;gene=dnaN;locus_tag=CFBP2044_00020
gnl|ZW|CFBP2044_1 Prodigal:002006 CDS 1607 2707 . + 0 ID=CFBP2044_00020;Parent=CFBP2044_00020_gene,CFBP2044_00020_mRNA;Name=dnaN;db_xref=COG:COG0592;gene=dnaN;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:Q9I7C4;locus_tag=CFBP2044_00020;product=Beta sliding clamp;protein_id=gnl|ZW|CFBP2044_00020
gnl|ZW|CFBP2044_1 prokka gene 3433 4539 . + . ID=CFBP2044_00030_gene;Name=recF;gene=recF;locus_tag=CFBP2044_00030
gnl|ZW|CFBP2044_1 prokka mRNA 3433 4539 . + . ID=CFBP2044_00030_mRNA;Name=recF;gene=recF;locus_tag=CFBP2044_00030
gnl|ZW|CFBP2044_1 Prodigal:002006 CDS 3433 4539 . + 0 ID=CFBP2044_00030;Parent=CFBP2044_00030_gene,CFBP2044_00030_mRNA;Name=recF;db_xref=COG:COG1195;gene=recF;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0A7H0;locus_tag=CFBP2044_00030;product=DNA replication and repair protein RecF;protein_id=gnl|ZW|CFBP2044_00030
gnl|ZW|CFBP2044_1 prokka gene 4654 7098 . + . ID=CFBP2044_00040_gene;Name=gyrB;gene=gyrB;locus_tag=CFBP2044_00040
gnl|ZW|CFBP2044_1 prokka mRNA 4654 7098 . + . ID=CFBP2044_00040_mRNA;Name=gyrB;gene=gyrB;locus_tag=CFBP2044_00040
gnl|ZW|CFBP2044_1 Prodigal:002006 CDS 4654 7098 . + 0 ID=CFBP2044_00040;Parent=CFBP2044_00040_gene,CFBP2044_00040_mRNA;eC_number=5.6.2.2;Name=gyrB;db_xref=COG:COG0187;gene=gyrB;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0A2I3;locus_tag=CFBP2044_00040;product=DNA gyrase subunit B;protein_id=gnl|ZW|CFBP2044_00040
gnl|ZW|CFBP2044_1 prokka gene 7167 8003 . + . ID=CFBP2044_00050_gene;locus_tag=CFBP2044_00050
gnl|ZW|CFBP2044_1 prokka mRNA 7167 8003 . + . ID=CFBP2044_00050_mRNA;locus_tag=CFBP2044_00050
gnl|ZW|CFBP2044_1 Prodigal:002006 CDS 7167 8003 . + 0 ID=CFBP2044_00050;Parent=CFBP2044_00050_gene,CFBP2044_00050_mRNA;inference=ab initio prediction:Prodigal:002006;locus_tag=CFBP2044_00050;product=hypothetical protein;protein_id=gnl|ZW|CFBP2044_00050
And here an example of the first problematic feature:
gnl|ZW|CFBP2044_1 prokka gene 47283 47358 . - . ID=CFBP2044_00380_gene;locus_tag=CFBP2044_00380
gnl|ZW|CFBP2044_1 prokka mRNA 47283 47358 . - . ID=CFBP2044_00380_mRNA;locus_tag=CFBP2044_00380
gnl|ZW|CFBP2044_1 Infernal:001001 misc_RNA 47283 47358 68.8 - . ID=CFBP2044_00380;Parent=CFBP2044_00380_gene,CFBP2044_00380_mRNA;Note="Xanthomonas sRNA sX9";accession=RF02228;inference=COORDINATES:profile:Infernal:001001;locus_tag=CFBP2044_00380;product=sX9
This would be a sample of the GFF file with the first five CDS after agat_sp_fix_features_locations_duplicated.pl:
##gff-version 3
##sequence-region gnl|ZW|CFBP2044_1 1 5079002
gnl|ZW|CFBP2044_1 prokka gene 1 1329 . + . ID=nbis-gene-1;Name=dnaA;gene=dnaA;locus_tag=CFBP2044_00010
gnl|ZW|CFBP2044_1 prokka mRNA 1 1329 . + . ID=CFBP2044_00010_gene;Parent=nbis-gene-1;Name=dnaA;gene=dnaA;locus_tag=CFBP2044_00010
gnl|ZW|CFBP2044_1 Prodigal:002006 exon 1 1329 . + . ID=nbis-exon-7325;Parent=CFBP2044_00010_gene;Name=dnaA;db_xref=COG:COG0593;gene=dnaA;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P03004;locus_tag=CFBP2044_00010;product=Chromosomal replication initiator protein DnaA;protein_id=gnl|ZW|CFBP2044_00010
gnl|ZW|CFBP2044_1 Prodigal:002006 CDS 1 1329 . + 0 ID=CFBP2044_00010;Parent=CFBP2044_00010_gene;Name=dnaA;db_xref=COG:COG0593;gene=dnaA;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P03004;locus_tag=CFBP2044_00010;product=Chromosomal replication initiator protein DnaA;protein_id=gnl|ZW|CFBP2044_00010
gnl|ZW|CFBP2044_1 prokka gene 1607 2707 . + . ID=CFBP2044_00020_gene;Name=dnaN;gene=dnaN;locus_tag=CFBP2044_00020
gnl|ZW|CFBP2044_1 prokka mRNA 1607 2707 . + . ID=CFBP2044_00020_mRNA;Parent=CFBP2044_00020_gene;Name=dnaN;gene=dnaN;locus_tag=CFBP2044_00020
gnl|ZW|CFBP2044_1 prokka exon 1607 2707 . + . ID=nbis-exon-2;Parent=CFBP2044_00020_mRNA;Name=dnaN;gene=dnaN;locus_tag=CFBP2044_00020
gnl|ZW|CFBP2044_1 Prodigal:002006 CDS 1607 2707 . + 0 ID=CFBP2044_00020;Parent=CFBP2044_00020_mRNA;Name=dnaN;db_xref=COG:COG0592;gene=dnaN;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:Q9I7C4;locus_tag=CFBP2044_00020;product=Beta sliding clamp;protein_id=gnl|ZW|CFBP2044_00020
gnl|ZW|CFBP2044_1 prokka gene 3433 4539 . + . ID=nbis-gene-3;Name=recF;gene=recF;locus_tag=CFBP2044_00030
gnl|ZW|CFBP2044_1 prokka mRNA 3433 4539 . + . ID=CFBP2044_00030_gene;Parent=nbis-gene-3;Name=recF;gene=recF;locus_tag=CFBP2044_00030
gnl|ZW|CFBP2044_1 Prodigal:002006 exon 3433 4539 . + . ID=nbis-exon-6708;Parent=CFBP2044_00030_gene;Name=recF;db_xref=COG:COG1195;gene=recF;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0A7H0;locus_tag=CFBP2044_00030;product=DNA replication and repair protein RecF;protein_id=gnl|ZW|CFBP2044_00030
gnl|ZW|CFBP2044_1 Prodigal:002006 CDS 3433 4539 . + 0 ID=CFBP2044_00030;Parent=CFBP2044_00030_gene;Name=recF;db_xref=COG:COG1195;gene=recF;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0A7H0;locus_tag=CFBP2044_00030;product=DNA replication and repair protein RecF;protein_id=gnl|ZW|CFBP2044_00030
gnl|ZW|CFBP2044_1 prokka gene 4654 7098 . + . ID=nbis-gene-4;Name=gyrB;gene=gyrB;locus_tag=CFBP2044_00040
gnl|ZW|CFBP2044_1 prokka mRNA 4654 7098 . + . ID=CFBP2044_00040_gene;Parent=nbis-gene-4;Name=gyrB;gene=gyrB;locus_tag=CFBP2044_00040
gnl|ZW|CFBP2044_1 Prodigal:002006 exon 4654 7098 . + . ID=nbis-exon-7118;Parent=CFBP2044_00040_gene;Name=gyrB;db_xref=COG:COG0187;eC_number=5.6.2.2;gene=gyrB;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0A2I3;locus_tag=CFBP2044_00040;product=DNA gyrase subunit B;protein_id=gnl|ZW|CFBP2044_00040
gnl|ZW|CFBP2044_1 Prodigal:002006 CDS 4654 7098 . + 0 ID=CFBP2044_00040;Parent=CFBP2044_00040_gene;Name=gyrB;db_xref=COG:COG0187;eC_number=5.6.2.2;gene=gyrB;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0A2I3;locus_tag=CFBP2044_00040;product=DNA gyrase subunit B;protein_id=gnl|ZW|CFBP2044_00040
gnl|ZW|CFBP2044_1 prokka gene 7167 8003 . + . ID=nbis-gene-5;locus_tag=CFBP2044_00050
gnl|ZW|CFBP2044_1 prokka mRNA 7167 8003 . + . ID=CFBP2044_00050_gene;Parent=nbis-gene-5;locus_tag=CFBP2044_00050
gnl|ZW|CFBP2044_1 Prodigal:002006 exon 7167 8003 . + . ID=nbis-exon-7342;Parent=CFBP2044_00050_gene;inference=ab initio prediction:Prodigal:002006;locus_tag=CFBP2044_00050;product=hypothetical protein;protein_id=gnl|ZW|CFBP2044_00050
gnl|ZW|CFBP2044_1 Prodigal:002006 CDS 7167 8003 . + 0 ID=CFBP2044_00050;Parent=CFBP2044_00050_gene;inference=ab initio prediction:Prodigal:002006;locus_tag=CFBP2044_00050;product=hypothetical protein;protein_id=gnl|ZW|CFBP2044_00050
And here an example of the first problematic feature:
gnl|ZW|CFBP2044_1 prokka gene 47283 47358 . - . ID=CFBP2044_00380_gene;locus_tag=CFBP2044_00380
gnl|ZW|CFBP2044_1 Infernal:001001 misc_RNA 47283 47358 68.8 - . ID=CFBP2044_00380;Parent=CFBP2044_00380_gene,CFBP2044_00380_mRNA;Note="Xanthomonas sRNA sX9";accession=RF02228;inference=COORDINATES:profile:Infernal:001001;locus_tag=CFBP2044_00380;product=sX9
gnl|ZW|CFBP2044_1 prokka mRNA 47283 47358 . - . ID=CFBP2044_00380_mRNA;Parent=CFBP2044_00380_gene;locus_tag=CFBP2044_00380
gnl|ZW|CFBP2044_1 prokka exon 47283 47358 . - . ID=nbis-exon-38;Parent=CFBP2044_00380_mRNA;locus_tag=CFBP2044_00380
Thanks in advance
OK, the syntax of the two samples looks fine. I was wondering whether you had run Prokka with extra parameters that could mess up the gff.
You show 2 problematic features (I would call them records: several features linked to each other). In the second case the problem is quite clear: you have a misc_RNA and an mRNA at the same location, held by the same gene. Only one of them should be kept. We should check whether it is AGAT that introduces the mRNA. Could you show all features at this location (47283 47358) before running AGAT?
Sorry the first problematic record I provided for the original file was the wrong locus tag. I have now edited my post to show all features at this location. The mRNA is already introduced by Prokka. Thanks
Interesting, the problem is related to Prokka then. It should not provide both misc_RNA and mRNA for the same location; it should define only one. You will have to edit the file manually to remove the duplicated features.
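Before editing by hand, it helps to list every location that carries both an mRNA and a non-coding RNA feature. The following Python sketch does that under some assumptions: tab-separated 9-column GFF3 lines, and a non-coding set limited to the three types mentioned in this thread. The function name `conflicting_rna_locations` is hypothetical, and the attribute column of the example lines is abbreviated from the sample posted above.

```python
from collections import defaultdict

# Feature types reported as duplicated in this thread.
NONCODING_TYPES = {"misc_RNA", "tRNA", "rRNA"}

def conflicting_rna_locations(gff_lines):
    """Return sorted (seqid, start, end, strand) locations that hold
    both an mRNA and at least one non-coding RNA feature."""
    types_at = defaultdict(set)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        seqid, _src, ftype, start, end, _score, strand = cols[:7]
        types_at[(seqid, int(start), int(end), strand)].add(ftype)
    return sorted(
        loc for loc, types in types_at.items()
        if "mRNA" in types and types & NONCODING_TYPES
    )

example = [
    "gnl|ZW|CFBP2044_1\tprokka\tgene\t1\t1329\t.\t+\t.\tID=CFBP2044_00010_gene",
    "gnl|ZW|CFBP2044_1\tprokka\tmRNA\t1\t1329\t.\t+\t.\tID=CFBP2044_00010_mRNA",
    "gnl|ZW|CFBP2044_1\tProdigal:002006\tCDS\t1\t1329\t.\t+\t0\tID=CFBP2044_00010",
    "gnl|ZW|CFBP2044_1\tprokka\tgene\t47283\t47358\t.\t-\t.\tID=CFBP2044_00380_gene",
    "gnl|ZW|CFBP2044_1\tprokka\tmRNA\t47283\t47358\t.\t-\t.\tID=CFBP2044_00380_mRNA",
    "gnl|ZW|CFBP2044_1\tInfernal:001001\tmisc_RNA\t47283\t47358\t68.8\t-\t.\tID=CFBP2044_00380",
]

for loc in conflicting_rna_locations(example):
    print(loc)
```

Each reported location is a candidate where the mRNA line (and any exon derived from it) could be removed so that only the non-coding RNA remains.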
OK thanks a lot for the speedy reply. I reported this problem on the Prokka Github https://github.com/tseemann/prokka/issues/506
I have the same issue. When I try to validate the embl file processed by EMBLmyGFF3 I have a lot of those duplicated locations. Sorry for my question that is not directly related to EMBLmyGFF3 . I'm desperately searching for the gxf_to_gff3.pl file. I can't seem to find it in the GAAS Toolkit....
Thank you for your understanding and assistance.
Right, it is now called agat_convert_sp_gxf2gxf.pl and is available in AGAT. The other script is called agat_sp_fix_features_locations_duplicated.pl and is available in AGAT too.
I believe I managed to encounter all the problems described in this thread! :)
I resolved them all thanks to your answers @Juke34. Thanks a lot !
I still have one issue. I have several lines that looks like this one:
ERROR: Invalid amino acid "l" in translation. [ line: 1630948 of My_Org_noexons.embl.gz]
I am using webin-cli-6.9.0 to validate the embl file I generated with EMBLmyGFF3.
I am trying to validate an ascidian genomic assembly with several scaffolds (assembled with HiFiasm). One of them is mitochondrial (assembled with mitoHiFi). In the embl file I have manually changed
/transl_table=1 to /transl_table=13
for all mitochondrial CDS.
For example, I have this in the embl file:
FT gene 130..471
FT /locus_tag="CVLEPA_LOCUS11895"
FT /note="ID:Cvlepa.mt.BANY2021.S176.g032142"
FT /note="source:mitoHiFi"
FT /standard_name="ND3"
FT mRNA 130..471
FT /locus_tag="CVLEPA_LOCUS11895"
FT /note="ID:Cvlepa.mt.BANY2021.S176.g032142.01.t"
FT /note="source:mitoHiFi"
FT CDS <130..>471
FT /codon_start=2
FT /locus_tag="CVLEPA_LOCUS11895"
FT /note="ID:Cvlepa.mt.BANY2021.S176.g032142.01.p.cds"
FT /note="source:mitoHiFi"
FT /standard_name="ND3"
FT /transl_table=13
FT /translation="length.114"
(transl_table=13 is for "The Ascidian Mitochondrial Code")
I still have that Invalid amino acid "l" message. Not very sure everything I am doing is legit.
Any suggestion will be greatly appreciated.
Thanks !
/translation="length.114" is wrong; it is supposed to be an amino acid string. I guess the problem is already in your GFF file.
The translation is not mandatory. You can remove it when processing with EMBLmyGFF3, or you can populate it with the --translate option of EMBLmyGFF3.
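To spot such values before validation, one can test each translation string against an amino acid alphabet. This is a minimal Python sketch; the accepted alphabet (the 20 standard one-letter codes plus X) is an assumption and may differ from what the ENA validator allows, and the function name `invalid_residues` is hypothetical.

```python
# Assumed alphabet: the 20 standard amino acids plus X for unknown residues.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWYX")

def invalid_residues(translation):
    """Return the sorted list of distinct characters in the translation
    that are not part of the assumed amino acid alphabet."""
    return sorted({ch for ch in translation if ch not in ALLOWED})

for value in ("length.114", "MKTAYIAKQR"):
    bad = invalid_residues(value)
    print(repr(value), "->", "ok" if not bad else f"invalid characters: {bad}")
```

The check is case-sensitive on purpose, which is why the lower-case "l" at the start of "length.114" is flagged, in line with the validator message quoted above.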
Oh ! Thanks !
I should have read embl file specs :/
No worries, specs are particularly verbose...
Thanks for this nice tool. I'm running into an issue trying to validate embl files that were generated on your tool. I'm using webin-cli-1.7.1 and it throws up the below error when I try to validate/submit the embl files
ERROR: "tRNA" Features locations are duplicated - consider merging qualifiers.
The command-line I used is this:
EMBLmyGFF3 test/6666666.419437.gff test/6666666.419437.contigs.fa -o test/test_new.embl
Any help in this regard would be highly appreciated