NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit
GNU General Public License v3.0

agat_convert_sp_gxf2gxf.pl on ncbi gff file generating "nbis-gene-xx" genes #366

Closed mossconfuse closed 1 year ago

mossconfuse commented 1 year ago

I am trying to generate a gtf file containing nuclear and plastid genomes. The nuclear genomes are available on Phytozome, and agat_convert_sp_gff2gtf.pl runs without a hitch on them.

The chloroplast (NC_005087) and mitochondrial (KY126309) genomes are only available through NCBI. When I run agat_convert_sp_gff2gtf.pl on these, new IDs of the form "nbis-gene-xx" are generated for many of the features.

ChrMT   Genbank gene    5235    7103    .   -   .   gene_id "nbis-gene-1"; ID "nbis-gene-1"; Name "ORF622"; gbkey "Gene"; gene "ORF622"; gene_biotype "protein_coding";
ChrMT   Genbank transcript  5235    7103    .   -   .   gene_id "nbis-gene-1"; transcript_id "gene-ORF622"; ID "gene-ORF622"; Name "ORF622"; Parent "nbis-gene-1"; gbkey "Gene"; gene "ORF622"; gene_biotype "protein_coding"; original_biotype "mrna";
ChrMT   Genbank exon    5235    7103    .   -   .   gene_id "nbis-gene-1"; transcript_id "gene-ORF622"; Dbxref "NCBI_GP:ARI44077.1"; ID "nbis-exon-1"; Name "ARI44077.1"; Parent "gene-ORF622"; gbkey "CDS"; gene "ORF622"; product "hypothetical protein"; protein_id "ARI44077.1";
ChrMT   Genbank CDS 5235    7103    .   -   0   gene_id "nbis-gene-1"; transcript_id "gene-ORF622"; Dbxref "NCBI_GP:ARI44077.1"; ID "cds-ARI44077.1"; Name "ARI44077.1"; Parent "gene-ORF622"; gbkey "CDS"; gene "ORF622"; product "hypothetical protein"; protein_id "ARI44077.1";
ChrMT   Genbank gene    5235    7103    .   -   .   gene_id "nbis-gene-2"; ID "nbis-gene-2"; Name "ORF622"; gbkey "Gene"; gene "ORF622"; gene_biotype "protein_coding";
ChrMT   Genbank transcript  5235    7103    .   -   .   gene_id "nbis-gene-2"; transcript_id "gene-ORF622-2"; ID "gene-ORF622-2"; Name "ORF622"; Parent "nbis-gene-2"; gbkey "Gene"; gene "ORF622"; gene_biotype "protein_coding"; original_biotype "mrna";
ChrMT   Genbank exon    5235    7103    .   -   .   gene_id "nbis-gene-2"; transcript_id "gene-ORF622-2"; Dbxref "NCBI_GP:ARI44077.1"; ID "nbis-exon-2"; Name "ARI44077.1"; Parent "gene-ORF622-2"; gbkey "CDS"; gene "ORF622"; product "hypothetical protein"; protein_id "ARI44077.1";
ChrMT   Genbank CDS 5235    7103    .   -   0   gene_id "nbis-gene-2"; transcript_id "gene-ORF622-2"; Dbxref "NCBI_GP:ARI44077.1"; ID "cds-ARI44077.1-2"; Name "ARI44077.1"; Parent "gene-ORF622-2"; gbkey "CDS"; gene "ORF622"; product "hypothetical protein"; protein_id "ARI44077.1";
ChrMT   Genbank gene    2893    10494   .   -   .   gene_id "nbis-gene-13"; ID "nbis-gene-13"; Name "cox1"; gbkey "Gene"; gene "cox1"; gene_biotype "protein_coding"; locus_tag "Pp3M_20V3";
ChrMT   Genbank transcript  2893    10494   .   -   .   gene_id "nbis-gene-13"; transcript_id "gene-Pp3M_20V3"; ID "gene-Pp3M_20V3"; Name "cox1"; Parent "nbis-gene-13"; gbkey "Gene"; gene "cox1"; gene_biotype "protein_coding"; locus_tag "Pp3M_20V3"; original_biotype "mrna";

These genes already have IDs in the gff file, so I am not sure why this is happening. I also tried converting the files to gtf using gffread. It works:

(base) gffread PpMT.gff3 -T -o PpMT_gffread.gtf3
Error: discarding overlapping duplicate region feature (1-105340) with ID=ChrMT:1..105340
(base) head PpMT_gffread.gtf3             
ChrMT   Genbank transcript  155 228 .   +   .   transcript_id "rna-Pp3M_1V3"; gene_id "gene-Pp3M_1V3"; gene_name "trnM-CAU"
ChrMT   Genbank exon    155 228 .   +   .   transcript_id "rna-Pp3M_1V3"; gene_id "gene-Pp3M_1V3"; gene_name "trnM-CAU";
ChrMT   Genbank transcript  155 228 .   +   .   transcript_id "rna-Pp3M_1V3-2"; gene_id "gene-Pp3M_1V3-2"; gene_name "trnM-CAU"
ChrMT   Genbank exon    155 228 .   +   .   transcript_id "rna-Pp3M_1V3-2"; gene_id "gene-Pp3M_1V3-2"; gene_name "trnM-CAU";
ChrMT   Genbank transcript  1078    1150    .   +   .   transcript_id "rna-Pp3M_10V3"; gene_id "gene-Pp3M_10V3"; gene_name "trnK-UUU"
ChrMT   Genbank exon    1078    1150    .   +   .   transcript_id "rna-Pp3M_10V3"; gene_id "gene-Pp3M_10V3"; gene_name "trnK-UUU";
ChrMT   Genbank transcript  1078    1150    .   +   .   transcript_id "rna-Pp3M_10V3-2"; gene_id "gene-Pp3M_10V3-2"; gene_name "trnK-UUU"
ChrMT   Genbank exon    1078    1150    .   +   .   transcript_id "rna-Pp3M_10V3-2"; gene_id "gene-Pp3M_10V3-2"; gene_name "trnK-UUU";
ChrMT   Genbank transcript  2893    10494   .   -   .   transcript_id "gene-Pp3M_20V3"; gene_id "gene-Pp3M_20V3"; gene_name "cox1"
ChrMT   Genbank exon    2893    3397    .   -   .   transcript_id "gene-Pp3M_20V3"; gene_name "cox1";

I tried to convert them back to gff using AGAT so that I could concatenate them with the Phytozome gff and then run agat_convert_sp_gff2gtf.pl on everything together, but this reintroduces the "nbis" names. Is there a workaround, or a simple parameter that I need to adjust to fix this?

Juke34 commented 1 year ago

What AGAT version are you using? I think this has been fixed in a recent release.

B10inform commented 1 year ago

Hi Juke,

I am using version 1.0.0. agat_sp_filter_feature_from_kill_list.pl is doing the same. It adds new nbis-exons. The output is larger than the input. Any help would be really great. Thanks

Scaffold_1 maker gene 3492427 3498117 . + . ID=XP_08099;Name=XP_08099;
Scaffold_1 maker mRNA 3492427 3498117 . + . ID=XP_08099-RA;Parent=XP_08099;
Scaffold_1 maker exon 3492427 3492994 . + . ID=nbis-exon-16589;Parent=XP_08099-RA
Scaffold_1 maker exon 3493502 3493629 . + . ID=nbis-exon-16590;Parent=XP_08099-RA
Scaffold_1 maker exon 3493759 3493830 . + . ID=nbis-exon-16591;Parent=XP_08099-RA
Scaffold_1 maker exon 3493903 3494016 . + . ID=nbis-exon-16592;Parent=XP_08099-RA
Scaffold_1 maker exon 3494393 3494461 . + . ID=nbis-exon-16593;Parent=XP_08099-RA
Scaffold_1 maker exon 3494554 3494610 . + . ID=nbis-exon-16594;Parent=XP_08099-RA
Scaffold_1 maker exon 3495358 3495411 . + . ID=nbis-exon-16595;Parent=XP_08099-RA
Scaffold_1 maker exon 3495485 3495547 . + . ID=nbis-exon-16596;Parent=XP_08099-RA
Scaffold_1 maker exon 3495708 3495770 . + . ID=nbis-exon-16597;Parent=XP_08099-RA
Scaffold_1 maker exon 3496121 3496285 . + . ID=nbis-exon-16598;Parent=XP_08099-RA
Scaffold_1 maker exon 3496364 3496519 . + . ID=nbis-exon-16599;Parent=XP_08099-RA
Scaffold_1 maker exon 3496827 3496907 . + . ID=nbis-exon-16600;Parent=XP_08099-RA
Scaffold_1 maker exon 3497023 3497068 . + . ID=XP_08099-RA:17;Parent=XP_08099-RA
Scaffold_1 maker exon 3497219 3497322 . + . ID=nbis-exon-16601;Parent=XP_08099-RA
Scaffold_1 maker exon 3497648 3498117 . + . ID=nbis-exon-16602;Parent=XP_08099-RA
Scaffold_1 maker CDS 3492823 3492994 . + 0 ID=XP_08099-RA:cds;Parent=XP_08099-RA
Scaffold_1 maker CDS 3493502 3493629 . + 2 ID=XP_08099-RA:cds;Parent=XP_08099-RA
Scaffold_1 maker CDS 3493759 3493830 . + 0 ID=XP_08099-RA:cds;Parent=XP_08099-RA
Scaffold_1 maker CDS 3493903 3494016 . + 0 ID=XP_08099-RA:cds;Parent=XP_08099-RA
Scaffold_1 maker CDS 3494393 3494461 . + 0 ID=XP_08099-RA:cds;Parent=XP_08099-RA
Scaffold_1 maker CDS 3494554 3494610 . + 0 ID=XP_08099-RA:cds;Parent=XP_08099-RA
Scaffold_1 maker CDS 3495358 3495411 . + 0 ID=XP_08099-RA:cds;Parent=XP_08099-RA
Scaffold_1 maker CDS 3495485 3495547 . + 0 ID=XP_08099-RA:cds;Parent=XP_08099-RA
Scaffold_1 maker CDS 3495708 3495770 . + 0 ID=XP_08099-RA:cds;Parent=XP_08099-RA
Scaffold_1 maker CDS 3496121 3496285 . + 0 ID=XP_08099-RA:cds;Parent=XP_08099-RA
Scaffold_1 maker CDS 3496364 3496519 . + 0 ID=XP_08099-RA:cds;Parent=XP_08099-RA
Scaffold_1 maker CDS 3496827 3496907 . + 0 ID=XP_08099-RA:cds;Parent=XP_08099-RA
Scaffold_1 maker CDS 3497023 3497068 . + 0 ID=XP_08099-RA:cds;Parent=XP_08099-RA
Scaffold_1 maker CDS 3497219 3497322 . + 2 ID=XP_08099-RA:cds;Parent=XP_08099-RA
Scaffold_1 maker CDS 3497648 3497689 . + 0 ID=XP_08099-RA:cds;Parent=XP_08099-RA
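
To quantify how many features were added, one option is to count the IDs that carry the "nbis-" prefix, grouped by feature type. The sketch below is only an illustration: the function name count_generated_ids is hypothetical, and it assumes that every generated feature has an ID starting with "nbis-", as in the records shown in this thread.

```python
from collections import Counter

def count_generated_ids(gff_lines, prefix="nbis-"):
    """Count features per type whose ID attribute starts with `prefix`."""
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        # Split on any whitespace so that both tab- and space-separated
        # records (as pasted in this thread) are handled.
        cols = line.strip().split(None, 8)
        if len(cols) < 9:
            continue
        for field in cols[8].split(";"):
            key, sep, value = field.strip().partition("=")
            if sep and key == "ID" and value.startswith(prefix):
                counts[cols[2]] += 1
    return counts

sample = [
    "Scaffold_1 maker mRNA 3492427 3498117 . + . ID=XP_08099-RA;Parent=XP_08099;",
    "Scaffold_1 maker exon 3492427 3492994 . + . ID=nbis-exon-16589;Parent=XP_08099-RA",
    "Scaffold_1 maker exon 3497023 3497068 . + . ID=XP_08099-RA:17;Parent=XP_08099-RA",
    "Scaffold_1 maker CDS 3492823 3492994 . + 0 ID=XP_08099-RA:cds;Parent=XP_08099-RA",
]
generated = count_generated_ids(sample)
print(dict(generated))
```

Running the same function on the input and on the output file would show which feature types account for the size difference.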

Juke34 commented 1 year ago

AGAT creates such new exons where information is missing: it deduces exons from CDS and UTRs. If you don't want this behavior, you need to modify the global parameter file config.yaml (expose it with agat config --expose) and set the check_exons parameter to false.
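
As a rough illustration of that config change, the sketch below flips a boolean key in YAML text. It assumes the exposed file contains a line of the form "check_exons: true"; that layout and the helper name set_yaml_flag are assumptions, not something confirmed in this thread. Editing the file by hand works just as well.

```python
import re

def set_yaml_flag(config_text: str, key: str, value: bool) -> str:
    """Return config_text with the `key: <value>` line set to true/false.

    Assumes a simple one-line `key: value` layout (hypothetical here);
    any trailing comment on that line is preserved.
    """
    pattern = re.compile(
        rf"^([ \t]*{re.escape(key)}[ \t]*:[ \t]*)\S+(.*)$", re.MULTILINE
    )
    new_text, n = pattern.subn(
        lambda m: f"{m.group(1)}{str(value).lower()}{m.group(2)}", config_text
    )
    if n == 0:
        raise KeyError(f"{key} not found in config text")
    return new_text

# Example config content (made up for illustration).
example = "check_cds: true\ncheck_exons: true # deduce missing exons\n"
updated = set_yaml_flag(example, "check_exons", False)
print(updated)
```

In practice the text would be read from the exposed config file and written back to it before rerunning the AGAT script.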

B10inform commented 1 year ago

Hi Juke34,

Thank you for the information. I also see some other changes:

Contig_10 maker gene 122303 128021 . + . ID=IDmodified-gene-20274;Name=XP_4916930;coverage=1.0;sequence_ID=1.0;valid_ORFs=1;;copy_num_ID=IDmodified-gene-20274_0
Contig_10 maker mRNA 122303 128021 . + . ID=IDmodified-mrna-20274;Parent=IDmodified-gene-20274;Name=XP_4916930-RA;
Contig_10 maker exon 122303 122814 . + . ID=IDmodified-exon-49587;Parent=IDmodified-mrna-20274;
Contig_10 maker exon 127931 128021 . + . ID=IDmodified-exon-49588;Parent=IDmodified-mrna-20274;
Contig_10 maker CDS 122303 122814 . + . ID=IDmodified-cds-90256;Parent=IDmodified-mrna-20274;
Contig_10 maker CDS 127931 128021 . + . ID=IDmodified-cds-90257;Parent=IDmodified-mrna-20274;

How do I get rid of these additions?

Thanks

Juke34 commented 1 year ago

I guess this one cannot be avoided. AGAT needs to follow some specifications at some point, and the minimum needed to deal with GFF/GTF is to have proper relationships between the features. I guess that if the IDs have been modified, it is because they were either missing or wrong (e.g. shared between features that do not span several locations the way a CDS does, or shared by features that are not on the same sequence, etc.). To better understand what is going on, you could look at the log or share the input for that record.
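
To check an input file for the two situations mentioned above, a small script can list IDs that are reused across sequences or reused by features that are not a multi-location feature such as CDS. This is a heuristic sketch based only on the description in this comment, not AGAT's actual logic; the function name find_suspicious_ids and the choice to treat only CDS as multi-location are assumptions.

```python
from collections import defaultdict

def find_suspicious_ids(gff_lines, multi_location_types=("CDS",)):
    """Map each reused ID to a short reason why the reuse looks invalid."""
    uses = defaultdict(list)  # ID -> list of (seqid, feature type)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        cols = line.strip().split(None, 8)  # tabs or spaces
        if len(cols) < 9:
            continue
        attrs = dict(
            field.strip().split("=", 1)
            for field in cols[8].split(";")
            if "=" in field
        )
        if "ID" in attrs:
            uses[attrs["ID"]].append((cols[0], cols[2]))

    allowed = set(multi_location_types)
    problems = {}
    for feature_id, seen in uses.items():
        if len(seen) < 2:
            continue  # unique ID, nothing to report
        seqids = {seqid for seqid, _ in seen}
        types = {ftype for _, ftype in seen}
        if len(seqids) > 1:
            problems[feature_id] = "shared across sequences"
        elif len(types) > 1 or not types <= allowed:
            problems[feature_id] = "shared by non multi-location features"
    return problems

# Made-up records for illustration.
sample = [
    "Contig_10\tmaker\tgene\t1\t100\t.\t+\t.\tID=g1",
    "Contig_10\tmaker\tmRNA\t1\t100\t.\t+\t.\tID=m1;Parent=g1",
    "Contig_10\tmaker\texon\t1\t50\t.\t+\t.\tID=e1;Parent=m1",
    "Contig_10\tmaker\texon\t60\t100\t.\t+\t.\tID=e1;Parent=m1",
    "Contig_10\tmaker\tCDS\t1\t50\t.\t+\t0\tID=c1;Parent=m1",
    "Contig_10\tmaker\tCDS\t60\t100\t.\t+\t1\tID=c1;Parent=m1",
    "Contig_11\tmaker\tgene\t1\t100\t.\t+\t.\tID=g1",
]
problems = find_suspicious_ids(sample)
print(problems)
```

IDs reported this way are candidates for being rewritten; the AGAT log remains the authoritative source for what was actually changed and why.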