NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit
GNU General Public License v3.0

Guidance to convert GTF to GFF #421

Closed Juke34 closed 9 months ago

Juke34 commented 9 months ago

From a user...

Here is an example of my GTF:

scaffold_1  JGI exon    881 1130    .   +   .   name "CE1_30"; transcriptId 258
scaffold_1  JGI exon    1217    2499    .   +   .   name "CE1_30"; transcriptId 258
scaffold_1  JGI CDS 1318    2286    .   +   0   name "CE1_30"; proteinId 2; exonNumber 1
scaffold_1  JGI start_codon 1318    1320    .   +   0   name "CE1_30"
scaffold_1  JGI stop_codon  2284    2286    .   +   0   name "CE1_30"
scaffold_1  JGI exon    3466    3999    .   +   .   name "fgenesh1_kg.1_#_2_#_TRINITY_DN906_c0_g1_i1"; transcriptId 183672
scaffold_1  JGI CDS 3948    3999    .   +   0   name "fgenesh1_kg.1_#_2_#_TRINITY_DN906_c0_g1_i1"; proteinId 183416; exonNumber 1
scaffold_1  JGI start_codon 3948    3950    .   +   0   name "fgenesh1_kg.1_#_2_#_TRINITY_DN906_c0_g1_i1"
scaffold_1  JGI exon    4069    4467    .   +   .   name "fgenesh1_kg.1_#_2_#_TRINITY_DN906_c0_g1_i1"; transcriptId 183672
scaffold_1  JGI CDS 4069    4467    .   +   2   name "fgenesh1_kg.1_#_2_#_TRINITY_DN906_c0_g1_i1"; proteinId 183416; exonNumber 2
scaffold_1  JGI exon    4526    4603    .   +   .   name "fgenesh1_kg.1_#_2_#_TRINITY_DN906_c0_g1_i1"; transcriptId 183672
scaffold_1  JGI CDS 4526    4603    .   +   2   name "fgenesh1_kg.1_#_2_#_TRINITY_DN906_c0_g1_i1"; proteinId 183416; exonNumber 3
scaffold_1  JGI exon    4676    4733    .   +   .   name "fgenesh1_kg.1_#_2_#_TRINITY_DN906_c0_g1_i1"; transcriptId 183672
scaffold_1  JGI CDS 4676    4733    .   +   2   name "fgenesh1_kg.1_#_2_#_TRINITY_DN906_c0_g1_i1"; proteinId 183416; exonNumber 4
scaffold_1  JGI exon    4816    5089    .   +   .   name "fgenesh1_kg.1_#_2_#_TRINITY_DN906_c0_g1_i1"; transcriptId 183672
scaffold_1  JGI CDS 4816    4996    .   +   1   name "fgenesh1_kg.1_#_2_#_TRINITY_DN906_c0_g1_i1"; proteinId 183416; exonNumber 5
scaffold_1  JGI stop_codon  4994    4996    .   +   0   name "fgenesh1_kg.1_#_2_#_TRINITY_DN906_c0_g1_i1"
scaffold_1  JGI exon    5866    7045    .   -   .   name "MIX1_22_76"; transcriptId 232061
scaffold_1  JGI CDS 5866    7045    .   -   1   name "MIX1_22_76"; proteinId 231805; exonNumber 2
scaffold_1  JGI stop_codon  5866    5868    .   -   0   name "MIX1_22_76"
scaffold_1  JGI exon    7100    7521    .   -   .   name "MIX1_22_76"; transcriptId 232061
scaffold_1  JGI CDS 7100    7521    .   -   0   name "MIX1_22_76"; proteinId 231805; exonNumber 1
scaffold_1  JGI start_codon 7519    7521    .   -   0   name "MIX1_22_76"
scaffold_1  JGI exon    8869    9571    .   -   .   name "MIX4_47_93"; transcriptId 232064
scaffold_1  JGI CDS 8924    9571    .   -   0   name "MIX4_47_93"; proteinId 231808; exonNumber 3
scaffold_1  JGI stop_codon  8924    8926    .   -   0   name "MIX4_47_93"
scaffold_1  JGI exon    9630    9910    .   -   .   name "MIX4_47_93"; transcriptId 232064
scaffold_1  JGI CDS 9630    9910    .   -   2   name "MIX4_47_93"; proteinId 231808; exonNumber 2
scaffold_1  JGI exon    9978    10572   .   -   .   name "MIX4_47_93"; transcriptId 232064
scaffold_1  JGI CDS 9978    10074   .   -   0   name "MIX4_47_93"; proteinId 231808; exonNumber 1
scaffold_1  JGI start_codon 10072   10074   .   -   0   name "MIX4_47_93"
scaffold_1  JGI exon    11810   12118   .   +   .   name "MIX8_23_76"; transcriptId 232068
scaffold_1  JGI CDS 12016   12118   .   +   0   name "MIX8_23_76"; proteinId 231812; exonNumber 1
scaffold_1  JGI start_codon 12016   12018   .   +   0   name "MIX8_23_76"
scaffold_1  JGI exon    12187   12431   .   +   .   name "MIX8_23_76"; transcriptId 232068
scaffold_1  JGI CDS 12187   12431   .   +   2   name "MIX8_23_76"; proteinId 231812; exonNumber 2
scaffold_1  JGI exon    12488   12937   .   +   .   name "MIX8_23_76"; transcriptId 232068
scaffold_1  JGI CDS 12488   12937   .   +   0   name "MIX8_23_76"; proteinId 231812; exonNumber 3
scaffold_1  JGI exon    13024   13602   .   +   .   name "MIX8_23_76"; transcriptId 232068
scaffold_1  JGI CDS 13024   13602   .   +   0   name "MIX8_23_76"; proteinId 231812; exonNumber 4
scaffold_1  JGI stop_codon  13600   13602   .   +   0   name "MIX8_23_76"
scaffold_1  JGI exon    13419   14510   .   -   .   name "fgenesh1_kg.1_#_7_#_TRINITY_DN851_c0_g1_i1"; transcriptId 183677
scaffold_1  JGI CDS 13676   14401   .   -   0   name "fgenesh1_kg.1_#_7_#_TRINITY_DN851_c0_g1_i1"; proteinId 183421; exonNumber 1
scaffold_1  JGI start_codon 14399   14401   .   -   0   name "fgenesh1_kg.1_#_7_#_TRINITY_DN851_c0_g1_i1"
scaffold_1  JGI stop_codon  13676   13678   .   -   0   name "fgenesh1_kg.1_#_7_#_TRINITY_DN851_c0_g1_i1"

Here is the GFF output:

##gtf-version X
# GFF-like GTF i.e. not checked against any GTF specification. Conversion based on GFF input, standardised by AGAT.
scaffold_1  JGI gene    881 14510   .   +   .   gene_id "nbis-gene-2"; ID "nbis-gene-2"; name "CE1_30"; transcriptId "258";
scaffold_1  JGI mRNA    881 14510   .   +   .   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "nbis-rna-1"; Parent "nbis-gene-2"; name "CE1_30"; transcriptId "258";
scaffold_1  JGI exon    881 1130    .   +   .   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "exon-1"; Parent "nbis-rna-1"; name "CE1_30"; transcriptId "258";
scaffold_1  JGI exon    1217    2499    .   +   .   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "exon-2"; Parent "nbis-rna-1"; name "CE1_30"; transcriptId "258";
scaffold_1  JGI exon    3466    3999    .   +   .   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "exon-3"; Parent "nbis-rna-1"; name "fgenesh1_kg.1_#_2_#_TRINITY_DN906_c0_g1_i1"; transcriptId "183672";
scaffold_1  JGI exon    4069    4467    .   +   .   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "exon-4"; Parent "nbis-rna-1"; name "fgenesh1_kg.1_#_2_#_TRINITY_DN906_c0_g1_i1"; transcriptId "183672";
scaffold_1  JGI exon    4526    4603    .   +   .   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "exon-5"; Parent "nbis-rna-1"; name "fgenesh1_kg.1_#_2_#_TRINITY_DN906_c0_g1_i1"; transcriptId "183672";
scaffold_1  JGI exon    4676    4733    .   +   .   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "exon-6"; Parent "nbis-rna-1"; name "fgenesh1_kg.1_#_2_#_TRINITY_DN906_c0_g1_i1"; transcriptId "183672";
scaffold_1  JGI exon    4816    5089    .   +   .   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "exon-7"; Parent "nbis-rna-1"; name "fgenesh1_kg.1_#_2_#_TRINITY_DN906_c0_g1_i1"; transcriptId "183672";
scaffold_1  JGI exon    5866    7045    .   -   .   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "exon-8"; Parent "nbis-rna-1"; name "MIX1_22_76"; transcriptId "232061";
scaffold_1  JGI exon    7100    7521    .   -   .   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "exon-9"; Parent "nbis-rna-1"; name "MIX1_22_76"; transcriptId "232061";
scaffold_1  JGI exon    8869    9571    .   -   .   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "exon-10"; Parent "nbis-rna-1"; name "MIX4_47_93"; transcriptId "232064";
scaffold_1  JGI exon    9630    9910    .   -   .   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "exon-11"; Parent "nbis-rna-1"; name "MIX4_47_93"; transcriptId "232064";
scaffold_1  JGI exon    9978    10572   .   -   .   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "exon-12"; Parent "nbis-rna-1"; name "MIX4_47_93"; transcriptId "232064";
scaffold_1  JGI exon    11810   12118   .   +   .   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "exon-13"; Parent "nbis-rna-1"; name "MIX8_23_76"; transcriptId "232068";
scaffold_1  JGI exon    12187   12431   .   +   .   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "exon-14"; Parent "nbis-rna-1"; name "MIX8_23_76"; transcriptId "232068";
scaffold_1  JGI exon    12488   12937   .   +   .   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "exon-15"; Parent "nbis-rna-1"; name "MIX8_23_76"; transcriptId "232068";
scaffold_1  JGI exon    13024   14510   .   +   .   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "exon-16"; Parent "nbis-rna-1"; name "MIX8_23_76"; transcriptId "232068";
scaffold_1  JGI CDS 1318    2286    .   +   0   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "cds-1"; Parent "nbis-rna-1"; exonNumber "1"; name "CE1_30"; proteinId "2";
scaffold_1  JGI CDS 3948    3999    .   +   0   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "cds-2"; Parent "nbis-rna-1"; exonNumber "1"; name "fgenesh1_kg.1_#_2_#_TRINITY_DN906_c0_g1_i1"; proteinId "183416";
scaffold_1  JGI CDS 4069    4467    .   +   2   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "cds-3"; Parent "nbis-rna-1"; exonNumber "2"; name "fgenesh1_kg.1_#_2_#_TRINITY_DN906_c0_g1_i1"; proteinId "183416";
scaffold_1  JGI CDS 4526    4603    .   +   2   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "cds-4"; Parent "nbis-rna-1"; exonNumber "3"; name "fgenesh1_kg.1_#_2_#_TRINITY_DN906_c0_g1_i1"; proteinId "183416";
scaffold_1  JGI CDS 4676    4733    .   +   2   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "cds-5"; Parent "nbis-rna-1"; exonNumber "4"; name "fgenesh1_kg.1_#_2_#_TRINITY_DN906_c0_g1_i1"; proteinId "183416";
scaffold_1  JGI CDS 4816    4996    .   +   1   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "cds-6"; Parent "nbis-rna-1"; exonNumber "5"; name "fgenesh1_kg.1_#_2_#_TRINITY_DN906_c0_g1_i1"; proteinId "183416";
scaffold_1  JGI CDS 5866    7045    .   -   1   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "cds-7"; Parent "nbis-rna-1"; exonNumber "2"; name "MIX1_22_76"; proteinId "231805";
scaffold_1  JGI CDS 7100    7521    .   -   0   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "cds-8"; Parent "nbis-rna-1"; exonNumber "1"; name "MIX1_22_76"; proteinId "231805";
scaffold_1  JGI CDS 8924    9571    .   -   0   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "cds-9"; Parent "nbis-rna-1"; exonNumber "3"; name "MIX4_47_93"; proteinId "231808";
scaffold_1  JGI CDS 9630    9910    .   -   2   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "cds-10"; Parent "nbis-rna-1"; exonNumber "2"; name "MIX4_47_93"; proteinId "231808";
scaffold_1  JGI CDS 9978    10074   .   -   0   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "cds-11"; Parent "nbis-rna-1"; exonNumber "1"; name "MIX4_47_93"; proteinId "231808";
scaffold_1  JGI CDS 12016   12118   .   +   0   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "cds-12"; Parent "nbis-rna-1"; exonNumber "1"; name "MIX8_23_76"; proteinId "231812";
scaffold_1  JGI CDS 12187   12431   .   +   2   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "cds-13"; Parent "nbis-rna-1"; exonNumber "2"; name "MIX8_23_76"; proteinId "231812";
scaffold_1  JGI CDS 12488   12937   .   +   0   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "cds-14"; Parent "nbis-rna-1"; exonNumber "3"; name "MIX8_23_76"; proteinId "231812";
scaffold_1  JGI CDS 13024   13602   .   +   0   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "cds-15"; Parent "nbis-rna-1"; exonNumber "4"; name "MIX8_23_76"; proteinId "231812";
scaffold_1  JGI CDS 13676   14401   .   -   0   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "cds-16"; Parent "nbis-rna-1"; exonNumber "1"; name "fgenesh1_kg.1_#_7_#_TRINITY_DN851_c0_g1_i1"; proteinId "183421";
scaffold_1  JGI five_prime_UTR  881 1130    .   +   .   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "nbis-five_prime_utr-1"; Parent "nbis-rna-1"; name "CE1_30"; transcriptId "258";
scaffold_1  JGI five_prime_UTR  1217    1317    .   +   .   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "nbis-five_prime_utr-1"; Parent "nbis-rna-1"; name "CE1_30"; transcriptId "258";
scaffold_1  JGI start_codon 1318    1320    .   +   0   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "start_codon-1"; Parent "nbis-rna-1"; name "CE1_30";
scaffold_1  JGI start_codon 3948    3950    .   +   0   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "start_codon-2"; Parent "nbis-rna-1"; name "fgenesh1_kg.1_#_2_#_TRINITY_DN906_c0_g1_i1";
scaffold_1  JGI start_codon 7519    7521    .   -   0   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "start_codon-3"; Parent "nbis-rna-1"; name "MIX1_22_76";
scaffold_1  JGI start_codon 10072   10074   .   -   0   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "start_codon-4"; Parent "nbis-rna-1"; name "MIX4_47_93";
scaffold_1  JGI start_codon 12016   12018   .   +   0   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "start_codon-5"; Parent "nbis-rna-1"; name "MIX8_23_76";
scaffold_1  JGI start_codon 14399   14401   .   -   0   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "start_codon-6"; Parent "nbis-rna-1"; name "fgenesh1_kg.1_#_7_#_TRINITY_DN851_c0_g1_i1";
scaffold_1  JGI stop_codon  2284    2286    .   +   0   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "stop_codon-1"; Parent "nbis-rna-1"; name "CE1_30";
scaffold_1  JGI stop_codon  4994    4996    .   +   0   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "stop_codon-2"; Parent "nbis-rna-1"; name "fgenesh1_kg.1_#_2_#_TRINITY_DN906_c0_g1_i1";
scaffold_1  JGI stop_codon  5866    5868    .   -   0   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "stop_codon-3"; Parent "nbis-rna-1"; name "MIX1_22_76";
scaffold_1  JGI stop_codon  8924    8926    .   -   0   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "stop_codon-4"; Parent "nbis-rna-1"; name "MIX4_47_93";
scaffold_1  JGI stop_codon  13600   13602   .   +   0   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "stop_codon-5"; Parent "nbis-rna-1"; name "MIX8_23_76";
scaffold_1  JGI stop_codon  13676   13678   .   -   0   gene_id "nbis-gene-2"; transcript_id "nbis-rna-1"; ID "stop_codon-6"; Parent "nbis-rna-1"; name "fgenesh1_kg.1_#_7_#_TRINITY_DN851_c0_g1_i1";

Everything is collected under the same gene and transcript umbrella...

Could you help me to understand what is going on?

Juke34 commented 9 months ago

The problem is not related to the conversion itself but to how the input file is processed, and to the poorly formatted input. No worries, it is possible to correct that with AGAT, but let's first explain what is going on:

There are in theory 3 types of features: level1 features (e.g. gene), level2 features (e.g. mRNA) and level3 features (CDS, exon, start codon, stop codon). Features can be linked to each other to represent a record (e.g. the exon/CDS, mRNA and gene of a locus). The relationship between the features of a record is made via the ID/Parent attributes in GFF and the gene_id/transcript_id attributes in GTF. Your file does not follow these specifications.

1) Your input file contains only level3 features.

2) You do not have a proper relationship between features (ID/Parent attributes in GFF or gene_id/transcript_id attributes in GTF).
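To check this on your own file, a minimal Python sketch like the one below lists which attribute keys appear on each feature type. It is not part of AGAT; the function names `parse_attributes` and `keys_by_feature_type` are made up for this illustration, and the sample is three lines copied from the GTF above.

```python
from collections import defaultdict

# Three lines copied from the GTF in the question (whitespace-separated here).
SAMPLE = '''\
scaffold_1 JGI exon 881 1130 . + . name "CE1_30"; transcriptId 258
scaffold_1 JGI CDS 1318 2286 . + 0 name "CE1_30"; proteinId 2; exonNumber 1
scaffold_1 JGI start_codon 1318 1320 . + 0 name "CE1_30"
'''

# Attributes a well-formed GTF record is expected to carry.
REQUIRED = {"gene_id", "transcript_id"}


def parse_attributes(text):
    """Parse 'key value; key "value"' pairs from a GTF attribute column."""
    attrs = {}
    for part in text.split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


def keys_by_feature_type(gtf_text):
    """Return {feature_type: set of attribute keys seen on that type}."""
    seen = defaultdict(set)
    for line in gtf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        # 8 fixed columns, then everything else is the attribute column.
        cols = line.split(None, 8)
        seen[cols[2]].update(parse_attributes(cols[8]))
    return dict(seen)


for ftype, keys in keys_by_feature_type(SAMPLE).items():
    print(ftype, sorted(keys), "missing:", sorted(REQUIRED - keys))
```

On these three lines, every feature type is reported as missing both gene_id and transcript_id.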

i) While parsing, AGAT tries to use the ID/Parent or gene_id/transcript_id attributes, but it fails.

ii) It then tries to use a shared attribute that allows features to be grouped. Which attribute? It is specified at the beginning of the log:

=> Attribute used to group features when no Parent/ID relationship exists (i.e common tag):
    * locus_tag
    * gene_id

Unfortunately, none of these attributes is in your file, so this fails too.

iii) AGAT then parses the file sequentially. Each level3 feature is attached to the last level2 feature encountered, and each level2 feature is attached to the last level1 feature encountered. Unfortunately, you do not have any level1 or level2 features. So what happens? AGAT creates on the fly a level1 feature (gene) and a level2 feature (mRNA) and attaches everything to them, which is not what you want to achieve.
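The effect of steps ii) and iii) can be pictured with a toy model in Python. This is only a sketch of the behaviour described above, not AGAT's actual code: the function `group_features`, the dictionary layout and the `catch-all` key are invented for the illustration. Only the two default tags come from the log.

```python
# Default grouping tags as quoted in the AGAT log above.
COMMON_TAGS = ["locus_tag", "gene_id"]


def group_features(features, common_tags=COMMON_TAGS):
    """Toy model: group level3 features by the first common tag they carry.

    Features carrying none of the tags all end up in one catch-all record,
    mimicking the single gene/mRNA created on the fly.
    """
    records = {}
    for feat in features:
        key = "catch-all"  # hypothetical name for the on-the-fly record
        for tag in common_tags:
            if tag in feat["attrs"]:
                key = f"{tag}={feat['attrs'][tag]}"
                break
        records.setdefault(key, []).append(feat)
    return records


# Three exons taken from the GTF in the question.
features = [
    {"type": "exon", "start": 881, "end": 1130,
     "attrs": {"name": "CE1_30", "transcriptId": "258"}},
    {"type": "exon", "start": 3466, "end": 3999,
     "attrs": {"name": "fgenesh1_kg.1_#_2_#_TRINITY_DN906_c0_g1_i1",
               "transcriptId": "183672"}},
    {"type": "exon", "start": 5866, "end": 7045,
     "attrs": {"name": "MIX1_22_76", "transcriptId": "232061"}},
]

default_groups = group_features(features)
by_transcript = group_features(features, ["transcriptId"])
print(len(default_groups), "record(s) with the default tags")
print(len(by_transcript), "record(s) when grouping on transcriptId")
```

With the default tags the three exons collapse into a single record, while grouping on transcriptId keeps them apart.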

Solution: If you know your species and your file, you should know whether your file contains isoforms. This information is important. By modifying the config file you can tell AGAT to use another attribute to group features properly, e.g. transcriptId. Using this attribute will work well but will not allow isoforms. Why? Because you will get one gene per transcript. There is no way to collect several transcripts under the same gene, because this information is missing. You might use the name attribute to group several transcripts with the same name under the same gene feature. But in that case you must be sure that no name is shared between transcripts that have no link to each other (e.g. on different chromosomes).
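Before choosing between transcriptId and name, it may help to audit how the two attributes relate in your file. The Python sketch below is one way to do that, assuming you have already extracted (sequence, name, transcriptId) triples from the exon lines. The function `audit_names` and the rows with `geneA`/`geneB` are hypothetical and only there to show both situations; the first row is from the GTF above.

```python
from collections import defaultdict


def audit_names(rows):
    """rows: iterable of (seqid, name, transcript_id) taken from exon lines.

    Returns two dicts:
      - names carrying several transcriptIds (possible isoforms),
      - names seen on several sequences (unsafe to group on).
    """
    transcript_ids = defaultdict(set)
    seqids = defaultdict(set)
    for seqid, name, tid in rows:
        transcript_ids[name].add(tid)
        seqids[name].add(seqid)
    isoform_candidates = {n: sorted(t) for n, t in transcript_ids.items() if len(t) > 1}
    shared_across_seqs = {n: sorted(s) for n, s in seqids.items() if len(s) > 1}
    return isoform_candidates, shared_across_seqs


rows = [
    ("scaffold_1", "CE1_30", "258"),  # from the GTF in the question
    ("scaffold_1", "geneA", "t1"),    # hypothetical: two transcripts, same name
    ("scaffold_1", "geneA", "t2"),
    ("scaffold_1", "geneB", "t3"),    # hypothetical: same name on two scaffolds
    ("scaffold_7", "geneB", "t4"),
]

isoform_candidates, shared_across_seqs = audit_names(rows)
print("possible isoforms:", isoform_candidates)
print("names shared across sequences:", shared_across_seqs)
```

If the first dictionary is empty, grouping on transcriptId loses nothing. If the second one is not empty, grouping on name would merge unrelated transcripts.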