NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit
GNU General Public License v3.0

agat_convert_sp_gff2gtf.pl generates errors #427

Closed fpichon closed 4 months ago

fpichon commented 4 months ago

Describe the bug When using agat_convert_sp_gff2gtf.pl, my output GTF file differs substantially from the input GFF file.


To Reproduce agat_convert_sp_gff2gtf.pl --gff /data/nextpolish_polished_assembly_annotated.gff3 -o /data/nextpolish_polished_assembly_annotated.gtf

Yeast assembly obtained with nextdenovo and polished with nextpolish.

Expected behavior The output should contain the same features as the input.

Observed behavior New, very long transcripts appeared that span the whole chromosome (e.g. HPF1, EFT1), and EFT1 has an end coordinate beyond the end of the chromosome, so downstream programs crash.

Screenshots Left arm end (first 2 top tracks are GFF, bottom track is new GTF): image

Right arm end (first 2 top tracks are GFF, bottom track is new GTF): image

Additional context The genome is yeast. Could it be that the same gene is found on another chromosome and the different copies get mixed up?

Juke34 commented 4 months ago

This can happen when information that AGAT relies on by default is missing from the GFF file and AGAT cannot group the features together with its default behavior. A more recent version of AGAT would help identify this kind of case (the information would be among the first few lines of output).

It can also happen if the same identifier is used for different genes on different chromosomes (which is strictly forbidden by the format specifications).

Could you send over the file, or an example that I could look at more closely?

fpichon commented 4 months ago

The GFF file was generated by Augustus, but I then used a script to replace the gene prediction identifiers with annotated gene names. I admit I did not account for the case where two different genes have the same name, though. I will correct my script.

In any case, I would prefer not to share my file, but I can share the specific example of EFT1. Indeed, we can see that two genes located on different chromosomes were fused. Here is the original GFF file:

ctg000020_np1212        AUGUSTUS        gene    489924  492452  0.96    -       .       ID=EFT1  

ctg000020_np1212        AUGUSTUS        transcript      489924  492452  0.96    -       .       ID=EFT1.t1;Parent=EFT1

ctg000020_np1212        AUGUSTUS        stop_codon      489924  489926  .       -       0       Parent=EFT1.t1       

ctg000020_np1212        AUGUSTUS        CDS     489927  492452  0.96    -       0       ID=EFT1.t1.cds;Parent=EFT1.t1

ctg000020_np1212        AUGUSTUS        start_codon     492450  492452  .       -       0       Parent=EFT1.t1       

ctg000030_np1212        AUGUSTUS        gene    1190115 1192643 1       +       .       ID=EFT1                      

ctg000030_np1212        AUGUSTUS        transcript      1190115 1192643 1       +       .       ID=EFT1.t1;Parent=EFT1                                                                                                                    

ctg000030_np1212        AUGUSTUS        start_codon     1190115 1190117 .       +       0       Parent=EFT1.t1       

ctg000030_np1212        AUGUSTUS        CDS     1190115 1192640 1       +       0       ID=EFT1.t1.cds;Parent=EFT1.t1

ctg000030_np1212        AUGUSTUS        stop_codon      1192641 1192643 .       +       0       Parent=EFT1.t1

With AGAT, that becomes:

ctg000020_np1212        AUGUSTUS        gene    489924  1192643 0.96    -       .       gene_id "EFT1"; ID "EFT1";   

ctg000020_np1212        AUGUSTUS        transcript      489924  1192643 0.96    -       .       gene_id "EFT1"; transcript_id "EFT1.t1"; ID "EFT1.t1"; Parent "EFT1";                                                                     

ctg000020_np1212        AUGUSTUS        exon    489924  492452  0.96    -       .       gene_id "EFT1"; transcript_id "EFT1.t1"; ID "nbis-exon-1125"; Parent "EFT1.t1";                                                                   

ctg000020_np1212        AUGUSTUS        exon    1190115 1192643 0.96    -       .       gene_id "EFT1"; transcript_id "EFT1.t1"; ID "nbis-exon-1126"; Parent "EFT1.t1";                                                                   

ctg000020_np1212        AUGUSTUS        CDS     489927  492452  0.96    -       0       gene_id "EFT1"; transcript_id "EFT1.t1"; ID "EFT1.t1.cds"; Parent "EFT1.t1";                                                                      

ctg000030_np1212        AUGUSTUS        CDS     1190115 1192640 1       +       0       gene_id "EFT1"; transcript_id "EFT1.t1"; ID "EFT1.t1.cds"; Parent "EFT1.t1";                                                                      

ctg000020_np1212        AUGUSTUS        five_prime_utr  1192641 1192643 0.96    -       .       gene_id "EFT1"; transcript_id "EFT1.t1"; ID "nbis-five_prime_utr-5"; Parent "EFT1.t1"; original_biotype "five_prime_UTR";                 

ctg000020_np1212        AUGUSTUS        start_codon     492450  492452  .       -       0       gene_id "EFT1"; transcript_id "EFT1.t1"; ID "start_codon-2028"; Parent "EFT1.t1";                                                         

ctg000030_np1212        AUGUSTUS        start_codon     1190115 1190117 .       +       0       gene_id "EFT1"; transcript_id "EFT1.t1"; ID "start_codon-4963"; Parent "EFT1.t1";                                                         

ctg000020_np1212        AUGUSTUS        stop_codon      489924  489926  .       -       0       gene_id "EFT1"; transcript_id "EFT1.t1"; ID "stop_codon-2027"; Parent "EFT1.t1";                                                          

ctg000030_np1212        AUGUSTUS        stop_codon      1192641 1192643 .       +       0       gene_id "EFT1"; transcript_id "EFT1.t1"; ID "stop_codon-4962"; Parent "EFT1.t1";                                                          

ctg000020_np1212        AUGUSTUS        three_prime_utr 489924  489926  0.96    -       .       gene_id "EFT1"; transcript_id "EFT1.t1"; ID "nbis-three_prime_utr-16"; Parent "EFT1.t1"; original_biotype "three_prime_UTR";              

ctg000030_np1212        AUGUSTUS        transcript      1190115 1192643 1       +       .       gene_id "EFT1"; transcript_id "nbis-transcript-145"; ID "nbis-transcript-145"; Parent "EFT1";
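The merged gene and transcript coordinates above (489924 to 1192643) are what one would get if both EFT1 records were grouped under one ID and the parent span were taken as the smallest start and largest end of the group. The short sketch below only reproduces that arithmetic from the two gene lines; `merged_span` is a hypothetical helper for illustration and is not AGAT's actual code.

```python
# Start/end of the two "gene" records that share ID=EFT1 in the input GFF.
records = [
    ("ctg000020_np1212", 489924, 492452),
    ("ctg000030_np1212", 1190115, 1192643),
]


def merged_span(records):
    """Return (min start, max end) over all records, ignoring the sequence name."""
    starts = [start for _, start, _ in records]
    ends = [end for _, _, end in records]
    return min(starts), max(ends)


print(merged_span(records))
```

Because the sequence name is ignored once the records share an ID, the resulting end coordinate comes from ctg000030_np1212 but is reported on ctg000020_np1212, which is how a feature can end up beyond the end of its chromosome.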

Thanks for your fast answer and clarification! :)

EDIT: I also saw this thread https://github.com/Gaius-Augustus/Augustus/issues/74 where they advise not to use --gff3 with Augustus.

Juke34 commented 4 months ago

The ID attribute must be unique for most feature types. If you used the gene name as the ID, you have probably corrupted your file, because many genes share the same name (on the same and/or different contigs). The name must stay in a dedicated attribute.
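A quick way to spot this kind of problem before conversion is to scan the GFF for IDs that are reused. The following is a minimal sketch, not part of AGAT: the function names are hypothetical, it assumes tab-separated GFF3 lines with `key=value` attributes, and it assumes that only the feature types listed in `multiline_types` (here just CDS) may legitimately repeat an ID on the same sequence.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Parse a GFF3 attribute column ("key=value;key=value") into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def find_conflicting_ids(gff_lines, multiline_types=("CDS",)):
    """Return {ID: [(seqid, type, start, end), ...]} for IDs that look reused.

    An ID is reported if it occurs on more than one sequence, or if it occurs
    more than once and at least one occurrence is not a multi-line type.
    """
    seen = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        feature_id = parse_attributes(cols[8]).get("ID")
        if feature_id is not None:
            seen[feature_id].append((cols[0], cols[2], int(cols[3]), int(cols[4])))

    conflicts = {}
    for feature_id, records in seen.items():
        seqids = {seqid for seqid, _, _, _ in records}
        repeated = len(records) > 1 and any(
            ftype not in multiline_types for _, ftype, _, _ in records
        )
        if len(seqids) > 1 or repeated:
            conflicts[feature_id] = records
    return conflicts
```

Calling `find_conflicting_ids(open("annotation.gff3"))` (the file name is a placeholder) on a file like the EFT1 example would report EFT1, EFT1.t1 and EFT1.t1.cds, each with one record per contig.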

fpichon commented 4 months ago

I corrected my script to produce unique IDs, and the output file is now correct after AGAT conversion.
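For readers facing the same situation, here is a minimal sketch of one way such a correction could look. It is not the script used in this thread. It assumes tab-separated GFF3 in which each gene line is followed by its own child features (as in the Augustus excerpt above), and that child IDs are the gene ID plus a dot-separated suffix such as `.t1`. The function names and the `_2`, `_3` suffix scheme are hypothetical choices; the original gene name is kept in the GFF3 `Name` attribute.

```python
def _swap_prefix(value, old, new):
    """Rename `old` or `old.<suffix>` to `new` or `new.<suffix>`."""
    if value == old:
        return new
    if value.startswith(old + "."):
        return new + value[len(old):]
    return value


def make_gene_ids_unique(gff_lines):
    """Return GFF3 lines in which repeated gene IDs get a numeric suffix."""
    seen = {}              # original gene ID -> number of occurrences so far
    old, new = None, None  # ID mapping for the gene block being processed
    out = []
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            out.append(line.rstrip("\n"))  # pass through non-feature lines
            continue
        attrs = dict(
            field.split("=", 1)
            for field in cols[8].strip().split(";")
            if "=" in field
        )
        if cols[2] == "gene" and "ID" in attrs:
            old = attrs["ID"]
            seen[old] = seen.get(old, 0) + 1
            new = old if seen[old] == 1 else f"{old}_{seen[old]}"
            attrs.setdefault("Name", old)  # keep the gene name for display
        if old is not None:
            for key in ("ID", "Parent"):
                if key in attrs:
                    attrs[key] = ",".join(
                        _swap_prefix(v, old, new) for v in attrs[key].split(",")
                    )
        cols[8] = ";".join(f"{k}={v}" for k, v in attrs.items())
        out.append("\t".join(cols))
    return out
```

With the EFT1 excerpt, the first gene keeps `ID=EFT1` while the second becomes `ID=EFT1_2;Name=EFT1`, with its children renamed to `EFT1_2.t1` and `EFT1_2.t1.cds`. The sketch does not check whether a generated ID such as `EFT1_2` collides with an existing gene name, so a duplicate-ID check on the result is still worthwhile.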