Closed fpichon closed 9 months ago
This can happen when default information is missing from the GFF file and AGAT is not able to group features together with its default behavior. Using a more recent version of AGAT would help to identify this type of case (the information would be among the first few output lines).
It can also happen if the same identifier is used for different genes on different chromosomes (which is strictly forbidden by the format specification).
Could you send over the file, or an example that I could look at more closely?
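The duplicate-identifier case mentioned above can be checked before running AGAT. The following is a minimal sketch, not part of AGAT: the function name `find_ids_on_multiple_seqids` is hypothetical, and it assumes a tab-delimited GFF3 file with `key=value` attributes in column 9. It only reports IDs that occur on more than one sequence, since GFF3 allows a discontinuous feature (for example a multi-line CDS) to repeat its ID on the same sequence.

```python
from collections import defaultdict


def find_ids_on_multiple_seqids(gff_lines):
    """Return {ID: sorted list of seqids} for IDs seen on more than one sequence."""
    seqids_by_id = defaultdict(set)
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a feature line (assumes tab-delimited GFF3)
        # Parse column 9 ("key=value;key=value") into a dict.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        feature_id = attrs.get("ID")
        if feature_id is not None:
            seqids_by_id[feature_id].add(cols[0])
    return {
        fid: sorted(seqids)
        for fid, seqids in seqids_by_id.items()
        if len(seqids) > 1
    }
```

Usage could look like `with open("annotation.gff3") as fh: print(find_ids_on_multiple_seqids(fh))`, where the file name is a placeholder.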
The GFF file was generated by Augustus, but I then used a script to replace the gene predictions with gene-name annotations. I admit I didn't take into account the case where two different genes have the same name, though. I will correct my script.
In any case, I would prefer not to share my file, but I can share the particular example with EFT1. Indeed, we can see that two genes located on different chromosomes were fused. Here is the original GFF file:
ctg000020_np1212 AUGUSTUS gene 489924 492452 0.96 - . ID=EFT1
ctg000020_np1212 AUGUSTUS transcript 489924 492452 0.96 - . ID=EFT1.t1;Parent=EFT1
ctg000020_np1212 AUGUSTUS stop_codon 489924 489926 . - 0 Parent=EFT1.t1
ctg000020_np1212 AUGUSTUS CDS 489927 492452 0.96 - 0 ID=EFT1.t1.cds;Parent=EFT1.t1
ctg000020_np1212 AUGUSTUS start_codon 492450 492452 . - 0 Parent=EFT1.t1
ctg000030_np1212 AUGUSTUS gene 1190115 1192643 1 + . ID=EFT1
ctg000030_np1212 AUGUSTUS transcript 1190115 1192643 1 + . ID=EFT1.t1;Parent=EFT1
ctg000030_np1212 AUGUSTUS start_codon 1190115 1190117 . + 0 Parent=EFT1.t1
ctg000030_np1212 AUGUSTUS CDS 1190115 1192640 1 + 0 ID=EFT1.t1.cds;Parent=EFT1.t1
ctg000030_np1212 AUGUSTUS stop_codon 1192641 1192643 . + 0 Parent=EFT1.t1
After conversion with AGAT, this becomes:
ctg000020_np1212 AUGUSTUS gene 489924 1192643 0.96 - . gene_id "EFT1"; ID "EFT1";
ctg000020_np1212 AUGUSTUS transcript 489924 1192643 0.96 - . gene_id "EFT1"; transcript_id "EFT1.t1"; ID "EFT1.t1"; Parent "EFT1";
ctg000020_np1212 AUGUSTUS exon 489924 492452 0.96 - . gene_id "EFT1"; transcript_id "EFT1.t1"; ID "nbis-exon-1125"; Parent "EFT1.t1";
ctg000020_np1212 AUGUSTUS exon 1190115 1192643 0.96 - . gene_id "EFT1"; transcript_id "EFT1.t1"; ID "nbis-exon-1126"; Parent "EFT1.t1";
ctg000020_np1212 AUGUSTUS CDS 489927 492452 0.96 - 0 gene_id "EFT1"; transcript_id "EFT1.t1"; ID "EFT1.t1.cds"; Parent "EFT1.t1";
ctg000030_np1212 AUGUSTUS CDS 1190115 1192640 1 + 0 gene_id "EFT1"; transcript_id "EFT1.t1"; ID "EFT1.t1.cds"; Parent "EFT1.t1";
ctg000020_np1212 AUGUSTUS five_prime_utr 1192641 1192643 0.96 - . gene_id "EFT1"; transcript_id "EFT1.t1"; ID "nbis-five_prime_utr-5"; Parent "EFT1.t1"; original_biotype "five_prime_UTR";
ctg000020_np1212 AUGUSTUS start_codon 492450 492452 . - 0 gene_id "EFT1"; transcript_id "EFT1.t1"; ID "start_codon-2028"; Parent "EFT1.t1";
ctg000030_np1212 AUGUSTUS start_codon 1190115 1190117 . + 0 gene_id "EFT1"; transcript_id "EFT1.t1"; ID "start_codon-4963"; Parent "EFT1.t1";
ctg000020_np1212 AUGUSTUS stop_codon 489924 489926 . - 0 gene_id "EFT1"; transcript_id "EFT1.t1"; ID "stop_codon-2027"; Parent "EFT1.t1";
ctg000030_np1212 AUGUSTUS stop_codon 1192641 1192643 . + 0 gene_id "EFT1"; transcript_id "EFT1.t1"; ID "stop_codon-4962"; Parent "EFT1.t1";
ctg000020_np1212 AUGUSTUS three_prime_utr 489924 489926 0.96 - . gene_id "EFT1"; transcript_id "EFT1.t1"; ID "nbis-three_prime_utr-16"; Parent "EFT1.t1"; original_biotype "three_prime_UTR";
ctg000030_np1212 AUGUSTUS transcript 1190115 1192643 1 + . gene_id "EFT1"; transcript_id "nbis-transcript-145"; ID "nbis-transcript-145"; Parent "EFT1";
Thanks for your fast answer and clarification! :)
EDIT: I also saw this thread https://github.com/Gaius-Augustus/Augustus/issues/74 where they advise not to use --gff3 with Augustus.
The ID attribute must be unique for most features. If you used the name as the ID, then you probably broke your file, because many genes share the same name (on the same and/or different contigs). The name must stay in its own dedicated attribute.
I corrected my script to produce unique IDs, and the output file is now correct after AGAT conversion.
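For readers facing the same problem, here is one possible way to do such a correction. It is a sketch under stated assumptions, not the script used in this thread: the function `make_ids_unique` is hypothetical, the input is assumed to be tab-delimited GFF3 in which every gene line is directly followed by its child features, and child IDs are assumed to start with the gene ID plus a dot (as in `EFT1.t1` above). Repeated gene IDs get a numeric suffix, and the gene name is kept in the `Name` attribute.

```python
def parse_attrs(col9):
    """Split 'key=value;key=value' into an ordered list of (key, value) pairs."""
    return [tuple(f.split("=", 1)) for f in col9.split(";") if "=" in f]


def make_ids_unique(gff_lines):
    """Return GFF3 lines in which repeated gene IDs are suffixed (_2, _3, ...)."""
    seen = {}         # gene ID -> number of gene records seen so far
    old = new = None  # ID of the current gene before / after renaming
    out = []
    for line in gff_lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 9:
            out.append(line)  # pass comments and non-feature lines through
            continue
        attrs = parse_attrs(cols[8])
        if cols[2] == "gene":
            old = dict(attrs).get("ID")
            if old is not None:
                seen[old] = seen.get(old, 0) + 1
                new = old if seen[old] == 1 else f"{old}_{seen[old]}"
                if "Name" not in dict(attrs):
                    attrs.append(("Name", old))  # keep the gene name separately
        renamed = []
        for key, value in attrs:
            # Rewrite IDs/Parents that belong to the current gene block.
            if (key in ("ID", "Parent") and old is not None
                    and (value == old or value.startswith(old + "."))):
                value = new + value[len(old):]
            renamed.append((key, value))
        cols[8] = ";".join(f"{k}={v}" for k, v in renamed)
        out.append("\t".join(cols))
    return out
```

This sketch does not handle comma-separated multi-parent values, and a generated ID such as `EFT1_2` could still collide with a gene that already carries that name, so the result is worth re-checking for duplicates.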
Describe the bug When using agat_convert_sp_gff2gtf.pl, my output GTF file largely differs from the input GFF file.
To Reproduce
agat_convert_sp_gff2gtf.pl --gff /data/nextpolish_polished_assembly_annotated.gff3 -o /data/nextpolish_polished_assembly_annotated.gtf
Yeast assembly obtained with nextdenovo and polished with nextpolish.
Expected behavior The same features in the output as in the input.
Observed behavior New, very long transcripts appeared that span the whole chromosome (e.g. HPF1, EFT1), and EFT1 has an end coordinate higher than the chromosome end, so downstream programs crash.
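A symptom like the one described above (a feature ending past the chromosome end) can be caught with a simple sanity check before feeding the file to downstream tools. This is a minimal sketch: the function `features_beyond_sequence_end` is hypothetical, the file is assumed to be tab-delimited GFF or GTF (both share the same first eight columns), and `seq_lengths` is an assumed mapping from sequence name to length that would have to be built separately, for example from a FASTA index.

```python
def features_beyond_sequence_end(lines, seq_lengths):
    """Return (seqid, type, start, end, seq_length) for features ending past the sequence."""
    problems = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) < 9:
            continue  # skip comments and non-feature lines
        seqid, ftype = cols[0], cols[2]
        start, end = int(cols[3]), int(cols[4])
        length = seq_lengths.get(seqid)
        # Sequences missing from seq_lengths are ignored rather than reported.
        if length is not None and end > length:
            problems.append((seqid, ftype, start, end, length))
    return problems
```

Any non-empty result points to records worth inspecting in the input annotation as well as in the converted file.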
Screenshots Left arm end (first 2 top tracks are GFF, bottom track is new GTF):
Right arm end (first 2 top tracks are GFF, bottom track is new GTF):
Additional context The genome is yeast; could it be that the same gene is found on another chromosome and the different copies get mixed up?