NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit
GNU General Public License v3.0

Bug when file parsed with GFF2 and GFF2.5 #448

Closed by Juke34 2 months ago

Juke34 commented 2 months ago

When a file is parsed with the GFF2 or GFF2.5 setting, warnings like this one are thrown:

WARNING level1: This feature level1 is not a duplicate but has an ID already used.
/!\ AGAT might mix up the child features and create chimeric records.
Indeed we changed the ID for this L1 feature to be unique but we do not
change the Parent attribute of the child features to reflect this change.
Why? because we do not know to which L1 the child feature was part-of because several Parent have similar ID.
@ the feature is:
chr10p ambMex60DD  gene      770040 1018740              1000      +             .               ID "agat-gene-4"  ; gene_id 05 ; gene_name AMEX60DD000005

An example of input and output is shown below:

INPUT:

chr10p ambMex60DD  gene      770040 1018740              1000      +             .               gene_id "AMEX60DD000005"; gene_name "AMEX60DD000005";
chr10p ambMex60DD  transcript           770040 1018740              1000      +             .               gene_id "AMEX60DD000005"; transcript_id "AMEX60DD201000005.1"; gene_name "AMEX60DD201000005.1";
chr10p ambMex60DD  exon      770040 770424 1000      +             .               gene_id "AMEX60DD000005"; transcript_id "AMEX60DD201000005.1"; exon_number "1";
chr10p ambMex60DD  exon      801485 801606 1000      +             .               gene_id "AMEX60DD000005"; transcript_id "AMEX60DD201000005.1"; exon_number "2";
chr10p ambMex60DD  exon      915118 915167 1000      +             .               gene_id "AMEX60DD000005"; transcript_id "AMEX60DD201000005.1"; exon_number "3";
chr10p ambMex60DD  exon      1018684              1018740              1000      +             .               gene_id "AMEX60DD000005"; transcript_id "AMEX60DD201000005.1"; exon_number "4";
chr10p ambMex60DD  transcript           770040 961083 1000      +             .               gene_id "AMEX60DD000005"; transcript_id "AMEX60DD201000005.2"; gene_name "AMEX60DD201000005.2"; ORF_type "Predicted";
chr10p ambMex60DD  exon      770040 770414 1000      +             .               gene_id "AMEX60DD000005"; transcript_id "AMEX60DD201000005.2"; exon_number "1";
chr10p ambMex60DD  exon      801485 801606 1000      +             .               gene_id "AMEX60DD000005"; transcript_id "AMEX60DD201000005.2"; exon_number "2";
chr10p ambMex60DD  exon      877425 877977 1000      +             .               gene_id "AMEX60DD000005"; transcript_id "AMEX60DD201000005.2"; exon_number "3";
chr10p ambMex60DD  exon      915118 915167 1000      +             .               gene_id "AMEX60DD000005"; transcript_id "AMEX60DD201000005.2"; exon_number "4";
chr10p ambMex60DD  exon      960764 961083 1000      +             .               gene_id "AMEX60DD000005"; transcript_id "AMEX60DD201000005.2"; exon_number "5";

OUTPUT:

chr10p ambMex60DD  gene      770040 1018740              1000      +             .               gene_id "AMEX60DD000005"; ID "agat-gene-4"; gene_name "AMEX60DD000005";
chr10p AGAT    RNA       770040 1018740              .               +             .               gene_id "AMEX60DD000005"; transcript_id "AMEX60DD201000005.1"; ID "AMEX60DD201000005.1"; Parent "agat-gene-4"; exon_number "1";
chr10p ambMex60DD  exon      770040 770424 1000      +             .               gene_id "AMEX60DD000005"; transcript_id "AMEX60DD201000005.1"; ID "agat-exon-13"; Parent "AMEX60DD201000005.1"; exon_number "1";
chr10p ambMex60DD  exon      801485 801606 1000      +             .               gene_id "AMEX60DD000005"; transcript_id "AMEX60DD201000005.1"; ID "agat-exon-14"; Parent "AMEX60DD201000005.1"; exon_number "2";
chr10p ambMex60DD  exon      915118 915167 1000      +             .               gene_id "AMEX60DD000005"; transcript_id "AMEX60DD201000005.1"; ID "agat-exon-15"; Parent "AMEX60DD201000005.1"; exon_number "3";
chr10p ambMex60DD  exon      1018684              1018740              1000      +             .               gene_id "AMEX60DD000005"; transcript_id "AMEX60DD201000005.1"; ID "agat-exon-16"; Parent "AMEX60DD201000005.1"; exon_number "4";
chr10p AGAT    gene      770040 1018740              .               +             .               gene_id "AMEX60DD000005"; transcript_id "AMEX60DD201000005.1"; ID "AMEX60DD000005"; gene_name "AMEX60DD201000005.1";
chr10p ambMex60DD  transcript           770040 961083 1000      +             .               gene_id "AMEX60DD000005"; transcript_id "AMEX60DD201000005.2"; ID "AMEX60DD201000005.2"; ORF_type "Predicted"; Parent "AMEX60DD000005"; gene_name "AMEX60DD201000005.2";
chr10p ambMex60DD  exon      770040 770414 1000      +             .               gene_id "AMEX60DD000005"; transcript_id "AMEX60DD201000005.2"; ID "agat-exon-17"; Parent "AMEX60DD201000005.2"; exon_number "1";
chr10p ambMex60DD  exon      801485 801606 1000      +             .               gene_id "AMEX60DD000005"; transcript_id "AMEX60DD201000005.2"; ID "agat-exon-18"; Parent "AMEX60DD201000005.2"; exon_number "2";
chr10p ambMex60DD  exon      877425 877977 1000      +             .               gene_id "AMEX60DD000005"; transcript_id "AMEX60DD201000005.2"; ID "agat-exon-19"; Parent "AMEX60DD201000005.2"; exon_number "3";
chr10p ambMex60DD  exon      915118 915167 1000      +             .               gene_id "AMEX60DD000005"; transcript_id "AMEX60DD201000005.2"; ID "agat-exon-20"; Parent "AMEX60DD201000005.2"; exon_number "4";
chr10p ambMex60DD  exon      960764 961083 1000      +             .               gene_id "AMEX60DD000005"; transcript_id "AMEX60DD201000005.2"; ID "agat-exon-21"; Parent "AMEX60DD201000005.2"; exon_number "5";
chr10p ambMex60DD  transcript           770040 1018740              1000      +             .               gene_id "AMEX60DD000005"; transcript_id "AMEX60DD201000005.1"; ID "agat-transcript-3"; Parent "AMEX60DD000005"; gene_name "AMEX60DD201000005.1";

Juke34 commented 2 months ago

The error has been identified. It is caused by an extra, empty ID attribute that BioPerl adds when parsing with the GFF2 or GFF2.5 parameter. The empty ID is problematic because AGAT handles it with priority over the gene_id attribute, so features end up sharing the same "empty" ID. To avoid the underlying problem, a check should be added to AGAT that removes this extra ID, which is not supposed to be there.
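
To illustrate the idea, here is a minimal sketch in Python. It is not AGAT or BioPerl code (AGAT is written in Perl), and all function names are hypothetical. It assumes the spurious attribute shows up as an `ID` tag with an empty value, and that the identifier is picked by trying `ID` first and `gene_id` second, as described above.

```python
# Illustrative sketch only: hypothetical helpers, not AGAT's actual code.

def parse_attributes(column9):
    """Parse a GTF-style attribute column into a dict of tag -> value."""
    attributes = {}
    for field in column9.split(";"):
        field = field.strip()
        if not field:
            continue
        parts = field.split(None, 1)  # split tag from value on first whitespace
        tag = parts[0]
        value = parts[1].strip().strip('"') if len(parts) > 1 else ""
        attributes[tag] = value
    return attributes


def drop_empty_id(attributes):
    """Return a copy of the attributes without an ID tag whose value is empty."""
    return {
        tag: value
        for tag, value in attributes.items()
        if not (tag == "ID" and value == "")
    }


def pick_identifier(attributes, priority=("ID", "gene_id")):
    """Return the value of the first tag present, trying ID before gene_id."""
    for tag in priority:
        if tag in attributes:
            return attributes[tag]
    return None


# Assumed shape of the attributes after parsing: an extra, empty ID.
raw = 'ID ""; gene_id "AMEX60DD000005"; gene_name "AMEX60DD000005";'
attrs = parse_attributes(raw)

before = pick_identifier(attrs)                # empty ID wins over gene_id
after = pick_identifier(drop_empty_id(attrs))  # gene_id is used instead

print(repr(before))  # ''
print(after)         # AMEX60DD000005
```

Without the check, every feature carrying the empty `ID` resolves to the same empty identifier, which matches the collisions seen in the warning. With the check, the lookup falls through to `gene_id`.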