NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit
GNU General Public License v3.0

agat_sp_manage_IDs.pl changing the gene number randomly after adding suffix. #411

Closed kashiff007 closed 10 months ago

kashiff007 commented 10 months ago

My gff file looks like this:

chr1A   AUGUSTUS        gene    23087   24487   0.86    +       .       ID=g1
chr1A   AUGUSTUS        transcript      23087   24487   0.86    +       .       ID=g1.t1;Parent=g1;HintSupport=0.00
chr1A   AUGUSTUS        start_codon     23087   23089   .       +       0       Parent=g1.t1
chr1A   AUGUSTUS        CDS     23087   24487   0.86    +       0       ID=g1.t1.cds;Parent=g1.t1
chr1A   AUGUSTUS        stop_codon      24485   24487   .       +       0       Parent=g1.t1
chr1A   AUGUSTUS        gene    24672   25920   0.17    +       .       ID=g2
chr1A   AUGUSTUS        transcript      24672   25920   0.17    +       .       ID=g2.t1;Parent=g2;HintSupport=0.00
chr1A   AUGUSTUS        start_codon     24672   24674   .       +       0       Parent=g2.t1
chr1A   AUGUSTUS        intron  25162   25261   0.23    +       .       Parent=g2.t1
chr1A   AUGUSTUS        CDS     24672   25161   0.35    +       0       ID=g2.t1.cds;Parent=g2.t1
chr1A   AUGUSTUS        CDS     25262   25920   0.23    +       2       ID=g2.t1.cds;Parent=g2.t1
chr1A   AUGUSTUS        stop_codon      25918   25920   .       +       0       Parent=g2.t1

I want to add a prefix before the gene names and keep the last part as in the original, using agat_sp_manage_IDs.pl (version v0.8.0). I used the following command:

agat_sp_manage_IDs.pl -f A.gff --prefix Dis.W6-48549-006.v1.___ --tair --type_dependent -o A_new_rename.gff --ensembl

Two major problems occur when running it:

  1. The prefix is added, but the original trailing numbers are replaced with seemingly random numbers. Every run produces different numbers.
  2. The output file is not sorted in the same order as the input.

I also tried without the --tair option; it produces the same result.

The output looks like:

chr1A   AUGUSTUS        gene    23087   24487   0.86    +       .       ID=Dis.W6-48549-006.v1.___G00000044886
chr1A   AUGUSTUS        transcript      23087   24487   0.86    +       .       ID=Dis.W6-48549-006.v1.___G00000044886.1;Parent=Dis.W6-48549-006.v1.___G00000044886;hintSupport=0.00
chr1A   AUGUSTUS        exon    23087   24487   0.86    +       .       ID=Dis.W6-48549-006.v1.___G00000044886.1-exon1;Parent=Dis.W6-48549-006.v1.___G00000044886.1
chr1A   AUGUSTUS        CDS     23087   24487   0.86    +       0       ID=Dis.W6-48549-006.v1.___G00000044886.1-cds1;Parent=Dis.W6-48549-006.v1.___G00000044886.1
chr1A   AUGUSTUS        start_codon     23087   23089   .       +       0       ID=Dis.W6-48549-006.v1.___G00000044886.1-start_codon1;Parent=Dis.W6-48549-006.v1.___G00000044886.1
chr1A   AUGUSTUS        stop_codon      24485   24487   .       +       0       ID=Dis.W6-48549-006.v1.___G00000044886.1-stop_codon1;Parent=Dis.W6-48549-006.v1.___G00000044886.1
chr1A   AUGUSTUS        gene    24672   25920   0.17    +       .       ID=Dis.W6-48549-006.v1.___G00000044887
chr1A   AUGUSTUS        transcript      24672   25920   0.17    +       .       ID=Dis.W6-48549-006.v1.___G00000044887.1;Parent=Dis.W6-48549-006.v1.___G00000044887;hintSupport=0.00
chr1A   AUGUSTUS        exon    24672   25161   0.35    +       .       ID=Dis.W6-48549-006.v1.___G00000044887.1-exon1;Parent=Dis.W6-48549-006.v1.___G00000044887.1
chr1A   AUGUSTUS        exon    25262   25920   0.35    +       .       ID=Dis.W6-48549-006.v1.___G00000044887.1-exon2;Parent=Dis.W6-48549-006.v1.___G00000044887.1
chr1A   AUGUSTUS        CDS     24672   25161   0.35    +       0       ID=Dis.W6-48549-006.v1.___G00000044887.1-cds1;Parent=Dis.W6-48549-006.v1.___G00000044887.1
chr1A   AUGUSTUS        CDS     25262   25920   0.23    +       2       ID=Dis.W6-48549-006.v1.___G00000044887.1-cds2;Parent=Dis.W6-48549-006.v1.___G00000044887.1
chr1A   AUGUSTUS        intron  25162   25261   0.23    +       .       ID=Dis.W6-48549-006.v1.___G00000044887.1-intron1;Parent=Dis.W6-48549-006.v1.___G00000044887.1
chr1A   AUGUSTUS        start_codon     24672   24674   .       +       0       ID=Dis.W6-48549-006.v1.___G00000044887.1-start_codon1;Parent=Dis.W6-48549-006.v1.___G00000044887.1
chr1A   AUGUSTUS        stop_codon      25918   25920   .       +       0       ID=Dis.W6-48549-006.v1.___G00000044887.1-stop_codon1;Parent=Dis.W6-48549-006.v1.___G00000044887.1

Could you suggest the possible reason for this?

My expected output looks like this:

chr1A   AUGUSTUS        gene    23087   24487   0.86    +       .       ID=Dis.W6-48549-006.v1.___G0000001
chr1A   AUGUSTUS        transcript      23087   24487   0.86    +       .       ID=Dis.W6-48549-006.v1.___G0000001.1;Parent=Dis.W6-48549-006.v1.___G0000001;hintSupport=0.00
chr1A   AUGUSTUS        exon    23087   24487   0.86    +       .       ID=Dis.W6-48549-006.v1.___G0000001.1-exon1;Parent=Dis.W6-48549-006.v1.___G0000001.1
chr1A   AUGUSTUS        CDS     23087   24487   0.86    +       0       ID=Dis.W6-48549-006.v1.___G0000001.1-cds1;Parent=Dis.W6-48549-006.v1.___G0000001.1
chr1A   AUGUSTUS        start_codon     23087   23089   .       +       0       ID=Dis.W6-48549-006.v1.___G0000001.1-start_codon1;Parent=Dis.W6-48549-006.v1.___G0000001.1
chr1A   AUGUSTUS        stop_codon      24485   24487   .       +       0       ID=Dis.W6-48549-006.v1.___G0000001.1-stop_codon1;Parent=Dis.W6-48549-006.v1.___G0000001.1
chr1A   AUGUSTUS        gene    24672   25920   0.17    +       .       ID=Dis.W6-48549-006.v1.___G0000002
chr1A   AUGUSTUS        transcript      24672   25920   0.17    +       .       ID=Dis.W6-48549-006.v1.___G0000002.1;Parent=Dis.W6-48549-006.v1.___G0000002;hintSupport=0.00
chr1A   AUGUSTUS        exon    24672   25161   0.35    +       .       ID=Dis.W6-48549-006.v1.___G0000002.1-exon1;Parent=Dis.W6-48549-006.v1.___G0000002.1
chr1A   AUGUSTUS        exon    25262   25920   0.35    +       .       ID=Dis.W6-48549-006.v1.___G0000002.1-exon2;Parent=Dis.W6-48549-006.v1.___G0000002.1
chr1A   AUGUSTUS        CDS     24672   25161   0.35    +       0       ID=Dis.W6-48549-006.v1.___G0000002.1-cds1;Parent=Dis.W6-48549-006.v1.___G0000002.1
chr1A   AUGUSTUS        CDS     25262   25920   0.23    +       2       ID=Dis.W6-48549-006.v1.___G0000002.1-cds2;Parent=Dis.W6-48549-006.v1.___G0000002.1
chr1A   AUGUSTUS        intron  25162   25261   0.23    +       .       ID=Dis.W6-48549-006.v1.___G0000002.1-intron1;Parent=Dis.W6-48549-006.v1.___G0000002.1
chr1A   AUGUSTUS        start_codon     24672   24674   .       +       0       ID=Dis.W6-48549-006.v1.___G0000002.1-start_codon1;Parent=Dis.W6-48549-006.v1.___G0000002.1
chr1A   AUGUSTUS        stop_codon      25918   25920   .       +       0       ID=Dis.W6-48549-006.v1.___G0000002.1-stop_codon1;Parent=Dis.W6-48549-006.v1.___G0000002.1

Juke34 commented 10 months ago

Hi, thank you for using AGAT and for your feedback.

Some sorting fixes have been added in more recent versions. Could you try the latest version, v1.2? It might fix the problem of every run producing different numbers. If it does not, I will have to investigate the issue more deeply.

The order of feature types can differ between input and output (e.g. start_codon comes earlier in your input), but the information is the same. Without tabix activated, AGAT prints the gene first (level1), then the transcript (level2), and then the subfeatures in this order: tss >> exon >> cds >> tts >> any other level3 features in alphabetical order. So start_codon will appear in a different position than in your input file. If the order of the features really matters, we could consider updating AGAT so that the desired order can be passed via the config file.
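
The level3 ordering described above can be sketched as a sort key. This is only an illustration in Python of the rule as stated in this comment, not AGAT's actual Perl code; the names `PRIORITY`, `level3_sort_key` and `order_level3` are made up for the example.

```python
# Illustrative sketch of the described level3 ordering; not taken from AGAT itself.
PRIORITY = ["tss", "exon", "cds", "tts"]  # feature types with a fixed position


def level3_sort_key(feature_type):
    """Put tss, exon, cds, tts first (in that order), then everything else alphabetically."""
    name = feature_type.lower()
    if name in PRIORITY:
        return (0, PRIORITY.index(name), "")
    return (1, 0, name)


def order_level3(feature_types):
    """Return the feature types in the order described in the comment above."""
    return sorted(feature_types, key=level3_sort_key)


# Feature types in the order they appear under g2.t1 in the input file.
input_order = ["start_codon", "intron", "CDS", "stop_codon", "exon"]
print(order_level3(input_order))
# ['exon', 'CDS', 'intron', 'start_codon', 'stop_codon']
```

The result matches the feature order seen in the output posted above, where exon and CDS come before intron, start_codon and stop_codon.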

Keep me informed. Best regards

Neato-Nick commented 10 months ago

@Juke34 Unfortunately, it's not fixed in the latest version. I was about to open a new issue but found this one instead. My problem is not the ordering of features in the output; it's that the number suffix being added is random.

I'm using the latest version, 1.2, with the Singularity image. I had a similar problem way back (https://github.com/NBISweden/AGAT/issues/143), but there's no container for 0.7, the version where you fixed it. So I tried v0.8 and had the same problem as I do with v1.2.

I can help you narrow down the issue: I've noticed that the numbering breaks between chromosomes. Within a chromosome, the sequential numbering is accurate. Here's a reproducible example. In this example, test.gff3 was output from another AGAT script, so all the features are already in the expected order.

> cat test.gff3
##gff-version 3
CP060339.1  Liftoff gene    2655    3026    .   +   .   ID=B9J08_004102;Name=hypothetical protein
CP060339.1  Liftoff mRNA    2655    3026    .   +   .   ID=B9J08_004102T0;Parent=B9J08_004102;Name=hypothetical protein
CP060339.1  Liftoff exon    2655    3026    .   +   0   ID=B9J08_004102.exon1;Parent=B9J08_004102T0;Name=g3982.t1:CDS:1
CP060339.1  Liftoff CDS 2655    3026    .   +   0   ID=cds.B9J08_004102;Parent=B9J08_004102T0;Name=g3982.t1:CDS:1
CP060339.1  Liftoff gene    5717    6577    .   -   .   ID=B9J08_004101;Name=hypothetical protein
CP060339.1  Liftoff mRNA    5717    6577    .   -   .   ID=B9J08_004101T0;Parent=B9J08_004101;Name=hypothetical protein
CP060339.1  Liftoff exon    5717    6577    .   -   0   ID=B9J08_004101.exon1;Parent=B9J08_004101T0;Name=g3981.t1:CDS:1
CP060339.1  Liftoff CDS 5717    6577    .   -   0   ID=cds.B9J08_004101;Parent=B9J08_004101T0;Name=g3981.t1:CDS:1
CP060339.1  Liftoff gene    7933    10269   .   -   .   ID=B9J08_004100;Name=hypothetical protein
CP060339.1  Liftoff mRNA    7933    10269   .   -   .   ID=B9J08_004100T0;Parent=B9J08_004100;Name=hypothetical protein
CP060339.1  Liftoff exon    7933    10269   .   -   0   ID=B9J08_004100.exon1;Parent=B9J08_004100T0;Name=g3980.t1:CDS:1
CP060339.1  Liftoff CDS 7933    10269   .   -   0   ID=cds.B9J08_004100;Parent=B9J08_004100T0;Name=g3980.t1:CDS:1
CP060340.1  Liftoff gene    7166    7537    .   +   .   ID=B9J08_001054;Name=hypothetical protein
CP060340.1  Liftoff mRNA    7166    7537    .   +   .   ID=B9J08_001054T0;Parent=B9J08_001054;Name=hypothetical protein
CP060340.1  Liftoff exon    7166    7537    .   +   0   ID=B9J08_001054.exon1;Parent=B9J08_001054T0;Name=g1021.t1:CDS:1
CP060340.1  Liftoff CDS 7166    7537    .   +   0   ID=cds.B9J08_001054;Parent=B9J08_001054T0;Name=g1021.t1:CDS:1
CP060345.1  Liftoff gene    763950  764423  .   +   .   ID=B9J08_002579;Name=hypothetical protein
CP060341.1  Liftoff gene    5563    7965    .   +   .   ID=B9J08_001529;Name=hypothetical protein
CP060341.1  Liftoff mRNA    5563    7965    .   +   .   ID=B9J08_001529T0;Parent=B9J08_001529;Name=hypothetical protein
CP060341.1  Liftoff exon    5563    7965    .   +   0   ID=B9J08_001529.exon1;Parent=B9J08_001529T0;Name=g1476.t1:CDS:1
CP060341.1  Liftoff CDS 5563    7965    .   +   0   ID=cds.B9J08_001529;Parent=B9J08_001529T0;Name=g1476.t1:CDS:1
CP060341.1  Liftoff gene    8798    10381   .   +   .   ID=B9J08_001528;Name=hypothetical protein
CP060341.1  Liftoff mRNA    8798    10381   .   +   .   ID=B9J08_001528T0;Parent=B9J08_001528;Name=hypothetical protein
CP060341.1  Liftoff exon    8798    10381   .   +   0   ID=B9J08_001528.exon1;Parent=B9J08_001528T0;Name=g1475.t1:CDS:1
CP060341.1  Liftoff CDS 8798    10381   .   +   0   ID=cds.B9J08_001528;Parent=B9J08_001528T0;Name=g1475.t1:CDS:1
CP060345.1  Liftoff gene    770324  772909  .   +   .   ID=B9J08_002582;Name=hypothetical protein
CP060345.1  Liftoff mRNA    770324  772909  .   +   .   ID=B9J08_002582T0;Parent=B9J08_002582;Name=hypothetical protein
CP060345.1  Liftoff exon    770324  772909  .   +   0   ID=B9J08_002582.exon1;Parent=B9J08_002582T0;Name=g2509.t1:CDS:1
CP060345.1  Liftoff CDS 770324  772909  .   +   0   ID=cds.B9J08_002582;Parent=B9J08_002582T0;Name=g2509.t1:CDS:1
CP060345.1  Liftoff gene    774948  776690  .   +   .   ID=B9J08_002583;Name=hypothetical protein
CP060345.1  Liftoff mRNA    774948  776690  .   +   .   ID=B9J08_002583T0;Parent=B9J08_002583;Name=hypothetical protein
CP060345.1  Liftoff exon    774948  776690  .   +   0   ID=B9J08_002583.exon1;Parent=B9J08_002583T0;Name=g2510.t1:CDS:1
CP060345.1  Liftoff CDS 774948  776690  .   +   0   ID=cds.B9J08_002583;Parent=B9J08_002583T0;Name=g2510.t1:CDS:1
CP060345.1  Liftoff gene    777929  779770  .   +   .   ID=B9J08_002584;Name=hypothetical protein
CP060345.1  Liftoff mRNA    777929  779770  .   +   .   ID=B9J08_002584T0;Parent=B9J08_002584;Name=hypothetical protein
CP060345.1  Liftoff exon    777929  779770  .   +   0   ID=B9J08_002584.exon1;Parent=B9J08_002584T0;Name=g2511.t1:CDS:1
CP060345.1  Liftoff CDS 777929  779770  .   +   0   ID=cds.B9J08_002584;Parent=B9J08_002584T0;Name=g2511.t1:CDS:1
> agat_sp_manage_IDs.pl --gff test.gff3 --out test.IDs.gff3 --prefix "foobar_" --tair
> cat test.IDs.gff3
##gff-version 3
CP060339.1  Liftoff gene    2655    3026    .   +   .   ID=foobar_5;Name=hypothetical protein
CP060339.1  Liftoff mRNA    2655    3026    .   +   .   ID=foobar_5.1;Parent=foobar_5;Name=hypothetical protein
CP060339.1  Liftoff exon    2655    3026    .   +   0   ID=foobar_5.1-exon1;Parent=foobar_5.1;Name=g3982.t1:CDS:1
CP060339.1  Liftoff CDS 2655    3026    .   +   0   ID=foobar_5.1-cds2;Parent=foobar_5.1;Name=g3982.t1:CDS:1
CP060339.1  Liftoff gene    5717    6577    .   -   .   ID=foobar_6;Name=hypothetical protein
CP060339.1  Liftoff mRNA    5717    6577    .   -   .   ID=foobar_6.1;Parent=foobar_6;Name=hypothetical protein
CP060339.1  Liftoff exon    5717    6577    .   -   0   ID=foobar_6.1-exon1;Parent=foobar_6.1;Name=g3981.t1:CDS:1
CP060339.1  Liftoff CDS 5717    6577    .   -   0   ID=foobar_6.1-cds2;Parent=foobar_6.1;Name=g3981.t1:CDS:1
CP060339.1  Liftoff gene    7933    10269   .   -   .   ID=foobar_7;Name=hypothetical protein
CP060339.1  Liftoff mRNA    7933    10269   .   -   .   ID=foobar_7.1;Parent=foobar_7;Name=hypothetical protein
CP060339.1  Liftoff exon    7933    10269   .   -   0   ID=foobar_7.1-exon1;Parent=foobar_7.1;Name=g3980.t1:CDS:1
CP060339.1  Liftoff CDS 7933    10269   .   -   0   ID=foobar_7.1-cds2;Parent=foobar_7.1;Name=g3980.t1:CDS:1
CP060340.1  Liftoff gene    7166    7537    .   +   .   ID=foobar_1;Name=hypothetical protein
CP060340.1  Liftoff mRNA    7166    7537    .   +   .   ID=foobar_1.1;Parent=foobar_1;Name=hypothetical protein
CP060340.1  Liftoff exon    7166    7537    .   +   0   ID=foobar_1.1-exon1;Parent=foobar_1.1;Name=g1021.t1:CDS:1
CP060340.1  Liftoff CDS 7166    7537    .   +   0   ID=foobar_1.1-cds2;Parent=foobar_1.1;Name=g1021.t1:CDS:1
CP060341.1  Liftoff gene    5563    7965    .   +   .   ID=foobar_8;Name=hypothetical protein
CP060341.1  Liftoff mRNA    5563    7965    .   +   .   ID=foobar_8.1;Parent=foobar_8;Name=hypothetical protein
CP060341.1  Liftoff exon    5563    7965    .   +   0   ID=foobar_8.1-exon1;Parent=foobar_8.1;Name=g1476.t1:CDS:1
CP060341.1  Liftoff CDS 5563    7965    .   +   0   ID=foobar_8.1-cds2;Parent=foobar_8.1;Name=g1476.t1:CDS:1
CP060341.1  Liftoff gene    8798    10381   .   +   .   ID=foobar_9;Name=hypothetical protein
CP060341.1  Liftoff mRNA    8798    10381   .   +   .   ID=foobar_9.1;Parent=foobar_9;Name=hypothetical protein
CP060341.1  Liftoff exon    8798    10381   .   +   0   ID=foobar_9.1-exon1;Parent=foobar_9.1;Name=g1475.t1:CDS:1
CP060341.1  Liftoff CDS 8798    10381   .   +   0   ID=foobar_9.1-cds2;Parent=foobar_9.1;Name=g1475.t1:CDS:1
CP060345.1  Liftoff gene    770324  772909  .   +   .   ID=foobar_2;Name=hypothetical protein
CP060345.1  Liftoff mRNA    770324  772909  .   +   .   ID=foobar_2.1;Parent=foobar_2;Name=hypothetical protein
CP060345.1  Liftoff exon    770324  772909  .   +   0   ID=foobar_2.1-exon1;Parent=foobar_2.1;Name=g2509.t1:CDS:1
CP060345.1  Liftoff CDS 770324  772909  .   +   0   ID=foobar_2.1-cds2;Parent=foobar_2.1;Name=g2509.t1:CDS:1
CP060345.1  Liftoff gene    774948  776690  .   +   .   ID=foobar_3;Name=hypothetical protein
CP060345.1  Liftoff mRNA    774948  776690  .   +   .   ID=foobar_3.1;Parent=foobar_3;Name=hypothetical protein
CP060345.1  Liftoff exon    774948  776690  .   +   0   ID=foobar_3.1-exon1;Parent=foobar_3.1;Name=g2510.t1:CDS:1
CP060345.1  Liftoff CDS 774948  776690  .   +   0   ID=foobar_3.1-cds2;Parent=foobar_3.1;Name=g2510.t1:CDS:1
CP060345.1  Liftoff gene    777929  779770  .   +   .   ID=foobar_4;Name=hypothetical protein
CP060345.1  Liftoff mRNA    777929  779770  .   +   .   ID=foobar_4.1;Parent=foobar_4;Name=hypothetical protein
CP060345.1  Liftoff exon    777929  779770  .   +   0   ID=foobar_4.1-exon1;Parent=foobar_4.1;Name=g2511.t1:CDS:1
CP060345.1  Liftoff CDS 777929  779770  .   +   0   ID=foobar_4.1-cds2;Parent=foobar_4.1;Name=g2511.t1:CDS:1
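
The pattern in this output (consecutive numbers within a contig, jumps between contigs) can be checked with a short script. The following is a minimal Python sketch, assuming the prefix is followed directly by an integer as in the `foobar_` example; the helper names `gene_numbers_by_contig` and `is_consecutive` are hypothetical and not part of AGAT.

```python
import re


def gene_numbers_by_contig(gff_text, prefix):
    """Collect the numeric part of each gene ID, grouped by contig in file order."""
    id_pattern = re.compile(r"ID=" + re.escape(prefix) + r"(\d+)(?:;|$)")
    numbers = {}
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        # Split on any whitespace so the check works on tab- or space-separated text.
        fields = line.split()
        if len(fields) < 9 or fields[2] != "gene":
            continue
        match = id_pattern.search(line)
        if match:
            numbers.setdefault(fields[0], []).append(int(match.group(1)))
    return numbers


def is_consecutive(values):
    """True if each value is exactly one more than the previous one."""
    return all(b - a == 1 for a, b in zip(values, values[1:]))


# Gene lines copied from test.IDs.gff3 above.
gene_lines = """\
##gff-version 3
CP060339.1  Liftoff gene    2655    3026    .   +   .   ID=foobar_5;Name=hypothetical protein
CP060339.1  Liftoff gene    5717    6577    .   -   .   ID=foobar_6;Name=hypothetical protein
CP060339.1  Liftoff gene    7933    10269   .   -   .   ID=foobar_7;Name=hypothetical protein
CP060340.1  Liftoff gene    7166    7537    .   +   .   ID=foobar_1;Name=hypothetical protein
CP060341.1  Liftoff gene    5563    7965    .   +   .   ID=foobar_8;Name=hypothetical protein
CP060341.1  Liftoff gene    8798    10381   .   +   .   ID=foobar_9;Name=hypothetical protein
CP060345.1  Liftoff gene    770324  772909  .   +   .   ID=foobar_2;Name=hypothetical protein
CP060345.1  Liftoff gene    774948  776690  .   +   .   ID=foobar_3;Name=hypothetical protein
CP060345.1  Liftoff gene    777929  779770  .   +   .   ID=foobar_4;Name=hypothetical protein
"""

numbers = gene_numbers_by_contig(gene_lines, "foobar_")
file_order = [n for contig_numbers in numbers.values() for n in contig_numbers]
within_ok = all(is_consecutive(v) for v in numbers.values())
across_ok = is_consecutive(file_order)

print(numbers)
print("consecutive within each contig:", within_ok)
print("consecutive across the whole file:", across_ok)
```

On these lines the check reports consecutive numbering within every contig but not across the file, which is the behaviour described above.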

Juke34 commented 10 months ago

Right, I guess there is a de-sync between the way AGAT parses the dictionary to set the new IDs and the way it parses the dictionary to print the output. I will push a fix. Thank you for the feedback.
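
To illustrate the kind of de-sync suspected here, the following Python sketch numbers genes in one iteration order and prints them in another. It is a hypothetical illustration only: AGAT is written in Perl and its internals may differ, and the names `sorted_gene_ids` and `assign_ids` are invented for the example. The gene IDs and coordinates are taken from test.gff3 above.

```python
def sorted_gene_ids(genes):
    """Single ordering shared by numbering and printing: contig name, then start."""
    return sorted(genes, key=lambda gene_id: genes[gene_id])


def assign_ids(gene_order, prefix):
    """Number genes 1..N following the given iteration order."""
    return {gene_id: f"{prefix}{i}" for i, gene_id in enumerate(gene_order, start=1)}


# Original gene ID -> (contig, start). The dictionary order stands in for
# whatever order the internal data structure happens to yield.
genes = {
    "B9J08_001054": ("CP060340.1", 7166),
    "B9J08_002582": ("CP060345.1", 770324),
    "B9J08_004102": ("CP060339.1", 2655),
    "B9J08_001529": ("CP060341.1", 5563),
}

# De-synced: IDs follow the dictionary order, output follows the sorted order.
desynced = assign_ids(list(genes), "foobar_")
desynced_output = [desynced[g] for g in sorted_gene_ids(genes)]
print(desynced_output)  # ['foobar_3', 'foobar_1', 'foobar_4', 'foobar_2']

# In sync: the same ordering drives both numbering and printing.
synced = assign_ids(sorted_gene_ids(genes), "foobar_")
synced_output = [synced[g] for g in sorted_gene_ids(genes)]
print(synced_output)  # ['foobar_1', 'foobar_2', 'foobar_3', 'foobar_4']
```

Perl hashes have no defined iteration order, so if the numbering pass iterated over a hash without sorting, that could also explain why the numbers change from run to run. This is an assumption about the cause, not something confirmed in this thread.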