NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit
GNU General Public License v3.0

Lines missing after using agat_convert_sp_gff2gtf.pl #502

Closed JC-therea closed 1 month ago

JC-therea commented 1 month ago

Many lines are missing, and some canonical genes are now classified as "RNA".

I wanted to fix a GTF file that I received from a colleague. This is the file.
The file is missing some important features, such as gene and transcript (or mRNA), so I ran agat_convert_sp_gff2gtf.pl to add them while keeping the file in GTF format.

agat_convert_sp_gff2gtf.pl --gtf 'Araport11_GFF3_genes_transposons.201606.corrected.gtf' -o atha_v2.gtf

I expected to receive a very similar file but with the features gene and mRNA.
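To make that expectation concrete, here is a minimal Python sketch of what a derived transcript line would span: the smallest exon start and the largest exon end for each transcript_id. It is only an illustration and not how AGAT builds its parent features; the function names and the `make_exon` helper are hypothetical, and the coordinates are three of the AT1G01010.1 exons from the original file.

```python
import re

# Matches GTF attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def parse_attributes(field):
    """Turn 'key "value"; key "value";' into a dict."""
    return dict(ATTR_RE.findall(field))

def transcript_spans(lines):
    """Map transcript_id -> (seqid, start, end, strand), derived from exon lines only."""
    spans = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        tid = parse_attributes(cols[8]).get("transcript_id")
        if tid is None:
            continue
        start, end = int(cols[3]), int(cols[4])
        if tid in spans:
            seqid, old_start, old_end, strand = spans[tid]
            spans[tid] = (seqid, min(old_start, start), max(old_end, end), strand)
        else:
            spans[tid] = (cols[0], start, end, cols[6])
    return spans

def make_exon(start, end):
    """Hypothetical helper: build one tab-separated exon line for the demo."""
    attrs = 'transcript_id "AT1G01010.1"; gene_id "AT1G01010";'
    return "\t".join(["Chr1", "Araport11", "exon", str(start), str(end), ".", "+", ".", attrs])

sample = [make_exon(3631, 3913), make_exon(3996, 4276), make_exon(5439, 5899)]
for tid, (seqid, start, end, strand) in transcript_spans(sample).items():
    print(tid, seqid, start, end, strand)  # AT1G01010.1 Chr1 3631 5899 +
```

A gene line would be derived the same way, grouping by gene_id instead of transcript_id.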

To compare the input and the output, here are the before and after.

Original file:

Chr1    Araport11       5UTR    3631    3759    .       +       .       transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11       exon    3631    3913    .       +       .       transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11       start_codon     3760    3762    .       +       .       transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11       CDS     3760    3913    .       +       0       transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11       exon    3996    4276    .       +       .       transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11       CDS     3996    4276    .       +       2       transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11       exon    4486    4605    .       +       .       transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11       CDS     4486    4605    .       +       0       transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11       exon    4706    5095    .       +       .       transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11       CDS     4706    5095    .       +       0       transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11       exon    5174    5326    .       +       .       transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11       CDS     5174    5326    .       +       0       transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11       CDS     5439    5630    .       +       0       transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11       exon    5439    5899    .       +       .       transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11       stop_codon      5628    5630    .       +       .       transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11       3UTR    5631    5899    .       +       .       transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11       exon    6788    7069    .       -       .       transcript_id "AT1G01020.2"; gene_id "AT1G01020";
Chr1    Araport11       3UTR    6788    7069    .       -       .       transcript_id "AT1G01020.2"; gene_id "AT1G01020";
Chr1    Araport11       3UTR    7157    7314    .       -       .       transcript_id "AT1G01020.2"; gene_id "AT1G01020";
Chr1    Araport11       exon    7157    7450    .       -       .       transcript_id "AT1G01020.2"; gene_id "AT1G01020";
Chr1    Araport11       CDS     7315    7450    .       -       1       transcript_id "AT1G01020.2"; gene_id "AT1G01020";
Chr1    Araport11       start_codon     7448    7450    .       -       .       transcript_id "AT1G01020.2"; gene_id "AT1G01020";
Chr1    Araport11       exon    7564    7649    .       -       .       transcript_id "AT1G01020.2"; gene_id "AT1G01020";
Chr1    Araport11       CDS     7564    7649    .       -       0       transcript_id "AT1G01020.2"; gene_id "AT1G01020";
Chr1    Araport11       exon    7762    7835    .       -       .       transcript_id "AT1G01020.2"; gene_id "AT1G01020";
Chr1    Araport11       CDS     7762    7835    .       -       2       transcript_id "AT1G01020.2"; gene_id "AT1G01020";
Chr1    Araport11       exon    7942    7987    .       -       .       transcript_id "AT1G01020.2"; gene_id "AT1G01020";
Chr1    Araport11       CDS     7942    7987    .       -       0       transcript_id "AT1G01020.2"; gene_id "AT1G01020";
Chr1    Araport11       exon    8236    8325    .       -       .       transcript_id "AT1G01020.2"; gene_id "AT1G01020";
Chr1    Araport11       CDS     8236    8325    .       -       0       transcript_id "AT1G01020.2"; gene_id "AT1G01020";
Chr1    Araport11       exon    8417    8464    .       -       .       transcript_id "AT1G01020.2"; gene_id "AT1G01020";
Chr1    Araport11       CDS     8417    8464    .       -       0       transcript_id "AT1G01020.2"; gene_id "AT1G01020";
Chr1    Araport11       stop_codon      8571    8573    .       -       .       transcript_id "AT1G01020.2"; gene_id "AT1G01020";
Chr1    Araport11       CDS     8571    8666    .       -       0       transcript_id "AT1G01020.2"; gene_id "AT1G01020";
Chr1    Araport11       exon    8571    8737    .       -       .       transcript_id "AT1G01020.2"; gene_id "AT1G01020";
Chr1    Araport11       5UTR    8667    8737    .       -       .       transcript_id "AT1G01020.2"; gene_id "AT1G01020";

After agat:

##gtf-version X
# GFF-like GTF i.e. not checked against any GTF specification. Conversion based on GFF input, standardised by AGAT.
chloroplast     AGAT    gene    4       76      .       -       .       gene_id "ATCG00010"; transcript_id "ATCG00010.1"; ID "ATCG00010";
chloroplast     AGAT    RNA     4       76      .       -       .       gene_id "ATCG00010"; transcript_id "ATCG00010.1"; ID "ATCG00010.1"; Parent "ATCG00010";
chloroplast     Araport11       exon    4       76      73      -       .       gene_id "ATCG00010"; transcript_id "ATCG00010.1"; ID "agat-exon-322058"; Parent "ATCG00010.1";
chloroplast     AGAT    gene    383     1444    .       -       .       gene_id "ATCG00020"; transcript_id "ATCG00020.1"; ID "ATCG00020";
chloroplast     AGAT    mRNA    383     1444    .       -       .       gene_id "ATCG00020"; transcript_id "ATCG00020.1"; ID "ATCG00020.1"; Parent "ATCG00020";
chloroplast     Araport11       exon    383     1444    .       -       .       gene_id "ATCG00020"; transcript_id "ATCG00020.1"; ID "agat-exon-322059"; Parent "ATCG00020.1";
chloroplast     Araport11       CDS     383     1444    .       -       0       gene_id "ATCG00020"; transcript_id "ATCG00020.1"; ID "agat-cds-286112"; Parent "ATCG00020.1";

As you can see, many lines related to CDSs and exons are missing. Given the original GTF, is this output expected? Do you think I should run another tool before agat_convert_sp_gff2gtf.pl?

Juke34 commented 1 month ago

Records classified as RNA are those that do not contain any CDS. AGAT cannot tell whether such a record is an ncRNA, tRNA, miscRNA, etc. The fact that it stops early and does not output everything is problematic. I had this bug in an earlier version where I forgot to remove a debug line, but it was supposed to be fixed in version 1.4. Could you check with 1.4.1? Can you also check that you are really using version 1.4?
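The rule described above can be sketched in a few lines of Python. This is only an illustration of the stated behaviour (a transcript with at least one CDS becomes mRNA, anything else becomes the generic RNA), not AGAT's actual code; `classify_transcripts` is a hypothetical name, and the records mirror the two chloroplast transcripts in the output shown earlier.

```python
def classify_transcripts(records):
    """records: iterable of (transcript_id, feature_type) pairs.

    Returns {transcript_id: "mRNA" or "RNA"}; a transcript is "mRNA"
    only if at least one of its child features is a CDS.
    """
    has_cds = {}
    for transcript_id, feature_type in records:
        has_cds[transcript_id] = has_cds.get(transcript_id, False) or feature_type == "CDS"
    return {tid: ("mRNA" if flag else "RNA") for tid, flag in has_cds.items()}

records = [
    ("ATCG00010.1", "exon"),  # exon only, no CDS
    ("ATCG00020.1", "exon"),
    ("ATCG00020.1", "CDS"),
]
labels = classify_transcripts(records)
print(labels)  # {'ATCG00010.1': 'RNA', 'ATCG00020.1': 'mRNA'}
```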

JC-therea commented 1 month ago

Thanks for your quick answer Jacques,

Thank you for the explanation of the RNA feature.

The version that I used in conda:

$ agat --version
v1.4.0

After updating to version 1.4.1, here is the output of the same command that I posted before:

##gtf-version X
# GFF-like GTF i.e. not checked against any GTF specification. Conversion based on GFF input, standardised by AGAT.
chloroplast     AGAT    gene    4       76      .       -       .       gene_id "ATCG00010"; ID "ATCG00010";
chloroplast     AGAT    RNA     4       76      .       -       .       gene_id "ATCG00010"; transcript_id "ATCG00010.1"; ID "ATCG00010.1"; Parent "ATCG00010";
chloroplast     Araport11       exon    4       76      73      -       .       gene_id "ATCG00010"; transcript_id "ATCG00010.1"; ID "agat-exon-322058"; Parent "ATCG00010.1";
chloroplast     AGAT    gene    383     1444    .       -       .       gene_id "ATCG00020"; ID "ATCG00020";
chloroplast     AGAT    mRNA    383     1444    .       -       .       gene_id "ATCG00020"; transcript_id "ATCG00020.1"; ID "ATCG00020.1"; Parent "ATCG00020";
chloroplast     Araport11       exon    383     1444    .       -       .       gene_id "ATCG00020"; transcript_id "ATCG00020.1"; ID "agat-exon-322059"; Parent "ATCG00020.1";
chloroplast     Araport11       CDS     383     1444    .       -       0       gene_id "ATCG00020"; transcript_id "ATCG00020.1"; ID "agat-cds-286112"; Parent "ATCG00020.1";

However, this time I saw many warnings:

Warning: at3g19820.3 stop codon not adjacent to the CDS
Warning: at3g19830.1 stop codon not adjacent to the CDS
Warning: at3g19830.2 stop codon not adjacent to the CDS

Maybe this is the problem... Do you think that I should remove features that are not CDS or exon to avoid those warnings and maybe fix the file? Here are all the features:

$ cut -f 3 'Araport11_GFF3_genes_transposons.201606.corrected.gtf' | sort | uniq -c
  52672 3UTR
  60686 5UTR
 286355 CDS
 322385 exon
  48095 start_codon
  48106 stop_codon
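
The same tally can be run on both the input and the converted file to check whether exon and CDS lines were actually dropped, rather than judging from the first lines of each file. Below is a minimal Python sketch under that assumption; `feature_counts` and `compare_counts` are hypothetical names, and the commented-out file names are the ones used in the command above.

```python
from collections import Counter

def feature_counts(lines):
    """Count column-3 feature types, skipping comment and blank lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3:
            counts[cols[2]] += 1
    return counts

def compare_counts(before, after, features=("exon", "CDS")):
    """Return {feature: (count_before, count_after)} for the chosen features."""
    return {feature: (before[feature], after[feature]) for feature in features}

# Tiny in-memory demo; real use would pass open file handles instead:
# with open("Araport11_GFF3_genes_transposons.201606.corrected.gtf") as a, open("atha_v2.gtf") as b:
#     print(compare_counts(feature_counts(a), feature_counts(b)))
before = feature_counts(["Chr1\tAraport11\texon\t1\t10", "Chr1\tAraport11\tCDS\t2\t9"])
after = feature_counts(["##gtf-version X", "Chr1\tAGAT\tgene\t1\t10",
                        "Chr1\tAraport11\texon\t1\t10", "Chr1\tAraport11\tCDS\t2\t9"])
print(compare_counts(before, after))  # {'exon': (1, 1), 'CDS': (1, 1)}
```

Counts for other feature types are expected to differ, since the converted file gains gene and mRNA/RNA lines.
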

Juke34 commented 1 month ago

The warning should not stop the process. I will have to investigate the problem. Sorry.

JC-therea commented 1 month ago

Don't worry, I'm not in a hurry! Your tool has helped me immensely on countless occasions.

Thank you so much for your work and for AGAT.

Bests

JC-therea commented 1 month ago

Dear Jacques,

After reviewing the input and output of AGAT in more detail, I realized the error was mine when visualizing the created file. I apologize for any time I may have taken from you with this issue.

Bests

Juke34 commented 1 month ago

Great! Thank you for your feedback.