NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit
GNU General Public License v3.0

agat_sp_compare_two_annotations gene number was inconsistent #444

Closed dongdongdong0203 closed 3 months ago

dongdongdong0203 commented 3 months ago

Describe the bug

I used the agat_sp_compare_two_annotations.pl script to compare the reference GTF with the GTF predicted from a full-length transcriptome, in order to identify differences in the predicted transcripts or genes.


To Reproduce

agat_sp_compare_two_annotations.pl -gff1 $Refgtf -gff2 ${inpath}/OUT.extended_annotation.gtf -o ${inpath}/${sample[$SLURM_ARRAY_TASK_ID]}

However, the number of genes reported in the results file does not match what I counted with awk:

awk '$3=="gene"' $Refgtf | wc -l
35670
awk '$3=="gene"' OUT.extended_annotation.gtf | wc -l
33226

Does this result make sense, or is there a problem with my command?

Thanks

Juke34 commented 3 months ago

The awk command does not distinguish between types of "gene", while AGAT does. What you show is the table for genes that have a transcript and a CDS. I guess you have another table with the results for genes that have a transcript but no CDS (only exons, as for non-coding genes), i.e. gene@transcript@exon. You may also have tables for genes that have a tRNA and exons, i.e. gene@trna@exon, or gene@pseudogene@exon, etc. So you have to sum all of these results to get the total.
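To make the point above concrete, here is a minimal sketch (not part of AGAT) that splits the gene rows of a GTF into those whose gene_id also appears on a CDS row and those whose gene_id does not. The helper name count_genes_by_cds and the tiny demo file are made up for illustration, and the sketch assumes every row carries a gene_id "..." attribute. It only imitates the coding vs. non-coding split; AGAT's own grouping logic is more detailed.

```shell
# Hypothetical helper (not an AGAT command): count genes with and without CDS.
# Assumes GTF-style attributes, i.e. gene_id "..." in column 9 of every row.
count_genes_by_cds() {
  awk -F'\t' '
    /^#/ { next }                          # skip comment lines
    {
      id = ""
      if (match($9, /gene_id "[^"]+"/))    # pull out the gene_id value
        id = substr($9, RSTART + 9, RLENGTH - 10)
    }
    $3 == "gene" { genes[id] = 1 }         # every gene row seen
    $3 == "CDS"  { coding[id] = 1 }        # genes that own at least one CDS
    END {
      for (g in genes) { if (g in coding) n_cds++; else n_nocds++ }
      printf "with_CDS=%d without_CDS=%d total=%d\n", n_cds, n_nocds, n_cds + n_nocds
    }
  ' "$1"
}

# Made-up demo data: g1 is coding (has a CDS), g2 has exons only.
demo_gtf=$(mktemp)
{
  printf 'chr1\tdemo\tgene\t1\t100\t.\t+\t.\tgene_id "g1";\n'
  printf 'chr1\tdemo\ttranscript\t1\t100\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
  printf 'chr1\tdemo\texon\t1\t100\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
  printf 'chr1\tdemo\tCDS\t10\t90\t.\t+\t0\tgene_id "g1"; transcript_id "t1";\n'
  printf 'chr1\tdemo\tgene\t200\t300\t.\t+\t.\tgene_id "g2";\n'
  printf 'chr1\tdemo\ttranscript\t200\t300\t.\t+\t.\tgene_id "g2"; transcript_id "t2";\n'
  printf 'chr1\tdemo\texon\t200\t300\t.\t+\t.\tgene_id "g2"; transcript_id "t2";\n'
} > "$demo_gtf"

count_genes_by_cds "$demo_gtf"   # prints: with_CDS=1 without_CDS=1 total=2
```

The total is what a plain awk '$3=="gene"' count would report, while the two partial counts correspond loosely to separate per-category tables.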

dongdongdong0203 commented 3 months ago

Thank you for your prompt response and comments.

As you correctly guessed, my results included 'gene@transcript@cds' and 'gene@transcript@exon'. The total number of genes for both results is consistent with the number of genes in the GTF files.

However, the predicted GTF file does not contain any CDS features (as shown below). Could you please explain how 'gene@transcript@cds' can be compared in this case, given that my predicted GTF has no CDS? My aim is to compare differences at the transcript level. If necessary, please advise on how to modify the config file. As a beginner, I would appreciate your assistance.

Thanks.

[image: feature types in the predicted GTF]

Juke34 commented 3 months ago

Did you check with awk or agat_sq_stat_basic.pl that you do not have any CDS? You must have CDS in at least one of the files.
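For the awk route, a minimal sketch of such a check is to tally column 3 (the feature type) of each file. The helper name count_feature_types and the demo file below are hypothetical; the sketch assumes a tab-separated GTF/GFF file with optional comment lines starting with "#".

```shell
# Hypothetical helper: tally feature types (column 3) of a GTF/GFF file.
count_feature_types() {
  awk -F'\t' '!/^#/ && NF >= 3 { n[$3]++ }
              END { for (t in n) print t "\t" n[t] }' "$1" | sort
}

# Made-up demo data standing in for a predicted GTF without CDS rows.
pred_gtf=$(mktemp)
{
  printf 'chr1\tdemo\tgene\t1\t300\t.\t+\t.\tgene_id "g1";\n'
  printf 'chr1\tdemo\ttranscript\t1\t300\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
  printf 'chr1\tdemo\texon\t1\t100\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
  printf 'chr1\tdemo\texon\t200\t300\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
} > "$pred_gtf"

# One line per feature type with its count; no CDS line means no CDS rows.
count_feature_types "$pred_gtf"
```

Running the same tally on both input files shows directly which of them contributes the CDS features.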

dongdongdong0203 commented 3 months ago

Dear @Juke34

The reference GTF used in the analysis contains CDS, whereas the predicted GTF file does not.

Additionally, two GTF files without CDS were tested, resulting in a comparison of only gene@transcript@exon, which is the desired outcome from agat_sp_compare_two_annotations.pl. Thanks for your patience!
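The thread does not say how the CDS-free files were prepared. One possible way, shown here only as a sketch, is to drop the CDS rows from a copy of the reference before running the comparison. The helper name strip_cds and the demo file are made up, and the filter removes CDS rows only: other coding-related rows (for example start_codon or stop_codon, if present) are left in place, and how AGAT treats those is not covered here.

```shell
# Hypothetical helper: print a GTF without its CDS rows (comments are kept).
strip_cds() {
  awk -F'\t' '/^#/ || $3 != "CDS"' "$1"
}

# Made-up demo data standing in for a reference GTF that contains a CDS.
ref_gtf=$(mktemp)
{
  printf 'chr1\tdemo\tgene\t1\t100\t.\t+\t.\tgene_id "g1";\n'
  printf 'chr1\tdemo\ttranscript\t1\t100\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
  printf 'chr1\tdemo\texon\t1\t100\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
  printf 'chr1\tdemo\tCDS\t10\t90\t.\t+\t0\tgene_id "g1"; transcript_id "t1";\n'
} > "$ref_gtf"

# Write the filtered copy next to the original; the output name is arbitrary.
strip_cds "$ref_gtf" > "${ref_gtf}.nocds.gtf"
```

The filtered copy can then be passed to the comparison in place of the original reference, so that both inputs are described at the exon level only.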

Best, RUAN