NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit
GNU General Public License v3.0

The output file is bigger than input gff #358

Closed ShirelyI closed 1 year ago

ShirelyI commented 1 year ago

Describe the bug

1. The output GFF file produced by agat_sp_filter_feature_by_attribute_value.pl is bigger than the input GFF file.
2. I tried to use agat_sp_keep_longest_isoform.pl to keep the longest transcripts in two ways: once with the GFF file generated by agat_sp_filter_feature_by_attribute_value.pl as input, and once with the original GFF file as input. The results are different. Which one should I use?

To Reproduce

The command for the first question is "agat_sp_filter_feature_by_attribute_value.pl --gff $cycadir/GCF_018340385.1_ASM1834038v1_genomic.gff.gz --attribute gene_biotype --value protein_coding -t '!' -o Cy_carpio.protein_coding.gff3 1>cyca_protein.log 2>&1"

The commands for the second question are "agat_sp_keep_longest_isoform.pl --gff Cy_carpio.protein_coding.gff3 -o Cy_carpio.longest_isoform.gff3 1>cyca_long.log 2>&1" and "agat_sp_keep_longest_isoform.pl --gff GCF_018340385.1_ASM1834038v1_genomic.gff -o Cy_carpio.longest_isoform.gff3 1>cyca_long.log 2>&1"

Input file description (downloaded from NCBI)

Expected behavior

1. I want to confirm whether it is correct for the output GFF file to be bigger than the input GFF file.
2. Which GFF file should be used as input for agat_sp_keep_longest_isoform.pl?

Screenshots: two images were attached to the original issue.

Juke34 commented 1 year ago

1) If you want to follow what AGAT is doing, you should first run agat_convert_sp_gxf2gxf.pl, which only standardizes your file. (agat_sp_filter_feature_by_attribute_value.pl and agat_sp_keep_longest_isoform.pl both standardize your file first and then do their own work on it.) The output file may be bigger because, when standardizing, AGAT can add missing features based on what already exists (e.g. mRNA and gene parents if you have only exon/CDS, UTRs if you have CDS and exons, exons if you have CDS and UTRs, etc.) in order to obtain a three-level organization such as gene > mRNA > exon/CDS/UTR.
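A rough way to see which features were added by standardization is to count feature types before and after. The sketch below is not part of AGAT: the function name count_feature_types and the file names in the usage comments are made up for illustration, and it assumes a tab-separated GFF with the feature type in column 3.

```shell
# Count how many features of each type (column 3) a GFF file contains.
# Works on plain or gzip-compressed files; lines starting with '#' are skipped.
# Hypothetical helper, not an AGAT script.
count_feature_types() {
    local file="$1"
    local reader="cat"
    case "$file" in
        *.gz) reader="gzip -dc" ;;
    esac
    $reader "$file" \
        | awk -F'\t' '!/^#/ && NF >= 9 { n[$3]++ } END { for (t in n) print t "\t" n[t] }' \
        | LC_ALL=C sort
}

# Example usage (file names are placeholders):
#   agat_convert_sp_gxf2gxf.pl --gff input.gff -o standardized.gff
#   count_feature_types input.gff        > before.tsv
#   count_feature_types standardized.gff > after.tsv
#   diff before.tsv after.tsv
```

Feature types whose counts go up in the diff (exon, UTR, and so on) are the ones that were added.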

Then you can apply agat_sp_keep_longest_isoform.pl and/or agat_sp_filter_feature_by_attribute_value.pl, and you will probably see the file decrease in size (provided the input file actually contained isoforms). You should have fewer lines in your files.
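To check whether a file contains isoforms at all, one option is to count the genes that have more than one mRNA child. This is only a sketch under assumptions: the helper name count_multi_isoform_genes is hypothetical, the input must be an uncompressed GFF3 whose mRNA lines carry a Parent= attribute, and other transcript types (e.g. lnc_RNA) are ignored.

```shell
# Print the number of genes that have more than one mRNA child.
# Assumes uncompressed GFF3 where each mRNA line has a Parent=<gene id> attribute.
# Hypothetical helper, not an AGAT script.
count_multi_isoform_genes() {
    awk -F'\t' '
        !/^#/ && $3 == "mRNA" {
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++)
                if (attrs[i] ~ /^Parent=/) count[substr(attrs[i], 8)]++
        }
        END {
            multi = 0
            for (g in count) if (count[g] > 1) multi++
            print multi
        }' "$1"
}

# Example usage (file names are placeholders):
#   count_multi_isoform_genes standardized.gff
#   count_multi_isoform_genes longest_isoform.gff
```

If the first number is greater than zero, the output of agat_sp_keep_longest_isoform.pl should contain fewer mRNA lines than the standardized input, and the second number is expected to be 0.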

2) With the original file, AGAT keeps all features and only removes isoforms, keeping the longest one for each gene that has several. So if the file contains, for example, repeat features, they will be kept. Shorter isoforms are removed for protein-coding genes as well as for non-protein-coding genes. With the file filtered by agat_sp_filter_feature_by_attribute_value.pl, you have only protein-coding genes.
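The difference between the two results can be made visible by tallying the gene_biotype attribute on gene lines in each output. Again just a sketch: count_gene_biotypes is a hypothetical helper, the file names are placeholders, and it assumes an uncompressed file where gene features carry a gene_biotype= attribute, as in the filtering command above.

```shell
# Tally gene features by the value of their gene_biotype attribute.
# Genes without the attribute are reported as "(none)".
# Hypothetical helper, not an AGAT script.
count_gene_biotypes() {
    awk -F'\t' '
        !/^#/ && $3 == "gene" {
            biotype = "(none)"
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++)
                if (attrs[i] ~ /^gene_biotype=/) biotype = substr(attrs[i], 14)
            count[biotype]++
        }
        END { for (b in count) print b "\t" count[b] }' "$1" | LC_ALL=C sort
}

# Example usage (file names are placeholders):
#   count_gene_biotypes longest_from_original.gff3
#   count_gene_biotypes longest_from_filtered.gff3
```

The run on the filtered file should list only protein_coding, while the run on the original file is expected to list other biotypes as well. Which file to use then depends on whether you need the non-protein-coding genes downstream.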