NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit
GNU General Public License v3.0

500K Gene Models with Many Short Sequences: Valid AGAT Output or Command Error? #495

Open Vijithkumar2020 opened 1 month ago

Vijithkumar2020 commented 1 month ago

This concerns a recently assembled de novo plant genome. I used AGAT's sequence extraction tool to extract the gene models predicted by AUGUSTUS. The repeat-masked genome is 2.6 Gb, and the FASTA file produced by AGAT is ~600 Mb and contains 500K gene models. The command I used is shown below. I would like to know whether this is the right command, because my output file contains far too many short sequences.

agat_sp_extract_sequences.pl \
--gff /output_file.gff \
--fasta /media/masked.fasta \
--output /out.fasta \
-t gene --split
Juke34 commented 1 month ago

Have you checked the help? https://nbisweden.github.io/AGAT/tools/agat_sp_extract_sequences/#briefly-in-pictures I guess --split is useless here. If you want to extract everything from the start of the gene to the end of the gene (so the sequence contains UTRs + exons + introns), then -t gene is correct. To check what is in your file before running agat_sp_extract_sequences.pl, and to be sure the input GFF really contains 500K genes, run agat_sq_stat_basic.pl prior to your analysis.
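
As a tool-independent cross-check of the two numbers discussed above (how many gene features the GFF holds, and how many extracted sequences are short), here is a minimal sketch using plain awk. The function names `count_features` and `count_short_seqs` are hypothetical helpers, not part of AGAT, and any length threshold you pass is an arbitrary choice rather than a recommended cutoff. The sketch assumes a tab-separated GFF with the feature type in column 3 and a FASTA file whose sequences may be wrapped over several lines.

```shell
# Hypothetical helper (not part of AGAT): count features of one type in a GFF.
# $1 = GFF path, $2 = feature type as written in column 3 (e.g. gene)
count_features() {
    awk -F'\t' -v type="$2" '
        !/^#/ && $3 == type { n++ }   # skip comment/directive lines
        END { print n + 0 }           # "+ 0" prints 0 when nothing matched
    ' "$1"
}

# Hypothetical helper (not part of AGAT): count FASTA records shorter than a threshold.
# $1 = FASTA path, $2 = minimum length in bp (arbitrary, chosen by the user)
count_short_seqs() {
    awk -v min="$2" '
        /^>/ {                        # a header line starts a new record
            if (seen && len < min) nshort++
            seen = 1; len = 0; next
        }
        { len += length($0) }         # add up wrapped sequence lines
        END {
            if (seen && len < min) nshort++   # account for the last record
            print nshort + 0
        }
    ' "$1"
}
```

With the paths from the original post, `count_features /output_file.gff gene` should roughly match the number of records in the extracted FASTA, and `count_short_seqs /out.fasta 300` (300 bp being only an example threshold) gives a quick idea of how many of those records are short.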