genomeannotation / GAG

Generates an NCBI .tbl file of annotations on a genome.
MIT License

Fix_start_stop does not resolve all issues #197

Open emmannaemeka opened 3 years ago

emmannaemeka commented 3 years ago

I ran Fix_start_stop on my genome, but it does not resolve all issues:

Total sequence length 496122366
Number of genes 19350
Number of mRNAs 19350
Number of exons 119452
Number of introns 100102
Number of CDS 18371
Overlapping genes 384
Contained genes 61
CDS: complete 22
CDS: start, no stop 340
CDS: stop, no start 1205
CDS: no stop, no start 17783
Total gene length 82068175
Total mRNA length 82068175
Total exon length 25049228
Total intron length 57219151
Total CDS length 20883708
Shortest gene 17
Shortest mRNA 17
Shortest exon 1
Shortest intron 4
Shortest CDS 15
Longest gene 207760
Longest mRNA 207760
Longest exon 6759
Longest intron 199438
Longest CDS 10206
mean gene length 4241
mean mRNA length 4241
mean exon length 210
mean intron length 572
mean CDS length 1137
% of genome covered by genes 16.5
% of genome covered by CDS 4.2
mean mRNAs per gene 1
mean exons per mRNA 6
mean introns per mRNA 5

What could be the problem?

Neato-Nick commented 3 years ago

I'm sure it's much too late to be of help, but I'm posting for others who find this issue. I've come to prefer AGAT for most GFF processing; it is very actively maintained. agat_sp_fix_CDS_phases.pl will adjust the CDS phase to correct errors from intron adjustment, and agat_sp_fix_start_and_stop_codons.pl gives a clear report of how many start and stop codons can be added. You will have to manually look at genes that still have no start or stop codon; otherwise NCBI/tbl2asn will mark them as partial products. That's okay too, and it is common for genes called at contig ends.
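For anyone doing that manual check, here is a minimal Python sketch of how a CDS could be sorted into the four categories shown in the stats above. It is an illustration, not GAG's or AGAT's actual logic. It assumes the standard genetic code, treats ATG as the only start codon, and expects the CDS to be already spliced and on the coding strand. The function name `classify_cds` is hypothetical.

```python
# Assumption: standard genetic code, with ATG as the only accepted start codon.
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_cds(seq: str) -> str:
    """Label a spliced, coding-strand CDS with a GAG-style category name.

    Only the first and last three bases are inspected; phase and
    internal stop codons are not checked.
    """
    seq = seq.upper()
    has_start = seq[:3] == START_CODON
    has_stop = seq[-3:] in STOP_CODONS
    if has_start and has_stop:
        return "complete"
    if has_start:
        return "start, no stop"
    if has_stop:
        return "stop, no start"
    return "no stop, no start"


if __name__ == "__main__":
    # Toy sequences, made up for illustration only.
    for cds in ("ATGAAATAA", "ATGAAAAAA", "AAAAAATAG", "AAACCC"):
        print(cds, "->", classify_cds(cds))
```

A large "no stop, no start" count, as in the stats above, may be worth checking against strand and phase handling before assuming the gene models themselves are partial.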

davidjstudholme commented 3 years ago

Thank you, @Neato-Nick. That's a really helpful suggestion and I will try it next time.