How to fix wrong input gff3 file format

romseg commented 1 year ago

Dear author,

I am trying to filter isoforms from the below potato_gene_models.gff3 file using agat_sp_keep_longest_isoform.pl and then extract the corresponding protein and nucleotide sequences from the assembly, however this gff3 file seems to be in the wrong format for agat_sp_keep_longest_isoform.pl to recognize it properly since a great number of protein and nucleotide sequences don't get translated properly using the output potato_gene_models.longestisoforms.gff3 file. I would appreciate it to know what needs to be fixed in the the potato_gene_models.gff3 file to make it in the right format for the AGAT tools. Thank you.

Input file: potato_gene_models.gff3

chr1_1  EVM gene    9904    11015   .   +   .   ID=RHC01H1G0001.2;Name=RHC01H1G0001.2;
chr1_1  EVM mRNA    9904    11015   .   +   .   ID=RHC01H1G0001.2;Parent=RHC01H1G0001.2;
chr1_1  EVM exon    9904    11015   .   +   .   ID=RHC01H1G0001.2.exon;Parent=RHC01H1G0001.2;
chr1_1  EVM CDS 9904    11015   .   +   0   ID=RHC01H1G0001.2.cds;Parent=RHC01H1G0001.2;
chr1_1  EVM gene    15553   18345   .   +   .   ID=RHC01H1G0002.2;Name=RHC01H1G0002.2;
chr1_1  EVM mRNA    15553   18345   .   +   .   ID=RHC01H1G0002.2;Parent=RHC01H1G0002.2;
chr1_1  EVM exon    15553   15909   .   +   .   ID=RHC01H1G0002.2.exon;Parent=RHC01H1G0002.2;
chr1_1  EVM CDS 15553   15909   .   +   0   ID=RHC01H1G0002.2.cds;Parent=RHC01H1G0002.2;
chr1_1  EVM exon    16471   16674   .   +   .   ID=RHC01H1G0002.2.exon;Parent=RHC01H1G0002.2;
chr1_1  EVM CDS 16471   16674   .   +   0   ID=RHC01H1G0002.2.cds;Parent=RHC01H1G0002.2;
chr1_1  EVM exon    17254   17682   .   +   .   ID=RHC01H1G0002.2.exon;Parent=RHC01H1G0002.2;
chr1_1  EVM CDS 17254   17682   .   +   0   ID=RHC01H1G0002.2.cds;Parent=RHC01H1G0002.2;
chr1_1  EVM exon    18340   18345   .   +   .   ID=RHC01H1G0002.2.exon;Parent=RHC01H1G0002.2;
chr1_1  EVM CDS 18340   18345   .   +   0   ID=RHC01H1G0002.2.cds;Parent=RHC01H1G0002.2;
chr1_1  EVM gene    23464   23780   .   +   .   ID=RHC01H1G0003.2;Name=RHC01H1G0003.2;
chr1_1  EVM mRNA    23464   23780   .   +   .   ID=RHC01H1G0003.2;Parent=RHC01H1G0003.2;
chr1_1  EVM exon    23464   23780   .   +   .   ID=RHC01H1G0003.2.exon;Parent=RHC01H1G0003.2;
chr1_1  EVM CDS 23464   23780   .   +   0   ID=RHC01H1G0003.2.cds;Parent=RHC01H1G0003.2;
chr1_1  EVM gene    79265   79966   .   -   .   ID=RHC01H1G0004.2;Name=RHC01H1G0004.2;
chr1_1  EVM mRNA    79265   79966   .   -   .   ID=RHC01H1G0004.2;Parent=RHC01H1G0004.2;
chr1_1  EVM exon    79829   79966   .   -   .   ID=RHC01H1G0004.2.exon;Parent=RHC01H1G0004.2;
chr1_1  EVM CDS 79829   79966   .   -   1   ID=RHC01H1G0004.2.cds;Parent=RHC01H1G0004.2;
chr1_1  EVM exon    79265   79367   .   -   .   ID=RHC01H1G0004.2.exon;Parent=RHC01H1G0004.2;
chr1_1  EVM CDS 79265   79367   .   -   1   ID=RHC01H1G0004.2.cds;Parent=RHC01H1G0004.2;

After launching agat with singularity: agat_0.9.2--pl5321hdfd78af_1.sif agat_sp_keep_longest_isoform.pl -gff potato_gene_models.gff3 -o potato_gene_models.longestisoforms.gff3

Output file: potato_gene_models.longestisoforms.gff3

##gff-version 3
chr8_1  EVM gene    5390    11363   .   +   .   ID=RHC08H1G0001.2;Name=RHC08H1G0001.2
chr8_1  EVM mRNA    5390    11363   .   +   .   ID=nbis-mrna-22663;Parent=RHC08H1G0001.2
chr8_1  EVM exon    5390    5416    .   +   .   ID=RHC08H1G0001.2.exon;Parent=nbis-mrna-22663
chr8_1  EVM exon    8286    8363    .   +   .   ID=nbis-exon-86346;Parent=nbis-mrna-22663
chr8_1  EVM exon    8465    8668    .   +   .   ID=nbis-exon-86347;Parent=nbis-mrna-22663
chr8_1  EVM exon    11312   11363   .   +   .   ID=nbis-exon-86348;Parent=nbis-mrna-22663
chr8_1  EVM CDS 5390    5416    .   +   0   ID=RHC08H1G0001.2.cds;Parent=nbis-mrna-22663
chr8_1  EVM CDS 8286    8363    .   +   0   ID=RHC08H1G0001.2.cds;Parent=nbis-mrna-22663
chr8_1  EVM CDS 8465    8668    .   +   0   ID=RHC08H1G0001.2.cds;Parent=nbis-mrna-22663
chr8_1  EVM CDS 11312   11363   .   +   0   ID=RHC08H1G0001.2.cds;Parent=nbis-mrna-22663
chr8_1  EVM gene    17854   18103   .   -   .   ID=RHC08H1G0002.2;Name=RHC08H1G0002.2
chr8_1  EVM mRNA    17854   18103   .   -   .   ID=nbis-mrna-22664;Parent=RHC08H1G0002.2
chr8_1  EVM exon    17854   17949   .   -   .   ID=nbis-exon-86349;Parent=nbis-mrna-22664
chr8_1  EVM exon    18017   18103   .   -   .   ID=RHC08H1G0002.2.exon;Parent=nbis-mrna-22664
chr8_1  EVM CDS 17854   17949   .   -   0   ID=RHC08H1G0002.2.cds;Parent=nbis-mrna-22664
chr8_1  EVM CDS 18017   18103   .   -   0   ID=RHC08H1G0002.2.cds;Parent=nbis-mrna-22664
chr8_1  EVM gene    19841   20112   .   +   .   ID=RHC08H1G0003.2;Name=RHC08H1G0003.2
chr8_1  EVM mRNA    19841   20112   .   +   .   ID=nbis-mrna-22665;Parent=RHC08H1G0003.2
chr8_1  EVM exon    19841   20112   .   +   .   ID=RHC08H1G0003.2.exon;Parent=nbis-mrna-22665
chr8_1  EVM CDS 19841   20112   .   +   2   ID=RHC08H1G0003.2.cds;Parent=nbis-mrna-22665

Best regards, Rom

Juke34 commented 1 year ago

Several possibilities

the annotation tools used made incomplete/wrong prediction.
the phase are wrong
the coordinate system is not correct (1-based expected but conversion from other format may have introduce error) see https://github.com/Juke34/coordinate_systems

To understand what is going on:

look at the CDS of problematic cases within a genome browser
look at the CDS by extracting location directly in the pasta (e.g. using gaas_fasta_domain_extractor.pl from GAAS )

If the problem is due to wrong phase you can try agat_sp_fix_cds_phases.pl You might also try to redefine your CDS within the exon with agat_sp_fix_longest_ORF.pl.

One last thing coming in mind, are you sure to use the correct codon table when translating your sequences?

romseg commented 1 year ago

Dear author,

When comparing the protein sequences before and after agat_sp_keep_longest_isoform.pl, there seems to be in many cases multiple stop codon insertions and frameshifts when using the longest isoforms gff. So I am wondering if the problem is that the coordinate system is not correct? and how to fix this. agat_sp_keep_longest_isoform.pl seems not to make any difference since its output gff still have the same number of genes and features.

The issue may also come from the script that I am using to translate the sequences. For this I am using the joinAnnoFastaFromJoingenes.py script from Braker, and this script seems to work for gff from Braker and Augustus, so it may not work for my gff file that was generated with a different annotation program. The codon table seems to be correct though. Thank you.

Before agat_sp_keep_longest_isoform.pl

>RHC01H1G0003.2
MKKMEEEEENMRREKRLKCINMEEDEDEDEEEGEILDDDDDDDEYEEEIENIEVPSQPVG
FFYPSTTPSSIVVSDALDPDLPVIYVNSAFESSTGYRADEVLGRN
>RHC01H1G0002.2
MAAYAAVTSLLQTLDHLSQTHTSHSLLYKKEQTEVLSEKYTFLKTFLEDFTNIFNEDIKM
KHLERMIQEAANGVEDTIDSHAYDSSVVVQSKRVRRKADMIFHQNLEYAIEEIGLIQREM
RFLSMEESWTLLRDKVFGNGGYPPELEKIGRYIGHQCQGLPLAVVAIGGLLSKMSKETSS
WENVAEKSVSELVHLRYIACPSFNGLVQSLCKLRNIQNLVIHDSHPFGSSMRTQSLPWEV
VNMPQLKHIHTKKLSLFISPPTSVLISERENHLQTLTGLMPSSCNDEVFLRIPNLKKLGI
LIVDESDTIQKCYCLDNLVHLTQLEKLKVET*
>RHC01H1G0005.2
MALSAREWIEPDETAKQFLTRVFSERPFLPLPPPLHRIPLRPGKVVEIVSPSPSSKTRIL
MQAAINCILPKEWKGVNYGGLERLVMFVDLDCRFDVLSLSRLLKQRIIRANDVAFPTSKG
KRSTWQWCLFAYD*
>RHC01H1G0018.2
MKKLRWAMDSGGFWELDLSTPITLNGQARPVPGDPLPLGLSRGSRLSRFQQIDFFQRFMA
MPFVPSFAANRGLLLQRVLSLPIAENWSAILLGQLNVQRFVSSLRKNKTKHLPDSSWLQS
IRRNFIQKSFYALAFCSELFLTPDDTLIISLDAYGDEKVPQKRAVLHHKFPHHNLTMEAA
WPGLFVDGNGSYWDVPFSLALDLASTTLDSGASYHLFFNNCAGSPKQYEGQHSDELPPPA
ALLPGFTAKGVVSLKKNIDLWRSEASMLKMVQPYDIFLSNPHISASWILGAVFSAYLGEN
SMKRQQSCSLRGLKDFDLRAQVANSAVSVDSFASASLTAQHGNFQRLFLDLTRVHTSFDF
PSGSKLLSGITSVACSLYNSQVPNVEALQAICPRASLSFQQQIIGPFSFRVDSEIAIDLK
KDWYLSVKNPVFAIEHALQVLWSAKAVAWYSPMQREFMVELRFFET*

After agat_sp_keep_longest_isoform.pl

>RHC01H1G0003.2
MKKMEEEEENMRREKRLKCINMEEDEDEDEEEGEILDDDDDDDEYEEEIENIEVPSQPVG
FFYPSTTPSSIVVSDALDPDLPVIYVNSAFESSTGYRADEVLGRNX
>RHC01H1G0002.2
MAAYAAVTSLLQTLDHLSQTHTSHSLLYKKEQTEVLSEKYTFLKTFLEDFTNIFNEDIKM
KHLERMIQEAANGVEDTIDSHAYDSSVVVQSKRVRRKADMIFHQNLEYAIEEIGLIQREM
RFLSMEESWTLLRDKVFGNGGYPPELEKIGRYIGHQCQGLPLAVVAIGGLLSKMSKETSS
WENVAEKSVSELVHLRYIACPSFNGLVQSLCKLRNIQNLVIHDSHPFGSSMRTQSLPWEV
VNMPQLKHIHTKKLSLFISPPTSVLISERENHLQTLTGLMPSSCNDEVFLRIPNLKKLGI
LIVDESDTIQKCYCLDNLVHLTQLEKLKVET*
>RHC01H1G0005.2
XVAFPTSKGKRSTWQWCLFAYD*AAINCILPKEWKGVNYGGLERLVMFVDLDCRFDVLSL
SRLLKQRIIRANDGVVGERMD*TRRNGEAISH*GFLRAAISTITSTSSSHSSPSRQGCRN
RQSFSFFKNSHSYAX
>RHC01H1G0018.2
IIGPFSFRVDSEIAIDLKKDWYLSVKNPVFAIEHALQVLWSAKAVAWYSPMQREFMVELR
FFET*VLFFLHISEKIQ*KGNNHAVYGV*KILTFEPK*QILLFQWIHLHLLHSLHSTEIS
KGCSWISLVSIQASTFLQGQNFFLE*PL*HAVFTILKYQMLRHYKQFVHVPHFLFSSSFL
IIILQWRQLGQDFLLMGTGVTGMYHSHSHLILHQQLSTLVLATICFSIIVQVHPSSMKVS
IVMNYHHLLLCFQVSLPKV*FP*RKILTFGEVKRLC*RWCNHMIYSCLILIFQHHGYLVC
YTAWSVKCPEICFFSKKK*N*ASARLLMASIH*KELYPKILLCPCLLFRAVSNTGRYIDH
KFGCLRG*EGASKKSSSSS*DEET*VGYGLRRILGIGLVDSNNSQRPGPASSWRPIALGV
ISRFEAFKVPTNRFLSAFHGHAFRPFFRRQQGPLASTGSFSSYC*KL

Best regards, Rom

Juke34 commented 1 year ago

Could you give a try to agat_sp_extract_sequences.pl to see if the problem remains?

Juke34 commented 1 year ago

You should also give a try to the last AGAT version (v1.1.0) I might have fixed some related issues.

romseg commented 1 year ago

I tried agat_sp_extract_sequences.pl and it worked very well. The script extracted all CDS and peptide sequences without errors. Thank you so much for your assistance.

NBISweden / AGAT

How to fix wrong input gff3 file format #372