romseg closed this issue 1 year ago
Several possibilities
To understand what is going on:
If the problem is due to wrong phases, you can try agat_sp_fix_cds_phases.pl.
You might also try redefining your CDS within the exons with agat_sp_fix_longest_ORF.pl.
One last thing that comes to mind: are you sure you are using the correct codon table when translating your sequences?
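To make the phase point concrete, here is a minimal Python sketch (not part of AGAT) that recomputes the phase each CDS segment should carry and reports segments whose GFF3 column 8 disagrees. It assumes the first CDS segment of every transcript starts at phase 0, which is not true for partial gene models, and the function names and example lines are hypothetical.

```python
from collections import defaultdict


def expected_phases(cds_segments, strand):
    """Return [((start, end), phase), ...] in transcription order.

    cds_segments: (start, end) pairs, 1-based inclusive, for one transcript.
    Assumption: the first segment in transcription order has phase 0.
    """
    ordered = sorted(cds_segments, reverse=(strand == "-"))
    phases = []
    phase = 0
    for start, end in ordered:
        phases.append(((start, end), phase))
        length = end - start + 1
        # Bases left over after the last complete codon decide how many
        # bases the next segment must skip to reach a codon start.
        phase = (3 - (length - phase) % 3) % 3
    return phases


def check_gff3_phases(lines):
    """Yield (parent, (start, end), recorded_phase, expected_phase) on mismatch."""
    by_parent = defaultdict(list)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        parent = attrs.get("Parent", "")
        by_parent[parent].append((int(cols[3]), int(cols[4]), cols[6], cols[7]))
    for parent, feats in by_parent.items():
        strand = feats[0][2]
        recorded = {(s, e): p for s, e, _, p in feats}
        segments = [(s, e) for s, e, _, _ in feats]
        for seg, exp in expected_phases(segments, strand):
            if recorded[seg] != str(exp):
                yield parent, seg, recorded[seg], exp


if __name__ == "__main__":
    # Hypothetical two-segment transcript; the second phase is wrong on purpose.
    example = [
        "chr1\tsrc\tCDS\t1\t10\t.\t+\t0\tID=c1;Parent=t1",
        "chr1\tsrc\tCDS\t21\t30\t.\t+\t0\tID=c2;Parent=t1",
    ]
    for mismatch in check_gff3_phases(example):
        print(mismatch)
```

If many transcripts are reported, wrong phases are a plausible cause of the frameshifts; if none are, the codon table or the translation script is the more likely suspect.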
Dear author,
When comparing the protein sequences before and after agat_sp_keep_longest_isoform.pl, many sequences translated from the longest-isoform GFF show multiple stop codon insertions and frameshifts. I am wondering whether the problem is that the coordinate system is incorrect, and how to fix it. agat_sp_keep_longest_isoform.pl also seems to make no difference, since its output GFF still has the same number of genes and features.
The issue may also come from the script I am using to translate the sequences, joinAnnoFastaFromJoingenes.py from Braker. That script seems to work for GFF files from Braker and Augustus, so it may not work for my GFF file, which was generated with a different annotation program. The codon table seems to be correct, though. Thank you.
Before agat_sp_keep_longest_isoform.pl
>RHC01H1G0003.2
MKKMEEEEENMRREKRLKCINMEEDEDEDEEEGEILDDDDDDDEYEEEIENIEVPSQPVG
FFYPSTTPSSIVVSDALDPDLPVIYVNSAFESSTGYRADEVLGRN
>RHC01H1G0002.2
MAAYAAVTSLLQTLDHLSQTHTSHSLLYKKEQTEVLSEKYTFLKTFLEDFTNIFNEDIKM
KHLERMIQEAANGVEDTIDSHAYDSSVVVQSKRVRRKADMIFHQNLEYAIEEIGLIQREM
RFLSMEESWTLLRDKVFGNGGYPPELEKIGRYIGHQCQGLPLAVVAIGGLLSKMSKETSS
WENVAEKSVSELVHLRYIACPSFNGLVQSLCKLRNIQNLVIHDSHPFGSSMRTQSLPWEV
VNMPQLKHIHTKKLSLFISPPTSVLISERENHLQTLTGLMPSSCNDEVFLRIPNLKKLGI
LIVDESDTIQKCYCLDNLVHLTQLEKLKVET*
>RHC01H1G0005.2
MALSAREWIEPDETAKQFLTRVFSERPFLPLPPPLHRIPLRPGKVVEIVSPSPSSKTRIL
MQAAINCILPKEWKGVNYGGLERLVMFVDLDCRFDVLSLSRLLKQRIIRANDVAFPTSKG
KRSTWQWCLFAYD*
>RHC01H1G0018.2
MKKLRWAMDSGGFWELDLSTPITLNGQARPVPGDPLPLGLSRGSRLSRFQQIDFFQRFMA
MPFVPSFAANRGLLLQRVLSLPIAENWSAILLGQLNVQRFVSSLRKNKTKHLPDSSWLQS
IRRNFIQKSFYALAFCSELFLTPDDTLIISLDAYGDEKVPQKRAVLHHKFPHHNLTMEAA
WPGLFVDGNGSYWDVPFSLALDLASTTLDSGASYHLFFNNCAGSPKQYEGQHSDELPPPA
ALLPGFTAKGVVSLKKNIDLWRSEASMLKMVQPYDIFLSNPHISASWILGAVFSAYLGEN
SMKRQQSCSLRGLKDFDLRAQVANSAVSVDSFASASLTAQHGNFQRLFLDLTRVHTSFDF
PSGSKLLSGITSVACSLYNSQVPNVEALQAICPRASLSFQQQIIGPFSFRVDSEIAIDLK
KDWYLSVKNPVFAIEHALQVLWSAKAVAWYSPMQREFMVELRFFET*
After agat_sp_keep_longest_isoform.pl
>RHC01H1G0003.2
MKKMEEEEENMRREKRLKCINMEEDEDEDEEEGEILDDDDDDDEYEEEIENIEVPSQPVG
FFYPSTTPSSIVVSDALDPDLPVIYVNSAFESSTGYRADEVLGRNX
>RHC01H1G0002.2
MAAYAAVTSLLQTLDHLSQTHTSHSLLYKKEQTEVLSEKYTFLKTFLEDFTNIFNEDIKM
KHLERMIQEAANGVEDTIDSHAYDSSVVVQSKRVRRKADMIFHQNLEYAIEEIGLIQREM
RFLSMEESWTLLRDKVFGNGGYPPELEKIGRYIGHQCQGLPLAVVAIGGLLSKMSKETSS
WENVAEKSVSELVHLRYIACPSFNGLVQSLCKLRNIQNLVIHDSHPFGSSMRTQSLPWEV
VNMPQLKHIHTKKLSLFISPPTSVLISERENHLQTLTGLMPSSCNDEVFLRIPNLKKLGI
LIVDESDTIQKCYCLDNLVHLTQLEKLKVET*
>RHC01H1G0005.2
XVAFPTSKGKRSTWQWCLFAYD*AAINCILPKEWKGVNYGGLERLVMFVDLDCRFDVLSL
SRLLKQRIIRANDGVVGERMD*TRRNGEAISH*GFLRAAISTITSTSSSHSSPSRQGCRN
RQSFSFFKNSHSYAX
>RHC01H1G0018.2
IIGPFSFRVDSEIAIDLKKDWYLSVKNPVFAIEHALQVLWSAKAVAWYSPMQREFMVELR
FFET*VLFFLHISEKIQ*KGNNHAVYGV*KILTFEPK*QILLFQWIHLHLLHSLHSTEIS
KGCSWISLVSIQASTFLQGQNFFLE*PL*HAVFTILKYQMLRHYKQFVHVPHFLFSSSFL
IIILQWRQLGQDFLLMGTGVTGMYHSHSHLILHQQLSTLVLATICFSIIVQVHPSSMKVS
IVMNYHHLLLCFQVSLPKV*FP*RKILTFGEVKRLC*RWCNHMIYSCLILIFQHHGYLVC
YTAWSVKCPEICFFSKKK*N*ASARLLMASIH*KELYPKILLCPCLLFRAVSNTGRYIDH
KFGCLRG*EGASKKSSSSS*DEET*VGYGLRRILGIGLVDSNNSQRPGPASSWRPIALGV
ISRFEAFKVPTNRFLSAFHGHAFRPFFRRQQGPLASTGSFSSYC*KL
Best regards, Rom
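The before/after comparison above can be automated with a short Python sketch like the one below. It is only an illustration: the function names are hypothetical, and it assumes both protein FASTA files use the same sequence IDs and mark stop codons with `*`.

```python
def read_fasta(lines):
    """Parse FASTA lines into {sequence_id: sequence}."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the ID.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def internal_stops(protein):
    """Count '*' characters, ignoring a single terminal stop."""
    body = protein[:-1] if protein.endswith("*") else protein
    return body.count("*")


def compare(before, after):
    """Return {sequence_id: internal stop count} for sequences that changed."""
    report = {}
    for seq_id, seq in after.items():
        if before.get(seq_id) != seq:
            report[seq_id] = internal_stops(seq)
    return report


if __name__ == "__main__":
    # Hypothetical file names; replace with the two protein FASTA files.
    with open("before.aa") as fh:
        before = read_fasta(fh)
    with open("after.aa") as fh:
        after = read_fasta(fh)
    for seq_id, stops in sorted(compare(before, after).items()):
        print(seq_id, stops)
```

A record such as RHC01H1G0005.2 above would be listed with a non-zero count, while RHC01H1G0002.2, which is identical in both outputs, would not be listed at all.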
Could you try agat_sp_extract_sequences.pl to see if the problem remains?
You should also try the latest AGAT version (v1.1.0); I might have fixed some related issues.
I tried agat_sp_extract_sequences.pl and it worked very well. The script extracted all CDS and peptide sequences without errors. Thank you so much for your assistance.
Dear author,
I am trying to filter isoforms from the potato_gene_models.gff3 file below using agat_sp_keep_longest_isoform.pl and then extract the corresponding protein and nucleotide sequences from the assembly. However, this gff3 file seems to be in the wrong format for agat_sp_keep_longest_isoform.pl to recognize it properly: a great number of protein and nucleotide sequences are not translated properly when using the output potato_gene_models.longestisoforms.gff3 file. I would appreciate knowing what needs to be fixed in the potato_gene_models.gff3 file to put it in the right format for the AGAT tools. Thank you.
Input file: potato_gene_models.gff3
After launching agat with singularity: agat_0.9.2--pl5321hdfd78af_1.sif agat_sp_keep_longest_isoform.pl -gff potato_gene_models.gff3 -o potato_gene_models.longestisoforms.gff3
Output file: potato_gene_models.longestisoforms.gff3
Best regards, Rom
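One quick way to see whether there is anything for the isoform filter to remove is to count transcripts per gene in the input and output files. The Python sketch below is an assumption-laden illustration, not an AGAT feature: it assumes transcripts are typed `mRNA` or `transcript` and point to their gene through a `Parent` attribute, and the function names are hypothetical.

```python
from collections import Counter


def isoforms_per_gene(lines, transcript_types=("mRNA", "transcript")):
    """Count transcript-level features per Parent gene in GFF3 lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in transcript_types:
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        parent = attrs.get("Parent")
        if parent:
            counts[parent] += 1
    return counts


def multi_isoform_genes(lines):
    """Return the sorted IDs of genes that have more than one transcript."""
    counts = isoforms_per_gene(lines)
    return sorted(gene for gene, n in counts.items() if n > 1)


if __name__ == "__main__":
    with open("potato_gene_models.gff3") as fh:
        genes = multi_isoform_genes(fh)
    print(len(genes), "genes with more than one isoform")
```

If the count is zero for the input file, an unchanged number of genes and features after filtering is expected rather than a sign of a format problem.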