Closed Neato-Nick closed 3 years ago
Could you provide a test case (gff + fasta)?
Sure, attached is a gff with the first 81 genes + corresponding sequence of the genome fasta (soft-masked, hopefully that doesn't matter?)
I don't see any premature stop codon in that test. How did you translate the sequences to make your check?
You can easily check the translation using AGAT:
agat_sp_extract_sequences.pl --gff PHRA15019.head_n773.gff.txt --fasta PHRA15019.head_n773.reformat.fasta.txt -p -o proteins.fa
Ah, maybe it's a translation problem then? Here was my command, I love AGAT and it's what I used :)
agat_sp_extract_sequences.pl -t mrna --protein --cfs -g PHRA15019.head_n773.gff.txt -f PHRA15019.head_n773.reformat.fasta.txt --output proteins.fa
mRNA is a biological concept (it is the mature RNA where the introns have been removed). AGAT looks only what is described in the file. With -t mrna
you say to AGAT to extract the feature type (e.i the line) with "mRNA" in the 3rd column. So it extracts all the sequence from start to the end, it does not devoid introns, it does focus only on the mRNA line (start - end location). So the sequence contains also intron, it is why when you translate it, all the ORFs are broken (except for single exon genes). I will update the help to make it clearer
Oh, so it extracts the features labeled as "mrna" but it doesn't actually extract the mRNAs. To actually extract the mRNAs, I want the translated CDS spliced for each gene?
I wish I had realized this earlier in my workflow but I feel eased that my annotation didn't yield loads of pseudogenes like I was concerned about
On Tue, Oct 20, 2020, 3:56 AM Jacques Dainat notifications@github.com wrote:
mRNA is a biological concept (it is the mature RNA where the introns have been removed). AGAT looks only what is described in the file. With -t mrna you say to AGAT to extract the feature type (e.i the line) with "mRNA" in the 3rd column. So it extracts all the sequence from start to the end, it does not devoid introns, it does focus only on the mRNA line (start - end location). So the sequence contains also intron, it is why when you translate it, all the ORFs are broken (except for single exon genes). I will update the help to make it clearer
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/NBISweden/AGAT/issues/80#issuecomment-712767723, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABMUDUV5NYS3SQINVEMZ4I3SLVUE7ANCNFSM4SNT6UQQ .
Thanks, I did fresh pull and have tried again. I have another question, when helping me you executed as
agat_sp_extract_sequences.pl --gff PHRA15019.head_n773.gff.txt --fasta PHRA15019.head_n773.reformat.fasta.txt -p -o proteins.fa
This relies on the default for -t
, which is 'cds'. In the documentation for this function, --full
states
"To actually extract an mRNA as it is defined biologicaly you need to use the -t exon
option wihtout the --full option and wihtout the --split option". This advice is repeated under -t
explanation, to use -t exon
to extract mRNA as defined biologically. So, should the command really be
agat_sp_extract_sequences.pl --gff PHRA15019.head_n773.gff.txt --fasta PHRA15019.head_n773.reformat.fasta.txt -p -o proteins.fa -t exon
?
Edit: Just saw your usage examples in the help text for _extract_sequences.pl. Very helpful, thanks!
Yes if you wish to extract the mRNA, but do not forget that mRNA contains UTRs, so translating this sequence does not make any sense.
Thanks for answering my silly question. I guess translating a region that is, by definition, not translated doesn't make sense. Thanks!
Hi,
In the newest release I noticed the script
agat_sp_flag_premature_stop_codons.pl
Running this on my gff and genome file resulted in 0 cases of mRNAs containing premature stop codons, and 0 genes being flagged as pseudogenes. But, there are plenty of transcripts that contain internal stop codons. What exactly is this script looking for? For example, should either of these mrnas have gotten flagged? Both of them are the only isoforms of their respective genes