Missing premature stop codons

Neato-Nick commented 4 years ago

Hi,

In the newest release I noticed the script agat_sp_flag_premature_stop_codons.pl

Running this on my gff and genome file resulted in 0 cases of mRNAs containing premature stop codons, and 0 genes being flagged as pseudogenes. But, there are plenty of transcripts that contain internal stop codons. What exactly is this script looking for? For example, should either of these mrnas have gotten flagged? Both of them are the only isoforms of their respective genes

>sometaxon_6.1 gene=sometaxon_6 seq_id=sometaxon_s0001 type=mrna
MDGGATPAAGTAGTSASAAMIKDPEIKYICGDCGIVNYLSARDPIRCRACGYRIMYKART
KRLIQFEAR*ADGDAQRPTAAGAIA*LCSSGS*YQGLGLEMRGRRHQRNRVAKR
>sometaxon_81.1 gene=sometaxon_81 seq_id=sometaxon_s0001 type=mrna
MMKKEHSMAAASGDPFYVFKECVVRRALP*PPHPCLTLHLPVQ*AGEQSVGCKPDARQVA
RRAGGQRLPRRQGATGTHTPCVLYRLQQSLTNIGKGC*LVVMQRLKAQWRLQTSRSSSWR
RPSSWWRRTEPSLSTLTRYGRRVVEWVHKF*RLGAAKQAEIASRKAFVATTRKVRGG*FA
G*WATGV*WLGL*ELQAVRDEISTDAVKARIRKEERKVRIGTFNFCTGWR*LLAVVTGRS
*CNQRNRRRRSGRPFLDRRGMSAFWPMRRSGSRYERWLRMEKCACN*RRALICEFV*QIM
QEQDQSLAGLQTDITRLHGVTVEISSEVKHQNKYVPQNVFLLD*SELFLTSLSRRMLDDL
NDDVEEAQERMNFVMGRLSKLLKTKGERDFIFCSVRSGEVLKRRVCLLVLQTSASLG*FS
SWWQYSLSWSSWSCTH

Juke34 commented 4 years ago

Could you provide a test case (gff + fasta)?

Neato-Nick commented 4 years ago

Sure, attached is a gff with the first 81 genes + corresponding sequence of the genome fasta (soft-masked, hopefully that doesn't matter?)

Juke34 commented 3 years ago

I don't see any premature stop codon in that test. How did you translate the sequences to make your check?

You can easily check the translation using AGAT: agat_sp_extract_sequences.pl --gff PHRA15019.head_n773.gff.txt --fasta PHRA15019.head_n773.reformat.fasta.txt -p -o proteins.fa

Neato-Nick commented 3 years ago

Ah, maybe it's a translation problem then? Here was my command, I love AGAT and it's what I used :)

agat_sp_extract_sequences.pl -t mrna --protein --cfs -g PHRA15019.head_n773.gff.txt -f PHRA15019.head_n773.reformat.fasta.txt --output proteins.fa

Juke34 commented 3 years ago

mRNA is a biological concept (it is the mature RNA where the introns have been removed). AGAT looks only what is described in the file. With -t mrna you say to AGAT to extract the feature type (e.i the line) with "mRNA" in the 3rd column. So it extracts all the sequence from start to the end, it does not devoid introns, it does focus only on the mRNA line (start - end location). So the sequence contains also intron, it is why when you translate it, all the ORFs are broken (except for single exon genes). I will update the help to make it clearer

Neato-Nick commented 3 years ago

Oh, so it extracts the features labeled as "mrna" but it doesn't actually extract the mRNAs. To actually extract the mRNAs, I want the translated CDS spliced for each gene?

I wish I had realized this earlier in my workflow but I feel eased that my annotation didn't yield loads of pseudogenes like I was concerned about

On Tue, Oct 20, 2020, 3:56 AM Jacques Dainat notifications@github.com wrote:

mRNA is a biological concept (it is the mature RNA where the introns have been removed). AGAT looks only what is described in the file. With -t mrna you say to AGAT to extract the feature type (e.i the line) with "mRNA" in the 3rd column. So it extracts all the sequence from start to the end, it does not devoid introns, it does focus only on the mRNA line (start - end location). So the sequence contains also intron, it is why when you translate it, all the ORFs are broken (except for single exon genes). I will update the help to make it clearer

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/NBISweden/AGAT/issues/80#issuecomment-712767723, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABMUDUV5NYS3SQINVEMZ4I3SLVUE7ANCNFSM4SNT6UQQ .

Neato-Nick commented 3 years ago

Thanks, I did fresh pull and have tried again. I have another question, when helping me you executed as

agat_sp_extract_sequences.pl --gff PHRA15019.head_n773.gff.txt --fasta PHRA15019.head_n773.reformat.fasta.txt -p -o proteins.fa

This relies on the default for -t, which is 'cds'. In the documentation for this function, --full states "To actually extract an mRNA as it is defined biologicaly you need to use the -t exon option wihtout the --full option and wihtout the --split option". This advice is repeated under -t explanation, to use -t exon to extract mRNA as defined biologically. So, should the command really be

agat_sp_extract_sequences.pl --gff PHRA15019.head_n773.gff.txt --fasta PHRA15019.head_n773.reformat.fasta.txt -p -o proteins.fa -t exon

?

Edit: Just saw your usage examples in the help text for _extract_sequences.pl. Very helpful, thanks!

Juke34 commented 3 years ago

Yes if you wish to extract the mRNA, but do not forget that mRNA contains UTRs, so translating this sequence does not make any sense.

Neato-Nick commented 3 years ago

Thanks for answering my silly question. I guess translating a region that is, by definition, not translated doesn't make sense. Thanks!

NBISweden / AGAT

Missing premature stop codons #80