Closed mictadlo closed 4 years ago
AGAT extracts the sequence defined in the GFF by the CDS features, while JBrowse computes on the fly the longest ORF within the exon features. With codon table 1, both L and M are valid start codons, so Apollo extended the 5' side as far as possible.
It added LPVRHYRYFCHHTAAVMEGGDFENGTESPPSAATSPITEQLSALSLNGDIDSPVSVQKPEKDPRKIARKY QMDLCKKALEENVVVYLGTGCGKTHIAVLLIYEMGQLIRKPQKSICVFLAPTVALVQQQAKVIEDSIDFK VGTYCGKSKHLKSHEDWEKE
but it is not what is defined by the CDS in your gff file.
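To illustrate why a longest-ORF search can push the start upstream of the annotated CDS, here is a toy sketch. It is not Apollo's or JBrowse's actual code; the function name, the example sequence, and the choice of CTG/TTG (both L) alongside ATG (M) as start codons are assumptions made for illustration.

```python
# Toy illustration only: walk upstream from the annotated CDS start,
# codon by codon, and keep the most upstream in-frame start codon
# that is reachable without crossing a stop codon.
STOPS = {"TAA", "TAG", "TGA"}

def extend_start_upstream(mrna, cds_start, start_codons):
    """Return the 0-based index of the most upstream in-frame start codon."""
    best = cds_start
    pos = cds_start - 3
    while pos >= 0:
        codon = mrna[pos:pos + 3]
        if codon in STOPS:
            break  # an in-frame stop ends the open reading frame
        if codon in start_codons:
            best = pos
        pos -= 3
    return best

# Hypothetical spliced transcript: 5' UTR "CC", then CTG GCA ATG AAA TAA.
mrna = "CC" + "CTG" + "GCA" + "ATG" + "AAA" + "TAA"
annotated_start = 8  # index of the ATG, i.e. the start given by the CDS feature

# Only ATG allowed: the annotated start is kept.
print(extend_start_upstream(mrna, annotated_start, {"ATG"}))
# ATG, CTG and TTG allowed: the start moves upstream to the CTG.
print(extend_start_upstream(mrna, annotated_start, {"ATG", "CTG", "TTG"}))
```

The extra residues at the N-terminus come entirely from the second case: nothing in the GFF changes, only the rule used to pick the start.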
Thank you for your explanations. I used blast2genomegff.py from genomeGTFtools to map SwissProt proteins to the above gene. However, it appears that the mapping is shorter:
Additionally, I noticed that TransDecoder predicted 25 exons but only 23 CDS from the StringTie output.
By any chance, do you know how to get the same results as JBrowse and StringTie/TransDecoder?
Thank you in advance,
Michal
The protein mapping is shorter because it does not contain UTRs, which are not translated. UTRs are the black exons in the above gene model.
If you don't agree with the annotation in your GFF file and wish to have the longest ORF within the exons, as done automatically by JBrowse, you could use this script: agat_sp_fix_longest_ORF.pl.
But I don't see any problem in what you show. Based on your protein evidence, the GFF prediction is good and what you get out of JBrowse sounds wrong (it's an overprediction of the beginning of the gene; at least no evidence supports it).
Thank you for your response. By any chance, do you know why TransDecoder predicted 25 exons but only 23 CDS from the StringTie output?
Thank you in advance,
Michal
Because two exons are pure UTRs. Look at the coordinates:
NbV1Ch03 transdecoder exon 7087359 7087536 . - . ID=NBlab03G03860.1.exon2;Parent=NBlab03G03860.1
NbV1Ch03 transdecoder exon 7094461 7094668 . - . ID=NBlab03G03860.1.exon1;Parent=NBlab03G03860.1
NbV1Ch03 transdecoder five_prime_UTR 7087359 7087536 . - . ID=NBlab03G03860.1.utr5p2;Parent=NBlab03G03860.1
NbV1Ch03 transdecoder five_prime_UTR 7094461 7094668 . - . ID=NBlab03G03860.1.utr5p1;Parent=NBlab03G03860.1
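A quick way to check this on a whole file is to look for exons that are entirely covered by a UTR feature of the same transcript. The sketch below is a minimal, assumed helper (the function names are hypothetical, not part of AGAT or TransDecoder) run on the four lines quoted above, with tabs restored as the column separator.

```python
def parse_feature(line):
    """Split one GFF3 line into (type, start, end, attributes dict)."""
    cols = line.rstrip("\n").split("\t")
    attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
    return cols[2], int(cols[3]), int(cols[4]), attrs

def pure_utr_exons(gff_lines):
    """Return IDs of exons fully covered by a UTR of the same parent."""
    exons, utrs = [], []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        ftype, start, end, attrs = parse_feature(line)
        if ftype == "exon":
            exons.append((start, end, attrs))
        elif ftype in ("five_prime_UTR", "three_prime_UTR"):
            utrs.append((start, end, attrs))
    result = []
    for e_start, e_end, e_attrs in exons:
        for u_start, u_end, u_attrs in utrs:
            same_parent = u_attrs.get("Parent") == e_attrs.get("Parent")
            if same_parent and u_start <= e_start and e_end <= u_end:
                result.append(e_attrs.get("ID"))
                break
    return result

# The four features quoted in this thread.
tx = "NBlab03G03860.1"
rows = [
    ("exon", 7087359, 7087536, "exon2"),
    ("exon", 7094461, 7094668, "exon1"),
    ("five_prime_UTR", 7087359, 7087536, "utr5p2"),
    ("five_prime_UTR", 7094461, 7094668, "utr5p1"),
]
gff = [
    "\t".join(["NbV1Ch03", "transdecoder", ftype, str(s), str(e),
               ".", "-", ".", f"ID={tx}.{suffix};Parent={tx}"])
    for ftype, s, e, suffix in rows
]

print(pure_utr_exons(gff))
```

Each exon reported this way contributes no CDS feature, which accounts for the difference between the exon count and the CDS count.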
Thank you for your explanation.
Hi @nathandunn, how can I prevent Apollo from extending the 5' side as far as possible?
Thank you in advance,
Michal
@mictadlo / @Juke34 Just trying to clarify the process.
JBrowse will just provide the evidence and it should exactly match what you put in, but I'm not an expert on that.
If you are creating Apollo annotations, you are right, it will create a longest ORF. How are you loading them?
Individually, there is no way, but if you ALWAYS want them preserved you can use the use_cds_for_new_transcripts option specified here: https://genomearchitect.readthedocs.io/en/latest/Configure.html#main-configuration
If you are loading via the web services or add_features_from_gff3_to_annotations.pl, there is an option to preserve the CDS by passing -X (this creates a use_cds option): https://github.com/GMOD/Apollo/blob/develop/tools/data/add_features_from_gff3_to_annotations.pl
If you are using https://pypi.org/project/apollo/ ... I don't think they have built in that option yet, though I'm sure they would be happy to have you add it.
Thank you for your comments and clarification @nathandunn. @mictadlo, as the original question has been answered, I am closing the issue. To summarize: agat_sp_extract_sequences.pl produces exactly the sequence described by the GFF3 file. E.g., all CDS added together (end - start + 1) give 4437 nucleotides, which divided by 3 is 1479 amino acids (counting the stop codon). So exactly what you have in the output.
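The arithmetic in that summary can be reproduced on any GFF3 file with a few lines of Python. This is a minimal sketch, not part of AGAT; the function name is hypothetical and the toy coordinates are made up for illustration (the real CDS lines for this gene are not shown in the thread).

```python
def cds_length(gff_lines, transcript_id):
    """Sum (end - start + 1) over all CDS features of one transcript.

    Returns (total nucleotides, number of codons).
    """
    total = 0
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        if transcript_id in attrs.get("Parent", "").split(","):
            total += int(cols[4]) - int(cols[3]) + 1
    return total, total // 3

# Hypothetical toy transcript with two CDS segments of 100 and 50 nt.
toy = [
    "chr1\ttoy\texon\t51\t200\t.\t+\t.\tID=tx1.exon1;Parent=tx1",
    "chr1\ttoy\tCDS\t101\t200\t.\t+\t0\tID=tx1.cds1;Parent=tx1",
    "chr1\ttoy\tCDS\t301\t350\t.\t+\t2\tID=tx1.cds2;Parent=tx1",
]
print(cds_length(toy, "tx1"))

# The figures quoted above: 4437 nucleotides of CDS give 1479 codons,
# the last of which is the stop codon.
print(4437 // 3, 4437 % 3)
```

If the remainder is not zero, the CDS features of the transcript do not add up to a whole number of codons, which is worth checking before comparing protein lengths.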
Hi, I used agat_sp_complement_annotations.pl and got, for example, this gene:
Next, I extracted the amino acid sequence (1502 bases) with
agat_sp_extract_sequences.pl --gff no_remark.gff3 -f NbV1ChF.fasta -p -o no_remark.AA.fasta
and got the following sequence for the above gene:
After loading the above GFF3 file into Apollo/JBrowse, it gave me a different sequence (1661 bases) for the same gene.
Why did agat_sp_extract_sequences.pl extract a shorter sequence than Apollo? What did I miss?
Thank you in advance,