NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit
GNU General Public License v3.0
447 stars 54 forks source link

agat_sp_extract_sequences.pl produced a shorter AA sequence #20

Closed mictadlo closed 4 years ago

mictadlo commented 4 years ago

Hi, I used agat_sp_complement_annotations.pl and got the for example this gene:

NbV1Ch03        transdecoder    gene    7020898 7094668 .       -       .       ID=NBlab03G03860
NbV1Ch03        transdecoder    mRNA    7020898 7094668 .       -       .       ID=NBlab03G03860.1;Parent=NBlab03G03860
NbV1Ch03        transdecoder    exon    7020898 7021339 .       -       .       ID=NBlab03G03860.1.exon25;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    exon    7022410 7022524 .       -       .       ID=NBlab03G03860.1.exon24;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    exon    7022595 7022904 .       -       .       ID=NBlab03G03860.1.exon23;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    exon    7031340 7031666 .       -       .       ID=NBlab03G03860.1.exon22;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    exon    7034155 7034427 .       -       .       ID=NBlab03G03860.1.exon21;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    exon    7034544 7035002 .       -       .       ID=NBlab03G03860.1.exon20;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    exon    7035940 7036149 .       -       .       ID=NBlab03G03860.1.exon19;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    exon    7036407 7036612 .       -       .       ID=NBlab03G03860.1.exon18;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    exon    7046750 7046844 .       -       .       ID=NBlab03G03860.1.exon17;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    exon    7047693 7048117 .       -       .       ID=NBlab03G03860.1.exon16;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    exon    7048191 7048465 .       -       .       ID=NBlab03G03860.1.exon15;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    exon    7048556 7048812 .       -       .       ID=NBlab03G03860.1.exon14;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    exon    7048972 7049178 .       -       .       ID=NBlab03G03860.1.exon13;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    exon    7049260 7049420 .       -       .       ID=NBlab03G03860.1.exon12;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    exon    7049566 7049758 .       -       .       ID=NBlab03G03860.1.exon11;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    exon    7059727 7059823 .       -       .       ID=NBlab03G03860.1.exon10;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    exon    7063558 7063696 .       -       .       ID=NBlab03G03860.1.exon9;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    exon    7072171 7072299 .       -       .       ID=NBlab03G03860.1.exon8;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    exon    7072381 7072515 .       -       .       ID=NBlab03G03860.1.exon7;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    exon    7072965 7073005 .       -       .       ID=NBlab03G03860.1.exon6;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    exon    7075150 7075222 .       -       .       ID=NBlab03G03860.1.exon5;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    exon    7075307 7075447 .       -       .       ID=NBlab03G03860.1.exon4;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    exon    7076193 7076303 .       -       .       ID=NBlab03G03860.1.exon3;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    exon    7087359 7087536 .       -       .       ID=NBlab03G03860.1.exon2;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    exon    7094461 7094668 .       -       .       ID=NBlab03G03860.1.exon1;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    CDS     7021186 7021339 .       -       1       ID=NBlab03G03860.1.cds23;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    CDS     7022410 7022524 .       -       2       ID=NBlab03G03860.1.cds22;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    CDS     7022595 7022904 .       -       0       ID=NBlab03G03860.1.cds21;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    CDS     7031340 7031666 .       -       0       ID=NBlab03G03860.1.cds20;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    CDS     7034155 7034427 .       -       0       ID=NBlab03G03860.1.cds19;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    CDS     7034544 7035002 .       -       0       ID=NBlab03G03860.1.cds18;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    CDS     7035940 7036149 .       -       0       ID=NBlab03G03860.1.cds17;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    CDS     7036407 7036612 .       -       2       ID=NBlab03G03860.1.cds16;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    CDS     7046750 7046844 .       -       1       ID=NBlab03G03860.1.cds15;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    CDS     7047693 7048117 .       -       0       ID=NBlab03G03860.1.cds14;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    CDS     7048191 7048465 .       -       2       ID=NBlab03G03860.1.cds13;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    CDS     7048556 7048812 .       -       1       ID=NBlab03G03860.1.cds12;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    CDS     7048972 7049178 .       -       1       ID=NBlab03G03860.1.cds11;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    CDS     7049260 7049420 .       -       0       ID=NBlab03G03860.1.cds10;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    CDS     7049566 7049758 .       -       1       ID=NBlab03G03860.1.cds9;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    CDS     7059727 7059823 .       -       2       ID=NBlab03G03860.1.cds8;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    CDS     7063558 7063696 .       -       0       ID=NBlab03G03860.1.cds7;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    CDS     7072171 7072299 .       -       0       ID=NBlab03G03860.1.cds6;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    CDS     7072381 7072515 .       -       0       ID=NBlab03G03860.1.cds5;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    CDS     7072965 7073005 .       -       2       ID=NBlab03G03860.1.cds4;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    CDS     7075150 7075222 .       -       0       ID=NBlab03G03860.1.cds3;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    CDS     7075307 7075447 .       -       0       ID=NBlab03G03860.1.cds2;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    CDS     7076193 7076207 .       -       0       ID=NBlab03G03860.1.cds1;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    five_prime_UTR  7076208 7076303 .       -       .       ID=NBlab03G03860.1.utr5p3;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    five_prime_UTR  7087359 7087536 .       -       .       ID=NBlab03G03860.1.utr5p2;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    five_prime_UTR  7094461 7094668 .       -       .       ID=NBlab03G03860.1.utr5p1;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    three_prime_UTR 7020898 7021185 .       -       .       ID=NBlab03G03860.1.utr3p1;Parent=NBlab03G03860.1

Next, I extracted amino sequece (1502 bases) with agat_sp_extract_sequences.pl --gff no_remark.gff3 -f NbV1ChF.fasta -p -o no_remark.AA.fasta and got the following sequence for the above gene:

>NBlab03G03860.1 gene=NBlab03G03860 seq_id=NbV1Ch03 type=cds
MEQYEVLVMTPQILLHNLSHCYIRIEFIALLIFDECHYAQVESDHPYAEIMKIFYKPDVV
KLPRIFGMTASPISGKGATVEGLETLLRSKVYSVEDKDELEQFVASPKVNVYYYGPGTGC
LTKAYSQKLEEIKHQCVMVLHKKAVDHSTLRNTKKMLKRLHGHLIFSLENLGVFGALQAS
CILLKGDRYERHQMVEADVNASDDSLCDRYLSQVATVFTSGCAKDGMNPDLTLVEVMKEP
YFSKKLLLLIGILSNFGVQPDMKCIIFVNRIVTARSLSYMLQHLKVLSSWKCGFLVGVHS
GLKSMSRKNTNIILNKFRSGELNLLVATKVGEEGLDIQTCCLVIRFDLPETVASFIQSRG
RARMPKSEYAFLVDSDNQRELNLIEHFSRNEARMNDEISSRKSCTAVIDFQENIYKVDMT
GATISSASSISLLHHYCSKLPRDEFFCPKPQFFYFDDIDGTICKLVLPSNAPMHQIVSAP
QSSIDAAKKDACLRACKSLHELGALTDYLLPDQADEDLIHVFSDSESSDGEDAREELHEM
IVPAAFKEPWTETESHVCLNSYYINFSPCPIDRVYKKFGLFLKAPLPQEAERMKLDLNLA
RGRSVETELIPSGATNFENNEVQLAEKFQKMFLKIILDRSEFISEFVSLEKQDYVDSASK
YYLLLPINLCGHNKISVDWELVRRCLSSPIFGTSVYAGNSEISKFDEQLQLANGSKSVHD
VANSLVYVPCKDTFFFISDVVKESNAYSIYKDSKNHVEHYYDTFDIHLLYPEQPLIKAKQ
LFCLDNLLRKKGYSELRDKEEHFVELPAEICQLKIIGFSKDIGSSLSLLPSIMHRLESLL
IAIELKSCLSASFPEGREVTIDHVLEALTTEKCNEPFSLERLEVLGDAFLKFAVGRHVFL
TYEAFDEGQLTRRRSNIVNNSYLYSIAVRSNLQAFIRDQSFYPNNFYAVGRPCPEICNKQ
TENSIHGQCGNVTDGAKTEVRCSKCHQWLRKKTIADIVEALVGAFVVDSGFKAAIAFLKW
IGIYTDFEESQVKSICAASKIFIPLADEIDIPAIENLLGYSFVHKGLLIQAFIHPSYNNH
GGGCYQRLEFLGDAALDYLITSYLYSVYPKLKPGQLTDLRSVSVNNNTFAAVAVHQSFHS
HILCDSSGLRESITRYVNFIGRPDSMKRLAEEPSCPKALGDLVESCMGAILLDTGFDLNR
AWHIMLSLLKPVMSFTRLQLNPKRELHELCQSYGWHLKFLASKKDSKYLVEAKVNGENVS
DAASALNINKKAAARMAAQQVHSSLKAQGYRRKSKSLEQVVKTAKKMEAKLIGYDEIPCV
LTARCNDVEKNEASESDRDLKVFHISEELARNCNFKLKPARKLAPEIAVRCNSEQTIMPN
GSNSDSKAQGGAINGSAKSILHEVCAANCWKPPRFECCKETGPSHLKEFTFRVVVEIEET
SRVIESCGAPRAKKKDAAEDAAEGALWFLKHEGYVFDN*

After loading the above GFF3 file to Apollo/jbrowse it gave me a different sequence (1661 bases) for the same gene.

LPVRHYRYFCHHTAAVMEGGDFENGTESPPSAATSPITEQLSALSLNGDIDSPVSVQKPEKDPRKIARKY
QMDLCKKALEENVVVYLGTGCGKTHIAVLLIYEMGQLIRKPQKSICVFLAPTVALVQQQAKVIEDSIDFK
VGTYCGKSKHLKSHEDWEKEMEQYEVLVMTPQILLHNLSHCYIRIEFIALLIFDECHYAQVESDHPYAEI
MKIFYKPDVVKLPRIFGMTASPISGKGATVEGLETLLRSKVYSVEDKDELEQFVASPKVNVYYYGPGTGC
LTKAYSQKLEEIKHQCVMVLHKKAVDHSTLRNTKKMLKRLHGHLIFSLENLGVFGALQASCILLKGDRYE
RHQMVEADVNASDDSLCDRYLSQVATVFTSGCAKDGMNPDLTLVEVMKEPYFSKKLLLLIGILSNFGVQP
DMKCIIFVNRIVTARSLSYMLQHLKVLSSWKCGFLVGVHSGLKSMSRKNTNIILNKFRSGELNLLVATKV
GEEGLDIQTCCLVIRFDLPETVASFIQSRGRARMPKSEYAFLVDSDNQRELNLIEHFSRNEARMNDEISS
RKSCTAVIDFQENIYKVDMTGATISSASSISLLHHYCSKLPRDEFFCPKPQFFYFDDIDGTICKLVLPSN
APMHQIVSAPQSSIDAAKKDACLRACKSLHELGALTDYLLPDQADEDLIHVFSDSESSDGEDAREELHEM
IVPAAFKEPWTETESHVCLNSYYINFSPCPIDRVYKKFGLFLKAPLPQEAERMKLDLNLARGRSVETELI
PSGATNFENNEVQLAEKFQKMFLKIILDRSEFISEFVSLEKQDYVDSASKYYLLLPINLCGHNKISVDWE
LVRRCLSSPIFGTSVYAGNSEISKFDEQLQLANGSKSVHDVANSLVYVPCKDTFFFISDVVKESNAYSIY
KDSKNHVEHYYDTFDIHLLYPEQPLIKAKQLFCLDNLLRKKGYSELRDKEEHFVELPAEICQLKIIGFSK
DIGSSLSLLPSIMHRLESLLIAIELKSCLSASFPEGREVTIDHVLEALTTEKCNEPFSLERLEVLGDAFL
KFAVGRHVFLTYEAFDEGQLTRRRSNIVNNSYLYSIAVRSNLQAFIRDQSFYPNNFYAVGRPCPEICNKQ
TENSIHGQCGNVTDGAKTEVRCSKCHQWLRKKTIADIVEALVGAFVVDSGFKAAIAFLKWIGIYTDFEES
QVKSICAASKIFIPLADEIDIPAIENLLGYSFVHKGLLIQAFIHPSYNNHGGGCYQRLEFLGDAALDYLI
TSYLYSVYPKLKPGQLTDLRSVSVNNNTFAAVAVHQSFHSHILCDSSGLRESITRYVNFIGRPDSMKRLA
EEPSCPKALGDLVESCMGAILLDTGFDLNRAWHIMLSLLKPVMSFTRLQLNPKRELHELCQSYGWHLKFL
ASKKDSKYLVEAKVNGENVSDAASALNINKKAAARMAAQQVHSSLKAQGYRRKSKSLEQVVKTAKKMEAK
LIGYDEIPCVLTARCNDVEKNEASESDRDLKVFHISEELARNCNFKLKPARKLAPEIAVRCNSEQTIMPN
GSNSDSKAQGGAINGSAKSILHEVCAANCWKPPRFECCKETGPSHLKEFTFRVVVEIEETSRVIESCGAP
RAKKKDAAEDAAEGALWFLKHEGYVFDN

Why agat_sp_extract_sequences.pl extacted a smaller smaller sequcen that Apollo?

What did I miss?

Thank you in advance,

Juke34 commented 4 years ago

AGAT extracts the sequence defined in the GFF by the CDS features, while jbrowse computes on the fly the longest ORF within the exon features. Using the codon table 1 L and M are a correct start codon. So, Apollo extended the 5' side as most as possible. It added LPVRHYRYFCHHTAAVMEGGDFENGTESPPSAATSPITEQLSALSLNGDIDSPVSVQKPEKDPRKIARKY QMDLCKKALEENVVVYLGTGCGKTHIAVLLIYEMGQLIRKPQKSICVFLAPTVALVQQQAKVIEDSIDFK VGTYCGKSKHLKSHEDWEKE but it is not what is defined by the CDS in your gff file.

mictadlo commented 4 years ago

Thank you for your explanations. I used blast2genomegff.py from genomeGTFtools to map SwissProt proteins to the above gene. However, it appears that the mapping is shorter:

Screen Shot 2020-02-20 at 11 18 25 AM

Additionally, I noticed that transcoder predicted from the StringTie output 25 exons but only 23 CDS.

By any chance, do you know how is it possible to get the same results as Jbrowse and StringTie/Transdecoder?

Thank you in advance,

Michal

Juke34 commented 4 years ago

The proteins mapping is shorter because does not contain UTRs that are not translated. UTRs are the black exons in the above gene model. If you don't agree with annotation from your GFF file, and wish to have the longest ORF within the exons as done automatically by Jbrowse you could use this script: agat_sp_fix_longest_ORF.pl.
But I don't see any problem from what you show. Based on you protein evidence, the GFF prediction is good and what you get out from Jbrowse sounds wrong (it's an overprediction of the beginning of the gene, at least no evidence support it)

mictadlo commented 4 years ago

Thank you for your response. By any chance, do you know why transcoder predicted from the StringTie output 25 exons but only 23 CDS?

Thank you in advance,

Michal

Juke34 commented 4 years ago

Because two exons are pure UTRs, look at coordinates

NbV1Ch03        transdecoder    exon    7087359 7087536 .       -       .       ID=NBlab03G03860.1.exon2;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    exon    7094461 7094668 .       -       .       ID=NBlab03G03860.1.exon1;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    five_prime_UTR  7087359 7087536 .       -       .       ID=NBlab03G03860.1.utr5p2;Parent=NBlab03G03860.1
NbV1Ch03        transdecoder    five_prime_UTR  7094461 7094668 .       -       .       ID=NBlab03G03860.1.utr5p1;Parent=NBlab03G03860.1
mictadlo commented 4 years ago

Thank you for your explanation.

mictadlo commented 4 years ago

Hi @nathandunn, How is it possible to prevent that Apollo extends the 5' side as most as possible?

Thank you in advance,

Michal

nathandunn commented 4 years ago

@mictadlo / @Juke34 Just trying to clarify the process.

JBrowse will just provide the evidence and it should exactly match what you put in, but I'm not an expert on that.

If you are creating Apollo annotations, you are right, it will create a longest ORF. How are you loading them?

Individually, there is no way, but if you ALWAYS want them preserved you can use the use_cds_for_new_transcripts specified here: https://genomearchitect.readthedocs.io/en/latest/Configure.html#main-configuration

If you are loading via the web services or add_features_from_gff3_to_annotations.pl there is an option to preserve the CDS by passing -X (this creates a use_cds option): https://github.com/GMOD/Apollo/blob/develop/tools/data/add_features_from_gff3_to_annotations.pl

If you are using https://pypi.org/project/apollo/ . . .. I don't think they have built in that option yet, though I'm sure they would be happy to have you add it.

Juke34 commented 4 years ago

Thank you for your comments and clarification @nathandunn. @mictadlo as the original question has been answered I close the issue. To resume: agat_sp_extract_sequences.pl produces exactly the sequence described by the gff3 file. e.g all CDS addded together (end-start+1) gives 4437 nucletotides, divided by 3 = 1479 amino acids (counting the stop codon). So exactly what you have in the output.