Russel88 / CRISPRCasTyper

CCTyper: Automatic detection and subtyping of CRISPR-Cas operons
https://typer.crispr.dk
MIT License

Meaning of asterisk at the end of protein sequence #38

Closed Smiriti-Gupta closed 1 year ago

Smiriti-Gupta commented 1 year ago

Below are two protein sequences taken from the protein.faa file produced by a CCTyper run: one ends with an asterisk and the other does not.

NODE_20793_length_1284_cov_1096.518308_2 # 431 # 1024 # -1 # ID=82_2;partial=00;start_type=GTG;rbs_motif=AGGA;rbs_spacer=5-10bp;gc_cont=0.577
MKKKILSLAVVAVFGVMTMGPVMAGEVDPATVPEKKQTTLKLYLTAKEAYDMKKAEGDKV
LLIDVRTPEEIQYVGNLGDMMDANIPYQFNDISGYDEKKKVYASSLNSNFVAEVEELVNK
RGLDKDSTIIVSCRSGDRSAVSANLLAKAGYTHVYSVFDGFEGDLSKDGRRSVNGWKNAG
LPWTYNMDKAKMYFILR*

NODE_26395_length_1052_cov_597.440321_1 # 3 # 1052 # -1 # ID=98_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.368
LRRKNINNMIDKIYPYIHKIIKKTFSYLTLPQQKSLALTISAFFDPPSFSLYNIASKLPL
DTSNRHKHKHLIRFLDKLLINDDFWKSYITTIFLLPHITSRKKFLTLLIDATTLKDDVWI
LSASISYENRAVPIYMELWEGVNQKYDYWARVIGFVRNMRKYLPDKFSYVIIADRGFQGE
RLPKEFKKLKLDYIIRIGENYHIKTKNGEEWRELSLLDDGKYNEVVLGKTNSIEGVNVIV
SSIKDAENKKHLKWYLMSSIKDMEKEEVVGLYAKRMWIEESFKDLKGKLRWEEYTEKLPK
FDRIKKMVIISGLSYGIQLSLGSSKQVVEQRSKGESIIRGLQNALNGVSV

An asterisk in a FASTA file generally means a stop codon. My questions are: Does the asterisk mean that the protein sequence is incomplete? Can the prediction that a given protein is a Cas protein be trusted if there is an asterisk at the end of the protein sequence? Can I estimate the size of the protein if there is an asterisk in the sequence?

Russel88 commented 1 year ago

The asterisk indicates a stop codon, yes. If you want to know whether your open reading frame is incomplete, look at the "partial=XX" field in the protein FASTA header: 00 means complete, 10 means the start is missing, 01 means the end is missing, and 11 means both the start and the end are missing. The asterisk has no influence on the alignment of sequences to the Cas models. It is of course harder to trust annotations of incomplete sequences, but it is not easy to give general guidelines as to when they can be trusted.
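
For anyone who wants to check this programmatically, here is a minimal Python sketch of reading the "partial=XX" field and the trailing asterisk from a record. It is not part of CCTyper: the function name `describe_orf` and the `PARTIAL_MEANING` table are made up for illustration, and it assumes headers formatted like the two examples in the question.

```python
import re

# Meaning of the two-digit "partial" code, as described in the reply above.
PARTIAL_MEANING = {
    "00": "complete",
    "10": "start missing",
    "01": "end missing",
    "11": "start and end missing",
}


def describe_orf(header, sequence):
    """Summarise one protein record (hypothetical helper, not a CCTyper API).

    header   -- the FASTA header line, with or without the leading ">"
    sequence -- the protein sequence, possibly wrapped over several lines
    """
    # Pull the two-digit code out of e.g. "...;partial=00;start_type=GTG;..."
    match = re.search(r"partial=([01]{2})", header)
    code = match.group(1) if match else None

    # Join wrapped sequence lines into one string of residues.
    residues = "".join(sequence.split())

    # A trailing "*" is the translated stop codon, not an amino acid.
    ends_with_stop = residues.endswith("*")
    length_aa = len(residues.rstrip("*"))

    return {
        "partial": code,
        "status": PARTIAL_MEANING.get(code, "unknown"),
        "ends_with_stop": ends_with_stop,
        "length_aa": length_aa,
    }


if __name__ == "__main__":
    # Shortened toy record, only to show the call; not real CCTyper output.
    info = describe_orf("toy_1 # 1 # 18 # 1 # ID=1_1;partial=00", "MKKLR*")
    print(info)
```

Applied to the two records in the question, the first (partial=00, trailing asterisk) would be reported as complete and the second (partial=11, no asterisk) as missing both ends. When either end is missing, the counted length is only a lower bound on the size of the real protein.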