Closed ipetrushin closed 5 years ago
Howdy Ivan,
I have fixed an error in CasLocusAnno. It can now annotate NC_003071 without errors (please download the latest version). Please note that in the standalone version the input file should contain the whole-protein sequences of NC_003071, not its genome nucleotide sequence.
I annotated the Cas proteins in NC_003071 that you mentioned using CasLocusAnno. Several Cas proteins are listed below (they can be found in the .anno1 and .anno2 files; the columns are protein ID, e-value, .sr profile, and annotation):
lcl|NC_003071.7_prot_NP_181756.1_6441 1.68e-09 cd09639.sr cas3
lcl|NC_003071.7_prot_NP_850447.1_7190 9.51e-09 c2c2.sr c2c2
lcl|NC_003071.7_prot_NP_178818.1_1035 1.08e-13 COG1203.sr cas3
lcl|NC_003071.7_prot_NP_001031426.1_3475 1.51e-12 cd06127.sr DEDDh
lcl|NC_003071.7_prot_NP_182150.1_7208 0.003 cd09739.sr cas6f
lcl|NC_003071.7_prot_NP_001324382.1_3456 0.004 mkCas0117.sr cmr5gr11
lcl|NC_003071.7_prot_NP_001324383.1_3455 7.15e-05 mkCas0131.sr cas8b5
lcl|NC_003071.7_prot_NP_001323525.1_7205 0.002 Cas14b.sr Cas14 protein belonging to Cas14b
lcl|NC_003071.7_prot_NP_850090.1_3476 3.32e-14 cd06127.sr DEDDh
lcl|NC_003071.7_prot_NP_566069.1_7206 0.009 mkCas0158.sr cas8b7
However, these are not reported as a Cas locus, because the initial locus is removed by our MCCS procedure. The attachment contains the whole-protein sequences of NC_003071 used as input and the annotation results. NC_003071.7.zip
Dear Dong! Thanks for the fix. How can I manually check the detected proteins against the profiles in the .sr files? According to your suggestion, should I find all ORFs in the nuclear genome and get their products before using CasLocusAnno?
Dear Ivan,
Thanks for reporting the error.
Most bacteria are well annotated in NCBI, so it is easy to obtain their whole-protein sequences. Bacterial proteins can be downloaded from NCBI. One resource is https://www.ncbi.nlm.nih.gov/genome/browse/#!/overview/, where you can look up bacteria and their FTP links. All related information for a bacterium, such as the genome nucleotide sequences and the whole-protein sequences of each chromosome, can be found there. Another resource, ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/, also stores some bacterial data, but it is no longer updated.
Of the three output files, the two ending in “.anno1” and “.anno2” provide the Cas protein information. You can manually check the e-value of each potential Cas protein against its .sr file. Generally, the smaller the e-value, the more similar the potential Cas protein is to the .sr profile. The third output file, which does not end in “.anno1” or “.anno2”, lists each Cas locus and its member proteins.
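The manual e-value check can be sketched in Python as follows. This is only an illustration: it assumes that each line of an .anno1/.anno2 file has the four whitespace-separated columns seen in the hit list earlier in this thread (protein ID, e-value, .sr profile, free-text annotation), which may not match the real file layout exactly. The names `Hit`, `parse_hit`, and `strong_hits`, and the cutoff of 1e-5, are hypothetical and not part of CasLocusAnno.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    protein_id: str
    evalue: float
    profile: str
    annotation: str


def parse_hit(line: str) -> Hit:
    """Parse one hit line (assumed layout: ID, e-value, profile, annotation)."""
    # Split at most three times so a multi-word annotation stays in one piece.
    protein_id, evalue, profile, annotation = line.split(None, 3)
    return Hit(protein_id, float(evalue), profile, annotation.strip())


def strong_hits(lines: Iterable[str], max_evalue: float = 1e-5) -> List[Hit]:
    """Keep hits at or below an (arbitrary, illustrative) e-value cutoff, best first."""
    hits = [parse_hit(line) for line in lines if line.strip()]
    return sorted((h for h in hits if h.evalue <= max_evalue), key=lambda h: h.evalue)


# Four of the hits reported in this thread.
sample = [
    "lcl|NC_003071.7_prot_NP_181756.1_6441 1.68e-09 cd09639.sr cas3",
    "lcl|NC_003071.7_prot_NP_182150.1_7208 0.003 cd09739.sr cas6f",
    "lcl|NC_003071.7_prot_NP_001323525.1_7205 0.002 Cas14b.sr Cas14 protein belonging to Cas14b",
    "lcl|NC_003071.7_prot_NP_850090.1_3476 3.32e-14 cd06127.sr DEDDh",
]

for hit in strong_hits(sample):
    print(hit.protein_id, hit.evalue, hit.profile, hit.annotation)
```

Sorting by e-value puts the most convincing matches first, so weak hits such as those with e-values around 0.003 are easy to set aside for closer inspection.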
The standalone version of CasLocusAnno accepts only the whole-protein sequences of a chromosome. Thus, the protein products must be known before running the annotation with the standalone version. The web-based version, however, also accepts a bacterial genome nucleotide sequence. For a genome nucleotide sequence submitted to the web-based version, the protein sequences are first identified by ZCURVE3.0 (http://cefg.uestc.cn/zcurve/download.php), and CasLocusAnno then annotates Cas proteins based on the ZCURVE results.
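Because the standalone version expects protein sequences, it may help to check the input FASTA before running it. The sketch below is a rough heuristic, not something CasLocusAnno does itself: it treats a file as nucleotide if nearly all sequence characters are A, C, G, T, U, or N. The function name `looks_like_protein` and the 0.9 threshold are hypothetical choices for illustration.

```python
def looks_like_protein(fasta_text: str, max_nucleotide_fraction: float = 0.9) -> bool:
    """Guess whether FASTA text holds protein rather than nucleotide sequences."""
    # Concatenate all sequence lines, skipping headers (">...") and blank lines.
    residues = "".join(
        line.strip()
        for line in fasta_text.splitlines()
        if line.strip() and not line.startswith(">")
    ).upper()
    if not residues:
        raise ValueError("no sequence data found")
    # Count characters that are also valid nucleotide codes.
    nucleotide_like = sum(residues.count(base) for base in "ACGTUN")
    return nucleotide_like / len(residues) < max_nucleotide_fraction


protein_fasta = ">example_protein\nMKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ\n"
genome_fasta = ">example_genome\nATGCGTACGTTAGC\nGGCTAACGTTAGCA\n"

print(looks_like_protein(protein_fasta))
print(looks_like_protein(genome_fasta))
```

A very short or low-complexity protein could fool this check, so it is best used as a quick sanity test rather than a strict validator.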
Best
BLAST options error: File /home/teacher/CasLocusAnno/temp/NC_003071.7-2chr.fasta.neighbor is empty
Traceback (most recent call last):
  File "CasLocusAnno.py", line 558, in
  File "CasLocusAnno.py", line 542, in main
  File "CasLocusAnno.py", line 358, in merge_trim
IndexError: list index out of range
No source code is available to investigate the cause further. The test files were processed successfully.
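When reproducing a failure like this one, a small pre-check on intermediate files can make the problem visible before a later step fails with a less informative error. The sketch below assumes, without confirmation from the CasLocusAnno source, that the IndexError follows from the empty .neighbor file named in the BLAST message. The helper `require_non_empty` is hypothetical and not part of CasLocusAnno.

```python
import os


def require_non_empty(path: str) -> str:
    """Return path if it is an existing, non-empty file; raise otherwise."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"{path} does not exist")
    if os.path.getsize(path) == 0:
        # An empty intermediate file usually means an earlier step produced no output.
        raise ValueError(f"{path} is empty; check the step that should have written it")
    return path
```

Calling such a check on the file in the temp directory right after the step that writes it would show whether that step produced any output for this input.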