RiversDong / CasLocusAnno

CasLocusAnno, annotating Cas proteins, cas loci and their corresponding (sub)types

Checking NC_003071 error #1

ipetrushin closed 5 years ago

ipetrushin commented 5 years ago

BLAST options error: File /home/teacher/CasLocusAnno/temp/NC_003071.7-2chr.fasta.neighbor is empty
Traceback (most recent call last):
  File "CasLocusAnno.py", line 558, in <module>
  File "CasLocusAnno.py", line 542, in main
  File "CasLocusAnno.py", line 358, in merge_trim
IndexError: list index out of range

The source code is not available, so I cannot investigate the cause any further. The test files were processed successfully.

RiversDong commented 5 years ago

Howdy Ivan,

I have fixed an error in CasLocusAnno, and it can now annotate NC_003071 without errors (please download the latest version). Please note that, in the standalone version, the input file should contain the whole-protein sequences of NC_003071 rather than its genome nucleotide sequence.

I annotated the Cas proteins in NC_003071 that you mentioned using CasLocusAnno. Several Cas proteins are listed below (they can be found in the .anno1 and .anno2 files):

lcl|NC_003071.7_prot_NP_181756.1_6441 1.68e-09 cd09639.sr cas3
lcl|NC_003071.7_prot_NP_850447.1_7190 9.51e-09 c2c2.sr c2c2
lcl|NC_003071.7_prot_NP_178818.1_1035 1.08e-13 COG1203.sr cas3
lcl|NC_003071.7_prot_NP_001031426.1_3475 1.51e-12 cd06127.sr DEDDh

lcl|NC_003071.7_prot_NP_182150.1_7208 0.003 cd09739.sr cas6f
lcl|NC_003071.7_prot_NP_001324382.1_3456 0.004 mkCas0117.sr cmr5gr11
lcl|NC_003071.7_prot_NP_001324383.1_3455 7.15e-05 mkCas0131.sr cas8b5
lcl|NC_003071.7_prot_NP_001323525.1_7205 0.002 Cas14b.sr Cas14 protein belonging to Cas14b
lcl|NC_003071.7_prot_NP_850090.1_3476 3.32e-14 cd06127.sr DEDDh
lcl|NC_003071.7_prot_NP_566069.1_7206 0.009 mkCas0158.sr cas8b7

However, these are not reported as a Cas locus, because the initial locus is removed by our MCCS procedure. The attachment contains the input whole-protein sequences of NC_003071 and the annotation results: NC_003071.7.zip

ipetrushin commented 5 years ago

Dear Dong! Thanks for the fix. How can I manually check the detected proteins against the profiles in the .sr files? Following your suggestion, should I first find all ORFs in the nuclear genome and obtain their protein products before running CasLocusAnno?

RiversDong commented 5 years ago

Dear Ivan,

Thanks for reporting the error.

Most bacteria are well annotated in NCBI, so it is easy to obtain their whole-protein sequences, which can be downloaded from NCBI. One resource is https://www.ncbi.nlm.nih.gov/genome/browse/#!/overview/, where you can look up bacteria and their FTP links. All related information for a bacterium, such as the genome nucleotide sequence and the whole-protein sequences for each chromosome, can be found there. Another resource, ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/, also stores some bacterial data, although it is no longer updated.

Of the three output files, the two ending in “.anno1” and “.anno2” provide the Cas protein information. You can manually check the e-value of each potential Cas protein against its .sr profile. Generally, the smaller the e-value, the more similar the potential Cas protein is to the .sr profile. The remaining output file, which does not end in “.anno1” or “.anno2”, lists each Cas locus and its members.
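As a rough illustration of checking e-values by hand, the sketch below parses hit lines and keeps only those under a cutoff. It assumes the whitespace-separated layout quoted earlier in this thread (protein ID, e-value, .sr profile, annotation); the real .anno1/.anno2 layout may differ. The names `CasHit`, `parse_hit` and `strong_hits` and the 1e-5 cutoff are hypothetical and are not part of CasLocusAnno.

```python
from typing import Iterable, List, NamedTuple


class CasHit(NamedTuple):
    protein_id: str
    evalue: float
    profile: str      # name of the matching .sr profile
    annotation: str   # Cas family label, may contain spaces


def parse_hit(line: str) -> CasHit:
    # Split into at most four fields so multi-word annotations stay intact.
    protein_id, evalue, profile, annotation = line.split(None, 3)
    return CasHit(protein_id, float(evalue), profile, annotation.strip())


def strong_hits(lines: Iterable[str], max_evalue: float = 1e-5) -> List[CasHit]:
    # The cutoff is an illustrative assumption, not a CasLocusAnno default.
    hits = [parse_hit(line) for line in lines if line.strip()]
    kept = [hit for hit in hits if hit.evalue <= max_evalue]
    return sorted(kept, key=lambda hit: hit.evalue)


# Sample lines copied from the hits quoted in this thread.
sample = [
    "lcl|NC_003071.7_prot_NP_181756.1_6441 1.68e-09 cd09639.sr cas3",
    "lcl|NC_003071.7_prot_NP_182150.1_7208 0.003 cd09739.sr cas6f",
    "lcl|NC_003071.7_prot_NP_001323525.1_7205 0.002 Cas14b.sr Cas14 protein belonging to Cas14b",
    "lcl|NC_003071.7_prot_NP_850090.1_3476 3.32e-14 cd06127.sr DEDDh",
]

for hit in strong_hits(sample):
    print(hit.protein_id, hit.evalue, hit.profile, hit.annotation)
```

Sorting by e-value puts the most profile-like candidates first, which makes it easier to decide which hits deserve a closer manual look.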

The standalone version of CasLocusAnno accepts only the whole-protein sequences of a chromosome, so the protein products must be known before you run the standalone annotation. The web-based version, however, also accepts a bacterial genome nucleotide sequence. For a nucleotide submission, the web version first identifies the protein sequences with ZCURVE3.0 (http://cefg.uestc.cn/zcurve/download.php), and CasLocusAnno then annotates Cas proteins based on the ZCURVE results.
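Since the standalone version expects protein sequences, a quick sanity check on the input FASTA may help avoid feeding it a nucleotide genome by mistake. The sketch below is only a heuristic under stated assumptions: the function name `looks_like_protein` and the 0.9 cutoff are hypothetical, and the check is not something CasLocusAnno itself performs.

```python
def looks_like_protein(fasta_text: str, max_nucleotide_fraction: float = 0.9) -> bool:
    """Guess whether FASTA text holds protein rather than nucleotide sequences."""
    # Concatenate all sequence lines, skipping headers and blank lines.
    residues = "".join(
        line.strip()
        for line in fasta_text.splitlines()
        if line.strip() and not line.startswith(">")
    ).upper()
    if not residues:
        raise ValueError("no sequence data found")
    # Nucleotide files consist almost entirely of A, C, G, T/U and N.
    nucleotide_like = sum(residues.count(base) for base in "ACGTUN")
    return nucleotide_like / len(residues) < max_nucleotide_fraction


protein_fasta = ">p1\nMKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ\n"
genome_fasta = ">chr\nACGTACGTNNACGT\n"

print(looks_like_protein(protein_fasta))
print(looks_like_protein(genome_fasta))
```

A file that fails this check would first need gene prediction and translation (for example with ZCURVE, as the web version does) before it can be used with the standalone tool.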

Best