DuttonLab / kvasir

Searching for Horizontal Gene Transfer

A lot of strange warnings #20

Closed cvn001 closed 7 years ago

cvn001 commented 7 years ago

Dear developer,

When I use kv_import.py to import GenBank files, a lot of warnings are printed, like the one below:

WARNING:root:No locus_tag found for feature - this will cause issues

Does this have any serious consequences for the BLAST search? I downloaded the files from the RefSeq database, so I don't know where these warnings come from.

Best,

Xiangchen

kescobo commented 7 years ago

@cvn001 Yeah, sorry about that. In the most recent version, I've eliminated the scroll of warnings; only a single warning is generated now. I'll be updating PyPI tomorrow.

Can you link to an example RefSeq genome you're using that generates this warning? They should all have locus_tag features, e.g.:

     CDS             407..1747
                     /gene="dnaA"
                     /locus_tag="BA_0001" <=== this line
                     /old_locus_tag="BA0001"
                     /note="identified by similarity to EGAD:14548; match to
                     protein family HMM PF00308; match to protein family HMM
                     PF08299; match to protein family HMM TIGR00362"
                     /codon_start=1
                     /transl_table=11
                     /product="chromosomal replication initiator protein DnaA"
                     /protein_id="AAP24059.1"

It will not cause serious issues at this stage, though it may complicate your ability to find the genes in the original genome later.
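
To see in advance which entries lack a locus tag, one option is to scan the feature table of a GenBank file. The following is a minimal standard-library sketch, not part of kvasir. The names `iter_features` and `cds_without_locus_tag` are made up for illustration. It assumes the warning concerns CDS features, assumes the standard flat-file layout (feature keys in column 6, qualifiers in column 22), and records only the first line of a multi-line location. A full parser such as Biopython's `SeqIO` would be more robust.

```python
import re

# Feature keys start in column 6; qualifiers start in column 22 with "/".
FEATURE_RE = re.compile(r"^ {5}(\S+)\s+(\S+)")
QUALIFIER_RE = re.compile(r"^ {21}/(\w+)")


def iter_features(genbank_text):
    """Yield (key, location, qualifier_names) for each feature-table entry."""
    key = location = None
    qualifiers = set()
    in_sequence = False  # True between ORIGIN and the closing "//"
    for line in genbank_text.splitlines():
        if line.startswith("ORIGIN"):
            in_sequence = True
        elif line.startswith("//"):
            in_sequence = False
            continue
        if in_sequence:
            continue  # sequence lines can look like feature lines; skip them
        feature = FEATURE_RE.match(line)
        if feature:
            if key is not None:
                yield key, location, qualifiers
            key, location = feature.groups()
            qualifiers = set()
            continue
        qualifier = QUALIFIER_RE.match(line)
        if qualifier and key is not None:
            qualifiers.add(qualifier.group(1))
    if key is not None:
        yield key, location, qualifiers


def cds_without_locus_tag(genbank_text):
    """Return the locations of CDS features that have no /locus_tag qualifier."""
    return [
        location
        for key, location, qualifiers in iter_features(genbank_text)
        if key == "CDS" and "locus_tag" not in qualifiers
    ]
```

Calling `cds_without_locus_tag(open("sequence.gb").read())` returns a list of location strings; an empty list means every CDS in the file carries a locus tag.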

cvn001 commented 7 years ago

Dear Dr. Kevin S. Bonham,

Thanks for your reply. I am very glad to hear about the next release of this useful tool.

I also sent an email to your official address with another question; I think I can post it here as well.

I am very interested in the microbiome. I want to use Kvasir to detect HGT events among about one thousand bacterial genomes. As I have no experience running Kvasir, could you give me some information about your previous work on cheese-associated bacteria, such as memory usage, number of CPU cores, and run time? If you have any other suggestions, please share them with me. These would be very helpful for my preparation.

Best,

Xiangchen

kescobo commented 7 years ago

@cvn001 I ran the software for my ~150 genomes on my laptop (~3 year old macbook pro). I suspect that 1000 genomes might give my laptop some issues, but I've never benchmarked it on a remote server, sorry :-(

cvn001 commented 7 years ago

@kescobo Thanks for sharing. I shall try it on my remote server.

Best,

Xiangchen

cvn001 commented 7 years ago

@kescobo Sorry to disturb you again. I have finished the import step, but I failed to build the BLAST database with kv_blast. I got the error below:

BLAST Database creation error: FASTA-Reader: No residues given

I guess that there may be something wrong with my GenBank files.

I have written a Python script to find the problematic GenBank files and got 26 files with empty CDS. These files are consistent with the warnings in the attached log file from the first step.

import.txt

An example warning is shown below:

/home/bioinfo/anaconda2/lib/python2.7/site-packages/Bio/GenBank/__init__.py:1218: BiopythonParserWarning: Expected sequence length 4630065, found 1422462 (NC_014328.1). BiopythonParserWarning)

Could you please check it? Do you think I should remove these files? How can I remove them from the MongoDB database?

Best,

Xiangchen
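
The length mismatch in the warning above (declared 4630065, found 1422462) can also be checked without importing anything, by comparing the length on each `LOCUS` line with the number of residues actually present after `ORIGIN`. This is a rough standard-library sketch under that assumption. The function name `length_mismatches` is hypothetical, and it is not necessarily how the script mentioned above or Biopython performs the check.

```python
import re

# The LOCUS line carries the record name and the declared length in bp.
LOCUS_RE = re.compile(r"^LOCUS\s+(\S+)\s+(\d+) bp")


def length_mismatches(genbank_text):
    """Return (name, declared_length, found_length) for inconsistent records."""
    mismatches = []
    name = declared = None
    found = 0
    in_sequence = False
    for line in genbank_text.splitlines():
        locus = LOCUS_RE.match(line)
        if locus:
            name, declared = locus.group(1), int(locus.group(2))
            found = 0
            in_sequence = False
        elif line.startswith("ORIGIN"):
            in_sequence = True
        elif line.startswith("//"):
            # End of record: compare the counted residues with the LOCUS line.
            if name is not None and found != declared:
                mismatches.append((name, declared, found))
            name = declared = None
            in_sequence = False
        elif in_sequence:
            # Sequence lines mix position numbers, spaces, and residues.
            found += sum(ch.isalpha() for ch in line)
    # A file cut off mid-download has no closing "//"; check it as well.
    if name is not None and found != declared:
        mismatches.append((name, declared, found))
    return mismatches
```

Records that carry no `ORIGIN` section at all are reported with a found length of 0, which would match the "No residues given" symptom if such records reach the FASTA export.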

kescobo commented 7 years ago

@cvn001 Sorry for this trouble. I think I've fixed the version problem - but you'll need to uninstall the version of kvasir you currently have, and then reinstall. Assuming you're using pip:

$ pip uninstall kvasirHGT
...
$ pip install kvasirHGT
$ pip freeze | grep kvasirHGT
kvasirHGT==0.6.6

Make sure you have version 0.6.6.

Your inputs should now work without spitting errors. I just set up a new virtual environment and installed from scratch, downloaded one of the genomes in your list from here and got the following:

(test1) KEVINs-MBP-2:test1 ksb$ kv_import.py 2017-05-18 -i ~/Downloads/sequence.gb -v
2017-05-18 19:12:04,405 || INFO: ** Importing /Users/ksb/Downloads/sequence.gb **
Importing Solibacillus silvestris StLB046
(test1) KEVINs-MBP-2:test1 ksb$ 

I recommend starting over. Be sure to clear out your mongo database and/or start a new one so you don't have the stuff left over from the previous import.

As to your last problem, I'm not sure. If it's still happening with the latest version, please feel free to submit a new issue (don't continue this one).

cvn001 commented 7 years ago

@kescobo I have used the latest version (0.6.6) to import all my genomes. The attached log file seems to be fine.

import.txt

Next, I successfully used kv_blast.py to build the blast database.

However, when I try to use kv_blast.py to run the blastall job, the program seems to freeze after blasting the first genome. Everything in the log file is shown below:

2017-05-20 19:13:44,782 || INFO: blasting Streptomyces bingchenggensis BCW-1
2017-05-20 19:13:51,360 || INFO: Blasting all by all
2017-05-20 19:17:56,972 || INFO: Getting Blast Records

The status of mongodb and kv_blast is:

  PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
18266 lxc   20  0 20.1g  19g 6904 S 98.9 15.7 97:07.31 mongod
15412 lxc   20  0  405m  79m 4908 S  1.7  0.1  2:26.02 kv_blast.py

As you can see, the CPU usage of kv_blast.py is very low.

Although the size of the DB is increasing in bursts, no other information has been output so far. It has been running for half a day, so I wonder whether everything is still working normally.

cvn001 commented 7 years ago

@kescobo Finally, the second genome was blasted. As you can see:

2017-05-20 19:13:44,782 || INFO: blasting Streptomyces bingchenggensis BCW-1
2017-05-20 19:13:51,360 || INFO: Blasting all by all
2017-05-20 19:17:56,972 || INFO: Getting Blast Records
2017-05-20 21:09:27,881 || INFO: ---> 116555 blast records retrieved
2017-05-20 21:09:27,952 || INFO: blasting Yersinia pseudotuberculosis IP 32953
2017-05-20 21:09:34,490 || INFO: Blasting all by all
2017-05-20 21:10:21,663 || INFO: Getting Blast Records

That is about 2 hours for one genome. I am afraid it will take a very long time. ):
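
The per-genome time can be read off the log timestamps directly. The sketch below is only an illustration, not a kvasir feature: it assumes the log format shown in this thread (a timestamp, `||`, then `INFO: blasting <genome>`), and the function names `elapsed_per_genome` and `naive_total` are hypothetical. The extrapolation is deliberately naive, since genomes differ in size and the cost per genome need not stay constant.

```python
import re
from datetime import datetime, timedelta

# Matches only the lowercase "blasting <genome>" lines that start a new genome.
START_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) \|\| INFO: blasting (.+)$"
)


def elapsed_per_genome(log_text):
    """Return (genome, duration) for every genome followed by another start line."""
    starts = []
    for line in log_text.splitlines():
        match = START_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            starts.append((match.group(2), stamp))
    # The duration of a genome is the gap until the next genome starts.
    return [
        (name, next_stamp - stamp)
        for (name, stamp), (_, next_stamp) in zip(starts, starts[1:])
    ]


def naive_total(durations, n_genomes):
    """Extrapolate linearly from the mean observed duration (a rough guess only)."""
    if not durations:
        raise ValueError("need at least one completed genome")
    mean = sum(durations, timedelta()) / len(durations)
    return mean * n_genomes
```

For the log above, this gives 1:55:43.170000 for Streptomyces bingchenggensis BCW-1. Scaling that single observation to 1000 genomes would suggest roughly 80 days, which should be treated as an order-of-magnitude guess rather than a benchmark.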

kescobo commented 7 years ago

I'm not surprised with so many genomes. But as this issue is resolved, I'm going to close it. Please feel free to open another issue if something pops up.

cvn001 commented 7 years ago

OK, thanks a lot.