Open FFBrito opened 9 years ago
Hey Frank,
I'm not part of the team at LANL, but I've been wrestling with building a custom db myself for the past couple of days. The --genbank and --genomes options each take a .txt file. Each .txt file contains a list of the filenames ending in .gbk from ftp.ncbi.nih.gov/genomes/Viruses (for --genomes) or ftp.ncbi.nih.gov/genbank/genomes/Viruses (for --genbank). This should get you the necessary files to continue on to building the xml file and then your database.
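A minimal sketch of building such a list file, assuming the .gbk files have already been downloaded into a local directory. It is written in Python and is not part of GOTTCHA; the function name `write_gbk_list` is hypothetical, and writing absolute paths (one per line) is an assumption about what the script accepts, not something the README confirms.

```python
from pathlib import Path


def write_gbk_list(genome_dir, list_file):
    """Write the path of every .gbk file under genome_dir to list_file, one per line.

    Assumption: the list file holds one absolute path per line.
    """
    # Collect the paths first so the output file itself is never picked up.
    paths = sorted(str(p.resolve()) for p in Path(genome_dir).rglob("*.gbk"))
    Path(list_file).write_text("".join(p + "\n" for p in paths))
    return paths


# Example call (paths are placeholders):
# write_gbk_list("/path/Viruses", "/path/genomes.txt")
```

The resulting file would then be passed as --genomes=/path/genomes.txt (or --genbank=...).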
I'm currently stuck at the gottcha_db.pl part, with the following error:
...
Tues 06/02/2015(15:51:17) ------------------------------------------------------------------
Tues 06/02/2015(15:51:17) Importing XML file...done.
Not a HASH reference at gottcha_db.pl line 1573.
Command exited with non-zero status 255
Any idea what that 'Not a HASH reference' means? I checked in the actual code and can't make much sense of it.
Thanks, Robert
Hi Francisco,
Sorry for my late reply. We have been short-handed for a while. I just pushed two fixes to GitHub. Let me know how it works.
Thanks, Paul
Hey Robert,
If you follow your own link (ftp.ncbi.nih.gov/genbank/genomes/Viruses), you'll find that it doesn't exist, which is exactly why I'm having problems with the script. I can generate the files just fine if I only use bacterial genomes, but when it comes to viruses, I get an infinite loop.
Hi Paul,
I will try it now and give you some feedback as soon as I have results.
Thank you, Francisco
Hello again. The same thing happened. Here's the command and log:
perl mkGottchaTaxTree.pl --names=/path/names.dmp --nodes=/path/nodes.dmp --genomes=/path/genomes.txt --genbank=/path/genbank.txt --gi2taxid=/path/gi_taxid_nucl.dmp --threads=8
genomes.txt includes all the viral genomes and one bacterial file, added to check whether the problem was somehow specific to viruses. genbank.txt includes only one bacterial gbk entry. There's no viral equivalent for this file, either on the ftp or by querying directly using eutils. Since nothing is mentioned in the readme, I assume that's expected.
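Since mkGottchaTaxTree.pl reports listed files that it cannot find on disk, a quick pre-flight check of both list files may help narrow things down before a long run. This is a sketch in Python, not part of GOTTCHA; the function name `missing_from_list` is hypothetical, and it assumes each non-empty line of the list file is a file path.

```python
import os


def missing_from_list(list_file):
    """Return the entries of list_file that do not exist as regular files.

    Assumption: one path per line; blank lines are ignored. Relative paths
    are resolved against the current working directory.
    """
    missing = []
    with open(list_file) as handle:
        for line in handle:
            path = line.strip()
            if path and not os.path.isfile(path):
                missing.append(path)
    return missing


# Example call (path is a placeholder):
# for path in missing_from_list("/path/genomes.txt"):
#     print("not on disk:", path)
```

If every entry is reported missing, the entries are probably bare filenames or relative paths rather than paths that resolve from where the script is run.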
log:
-> Parsing NODES file "/raid/user2/francisco/GOTTCHA-files/taxdump/nodes.dmp"...done. [16 wallclock secs (15.40 usr + 0.24 sys = 15.64 CPU)]
-> Parsing NAMES file "/raid/user2/francisco/GOTTCHA-files/taxdump/names.dmp"...done. [17 wallclock secs (16.68 usr + 0.26 sys = 16.94 CPU)]
-> Generating Parent Trace...done. [132 wallclock secs (131.43 usr + 0.72 sys = 132.15 CPU)]
-> Performing deep cloning of shared variable "PARENT TRACE"...done. [31 wallclock secs (30.45 usr + 0.68 sys = 31.13 CPU)]
-> Storing Parent Trace to disk in BINARY format as "parentTrace.dmp"...done. [22 wallclock secs (19.43 usr + 0.67 sys = 20.10 CPU)]
-> Looking for all SPECIES nodes...done. [ 6 wallclock secs ( 6.42 usr + 0.00 sys = 6.42 CPU)]
-> Looking for all SUBSPECIES nodes...done. [ 8 wallclock secs ( 8.14 usr + 0.00 sys = 8.14 CPU)]
-> Reconstructing the SPECIES Taxonomic Tree....................................done. [261 wallclock secs (261.47 usr + 0.09 sys = 261.56 CPU)]
-> Generation of SPECIES tree complete! [488 wallclock secs (483.13 usr + 2.66 sys = 485.79 CPU)]
-> Acquiring list of Genbank files from "/raid/user2/francisco/GOTTCHA-files/GOTTCHA-test/genbank.txt"...done. 31 wallclock secs ( 4.10 usr + 7.10 sys = 11.20 CPU)
The following files in /raid/user2/francisco/GOTTCHA-files/GOTTCHA-test/genbank.txt were not found on disk:
(This is the genbank entry of the one bacterium I've included, Acaryochloris marina MBIC11017, extracted from ftp.ncbi.nih.gov/genbank/genomes/Bacteria/. On the pre-patched version this didn't happen.)
-> Parsing Genbank files for vital data (each '.' = 250 records).done. 0 wallclock secs ( 0.04 usr + 0.00 sys = 0.04 CPU)
-> Acquiring list of Genbank files from "/raid/user2/francisco/GOTTCHA-files/GOTTCHA-test/genomes.txt"...done. 274 wallclock secs (33.47 usr + 65.05 sys = 98.52 CPU)
The following files in /raid/user2/francisco/GOTTCHA-files/GOTTCHA-test/genomes.txt were not found on disk:
(The list of all 10142 viral genomes I supplied is printed here, in gbk format.)
-> Parsing Genbank files for vital data (each '.' = 250 records).........................................done. 1 wallclock secs ( 0.69 usr + 0.01 sys = 0.70 CPU)
-> Processing /raid/user2/francisco/GOTTCHA-files/gi_taxid_nucl/gi_taxid_nucl.dmp...
...Determining size of /raid/user2/francisco/GOTTCHA-files/gi_taxid_nucl/gi_taxid_nucl.dmp...done.
...Creating temporary directory for split files...done.
...Splitting /raid/user2/francisco/GOTTCHA-files/gi_taxid_nucl/gi_taxid_nucl.dmp into 8 partitions [tmp001/]...done.
...Processing the partitions...Argument "" isn't numeric in sort at /raid/user2/francisco/software/GOTTCHA-master_fix/bin/mkGottchaTaxTree.pl line 1009.
(this might be the cause? what is it though?)
After this I get the following error: Use of uninitialized value $currGI in numeric lt (<) at /raid/user2/francisco/software/GOTTCHA-master_fix/bin/mkGottchaTaxTree.pl line 1070.
Basically, it's the same error on a different line. Could this be related to the fact that I am not getting my virus genbank info from the ftp but querying using eutils? I have to use this because the ftp virus folder is missing some very important entries I need.
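One thing that could be ruled out at this point is malformed input. NCBI's gi_taxid_nucl.dmp holds two tab-separated integers per line (GI, then TaxID). The sketch below, in Python and not part of GOTTCHA, reports lines that do not match that shape. The function name `find_malformed_lines` is hypothetical, and the idea that an empty or non-numeric field is what triggers the warnings is only an unconfirmed guess.

```python
import re

# Two unsigned integers separated by a single tab: GI, then TaxID.
GI_TAXID_LINE = re.compile(r"[0-9]+\t[0-9]+")


def find_malformed_lines(dmp_path, limit=20):
    """Return up to `limit` (line_number, raw_line) pairs that are not 'GI<TAB>TaxID'."""
    bad = []
    with open(dmp_path) as handle:
        for lineno, line in enumerate(handle, start=1):
            raw = line.rstrip("\n")
            # Blank lines, missing fields and stray characters all fail the match.
            if not GI_TAXID_LINE.fullmatch(raw):
                bad.append((lineno, raw))
                if len(bad) >= limit:
                    break
    return bad


# Example call (path is a placeholder):
# for lineno, raw in find_malformed_lines("/path/gi_taxid_nucl.dmp"):
#     print(lineno, repr(raw))
```

An empty result would suggest the dump file itself is well formed and that the problem lies elsewhere, for example in GIs taken from the eutils-derived records.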
Hello,
I'm trying to generate a new taxonomic tree (described under "Generating the taxonomic tree and genome vitals references") to use with GOTTCHA, using a specific set of viral genomes. I generated a .gbk and a .fasta file for those specific viruses, which I used to run mkGottchaTaxTree.pl. For the .dmp files, I downloaded the sets from the NCBI ftp.
Here's the command I ran: perl mkGottchaTaxTree.pl --names=[path]/names.dmp --nodes=[path]/nodes.dmp --genbank=[path]/ViralGenomes.gbk --genomes=[path]/ViralGenomes.fasta --gi2taxid=[path]/gi_taxid_nucl.dmp --threads=8
Running this seems to enter an infinite loop, printing this message over and over: "Use of uninitialized value $currGI in numeric lt (<) at mkGottchaTaxTree.pl line 1037." The log file had reached ~350 GB of this message when I killed the process.
Any ideas on why this might be happening?
Francisco.