AstrobioMike / GToTree

A user-friendly workflow for phylogenomics
GNU General Public License v3.0

Connection issue? #78

Closed corkdagga closed 1 year ago

corkdagga commented 1 year ago

Hi,

I have just installed GToTree on an HPC using conda.

After running gtt-test.sh, I get the output below. I guess it is some kind of connection issue because there are multiple errors. I noticed issue #32 when trying to download from NCBI; the second option in that issue worked for me, but I have no idea how to apply it to gtt-test.sh, since its commands are run automatically. Is there an option or command I can use to permanently fix my issue?

Any help here would be great! Thanks!

(gtotree) pada358b@tauruslogin5:/beegfs/.global0/ws/pada358b-conda$ gtt-test.sh

Downloading GToTree test data into the subdirectory GToTree-test-data/

Data being pulled from here: https://figshare.com/articles/dataset/GToTree_test_data/19372334

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 7163k  100 7163k    0     0  9339k      0 --:--:-- --:--:-- --:--:-- 24.7M

Running test as:

GToTree -a GToTree-test-data/ncbi_accessions.txt \
               -g GToTree-test-data/genbank_files.txt \
               -f GToTree-test-data/fasta_files.txt \
               -A GToTree-test-data/amino_acid_files.txt \
               -m GToTree-test-data/genome_to_id_map.tsv \
               -p GToTree-test-data/pfam_targets.txt \
               -H Universal -t -D -j 4 -o GToTree-test-output

The test run includes some things that shouldn't be found, so don't be alarmed when seeing those messages.

Starting run now:

                              GToTree v1.6.34
                     (github.com/AstrobioMike/GToTree)

Downloading required NCBI taxonomy data (only needs to be done once)...

Traceback (most recent call last):
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree/lib/python3.7/urllib/request.py", line 1566, in ftp_open
    fw = self.connect_ftp(user, passwd, host, port, dirs, req.timeout)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree/lib/python3.7/urllib/request.py", line 1588, in connect_ftp
    persistent=False)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree/lib/python3.7/urllib/request.py", line 2408, in __init__
    self.init()
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree/lib/python3.7/urllib/request.py", line 2417, in init
    self.ftp.connect(self.host, self.port, self.timeout)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree/lib/python3.7/ftplib.py", line 154, in connect
    source_address=self.source_address)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree/lib/python3.7/socket.py", line 728, in create_connection
    raise err
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree/lib/python3.7/socket.py", line 716, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree/bin/gtt-get-ncbi-tax-data", line 123, in <module>
    main()
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree/bin/gtt-get-ncbi-tax-data", line 41, in main
    get_NCBI_tax_data(NCBI_data_dir)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree/bin/gtt-get-ncbi-tax-data", line 110, in get_NCBI_tax_data
    urllib.request.urlretrieve("ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz", taxdump_path)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree/lib/python3.7/urllib/request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree/lib/python3.7/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree/lib/python3.7/urllib/request.py", line 543, in _open
    '_open', req)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree/lib/python3.7/urllib/request.py", line 1584, in ftp_open
    raise exc.with_traceback(sys.exc_info()[2])
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree/lib/python3.7/urllib/request.py", line 1566, in ftp_open
    fw = self.connect_ftp(user, passwd, host, port, dirs, req.timeout)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree/lib/python3.7/urllib/request.py", line 1588, in connect_ftp
    persistent=False)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree/lib/python3.7/urllib/request.py", line 2408, in __init__
    self.init()
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree/lib/python3.7/urllib/request.py", line 2417, in init
    self.ftp.connect(self.host, self.port, self.timeout)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree/lib/python3.7/ftplib.py", line 154, in connect
    source_address=self.source_address)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree/lib/python3.7/socket.py", line 728, in create_connection
    raise err
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree/lib/python3.7/socket.py", line 716, in create_connection
    sock.connect(sa)
urllib.error.URLError: <urlopen error ftp error: ConnectionRefusedError(111, 'Connection refused')>

--------------------------------- RUN INFO ---------------------------------

Input genome sources include:
  - NCBI accessions listed in GToTree-test-data/ncbi_accessions.txt (8 genomes)
  - Genbank files listed in GToTree-test-data/genbank_files.txt (2 genomes)
  - Fasta files listed in GToTree-test-data/fasta_files.txt (2 genomes)
  - Amino-acid files listed in GToTree-test-data/amino_acid_files.txt (2 genomes)

                         Total input genomes: 14

HMM source to be used:
  - Universal_Hug_et_al (16 targets)

Options set:
  - The output directory has been set to "GToTree-test-output/"
  - The file "GToTree-test-data/genome_to_id_map.tsv" will be used to modify labels of the specified genomes
  - GTDB taxonomic info will be added to labels where possible
  - NCBI taxonomic info will be added where possible when GTDB is not
  - Number of jobs to run during parallelizable steps has been set to 4
  - Pfams will be searched from: GToTree-test-data/pfam_targets.txt (2 targets)

** NOTICE **
Filtering by gene-length using the median length of a gene set (set with the -c flag) becomes less reliable with fewer genomes. With 14 total input genomes, if a lot of sequences are dropped, consider increasing the parameter and/or visually inspecting the alignments.

More info can be found here:
  github.com/AstrobioMike/GToTree/wiki/Things-to-consider

            Moving forward with "-c" set to 0.2 this run.

##############################################################################

Downloading HMMs for additional Pfam targets

##############################################################################

/beegfs/ws/0/pada358b-conda/envs/gtotree/bin/gtt-get-additional-pfam-targets.sh: line 22: [: PF00238.19: unary operator expected
cat: 1690194970.gtotree.tmpdir/.hmm: No such file or directory
/beegfs/ws/0/pada358b-conda/envs/gtotree/bin/gtt-get-additional-pfam-targets.sh: line 22: [: PF05400.13: unary operator expected
cat: 1690194970.gtotree.tmpdir/.hmm: No such file or directory

##############################################################################

Working on the genomes provided as NCBI accessions

##############################################################################

              Downloading GenBank assembly summaries...

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (7) Failed to connect to ftp.ncbi.nlm.nih.gov port 21 after 112 ms: Connection refused

Download of NCBI assembly summaries failed :( Is the internet connection weak?

Exiting for now.


Test completed! See here for how things should look: https://github.com/AstrobioMike/GToTree/wiki/Installation#test-run

You can clear out the test data and results by running: gtt-clean-after-test.sh

(gtotree) pada358b@tauruslogin5:/beegfs/.global0/ws/pada358b-conda$
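
Both failures in the log above are refused connections to NCBI's FTP port (21), which on an HPC often points to a site firewall rather than a weak connection. A minimal sketch for checking this from the login node, assuming only the Python standard library; the helper name `can_connect` is made up for illustration and is not part of GToTree:

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures alike.
        return False


# Example (needs network access; results depend on the site's firewall):
#   can_connect("ftp.ncbi.nlm.nih.gov", 21)   # FTP, the port refused in the log
#   can_connect("ftp.ncbi.nlm.nih.gov", 443)  # HTTPS
```

If port 21 is refused while port 443 is reachable, FTP is probably being blocked at the cluster, and an HTTP(S)-based download option is the thing to look for.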

AstrobioMike commented 1 year ago

Hi there, @corkdagga!

Thanks for the note. There is indeed currently no way to set that for the test run, which was a silly oversight on my part. I will add that in ASAP. In the meantime, updating to a newer version might help. Do you know if there's a reason it was using v1.6.34? We're currently up to v1.8.1, and there have been some improvements to the downloading and download-error handling since then that I think might solve this.

If you wouldn't mind, I'd be interested to know whether it still happens with the latest version. You should be able to install it with the following (I changed the environment name so it doesn't overwrite your current one):

# if you don't have mamba already
conda install -n base -c conda-forge mamba

# creating new gtotree env with v1.8.1
mamba create -n gtotree-1.8.1 -c astrobiomike -c conda-forge -c bioconda -c defaults gtotree=1.8.1

conda activate gtotree-1.8.1
gtt-test.sh

I'll update you when I add the ability to set HTTP for the test run. Thanks for the note about that!
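
For context on what "setting HTTP" amounts to: the traceback earlier in the thread shows the taxonomy download using an `ftp://` URL. A rough sketch of the scheme swap, assuming NCBI serves the same paths over HTTPS on the same host; `to_https` is a hypothetical helper for illustration, not GToTree's actual code:

```python
from urllib.parse import urlsplit, urlunsplit


def to_https(url):
    """Rewrite an ftp:// URL to https://, leaving host and path unchanged."""
    parts = urlsplit(url)
    if parts.scheme != "ftp":
        return url  # already HTTP(S) or something else; leave it alone
    return urlunsplit(("https", parts.netloc, parts.path, parts.query, parts.fragment))


# The URL below is the one shown in the traceback above.
print(to_https("ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz"))
```

The resulting URL can then be handed to the same `urllib.request.urlretrieve` call, which goes out over port 443 instead of port 21.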

corkdagga commented 1 year ago

Hi!

Thanks for the help! Regarding the version, I used the following commands:

conda create -y -n gtotree python=3.7
conda activate gtotree
conda install -c astrobiomike -c conda-forge -c bioconda -c defaults gtotree

This was done using conda version 4.X.X (sorry, I can't remember the exact version, but it certainly isn't the newest). The reason is that I am trying to install it on my institution's HPC, where I am unable to update conda to the latest version. That is also why I did not install with mamba; I am not sure I will be able to. I will try it now along with your other suggestions and let you know the result.

Thanks!

AstrobioMike commented 1 year ago

Ah, thanks. Yeah, if installing mamba is a problem, you can run the same create command with conda in place of mamba 👍

I think (hopefully) the python specification in that command is what caused the earlier version to be installed.

Did you find that example install somewhere? If so, I should fix it, or add a note there explaining why it's written that way if there is a reason for it.

Never mind, I found the spot. I remembered that someone else wrote that in for me; I've updated it there to python=3.9.

So if you need to do it in two steps, that should still work now (installing a later python first, as changed on this page: https://github.com/AstrobioMike/GToTree/wiki/installation#done)

corkdagga commented 1 year ago

Hi,

I installed using conda as suggested and got the following output:

pada358b@tauruslogin5:/beegfs/ws/0/pada358b-conda$ conda activate gtotree-1.8.1
(gtotree-1.8.1) pada358b@tauruslogin5:/beegfs/ws/0/pada358b-conda$ gtt-test.sh

Downloading GToTree test data into the subdirectory GToTree-test-data/

Test data being pulled from here: https://zenodo.org/record/7860720#.ZEcWkexlA_8

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 7163k  100 7163k    0     0  1477k      0  0:00:04  0:00:04 --:--:-- 1477k

Running test as:

GToTree -a GToTree-test-data/ncbi_accessions.txt \
               -g GToTree-test-data/genbank_files.txt \
               -f GToTree-test-data/fasta_files.txt \
               -A GToTree-test-data/amino_acid_files.txt \
               -m GToTree-test-data/genome_to_id_map.tsv \
               -p GToTree-test-data/pfam_targets.txt \
               -H Universal -t -D -j 4 -o GToTree-test-output -F

The test run includes some things that shouldn't be found, so don't be alarmed when seeing those messages.

Starting run now:

                              GToTree v1.8.1
                     (github.com/AstrobioMike/GToTree)

Downloading NCBI assembly summaries (only done once, or updated after 4 weeks)...

Traceback (most recent call last):
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 1563, in ftp_open
    fw = self.connect_ftp(user, passwd, host, port, dirs, req.timeout)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 1584, in connect_ftp
    return ftpwrapper(user, passwd, host, port, dirs, timeout,
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 2405, in __init__
    self.init()
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 2414, in init
    self.ftp.connect(self.host, self.port, self.timeout)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/ftplib.py", line 158, in connect
    self.sock = socket.create_connection((self.host, self.port), self.timeout,
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/socket.py", line 844, in create_connection
    raise err
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/socket.py", line 832, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/bin/gtt-get-ncbi-assembly-tables", line 177, in <module>
    main()
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/bin/gtt-get-ncbi-assembly-tables", line 43, in main
    get_NCBI_assembly_summary_data(NCBI_assembly_data_dir)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/bin/gtt-get-ncbi-assembly-tables", line 153, in get_NCBI_assembly_summary_data
    urllib.request.urlretrieve(genbank_link, table_path)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 239, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 214, in urlopen
    return opener.open(url, data, timeout)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 517, in open
    response = self._open(req, data)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 534, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 494, in _call_chain
    result = func(*args)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 1581, in ftp_open
    raise exc.with_traceback(sys.exc_info()[2])
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 1563, in ftp_open
    fw = self.connect_ftp(user, passwd, host, port, dirs, req.timeout)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 1584, in connect_ftp
    return ftpwrapper(user, passwd, host, port, dirs, timeout,
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 2405, in __init__
    self.init()
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 2414, in init
    self.ftp.connect(self.host, self.port, self.timeout)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/ftplib.py", line 158, in connect
    self.sock = socket.create_connection((self.host, self.port), self.timeout,
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/socket.py", line 844, in create_connection
    raise err
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/socket.py", line 832, in create_connection
    sock.connect(sa)
urllib.error.URLError: <urlopen error ftp error: ConnectionRefusedError(111, 'Connection refused')>

Downloading required NCBI taxonomy data (only needs to be done once)...

Traceback (most recent call last):
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 1563, in ftp_open
    fw = self.connect_ftp(user, passwd, host, port, dirs, req.timeout)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 1584, in connect_ftp
    return ftpwrapper(user, passwd, host, port, dirs, timeout,
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 2405, in __init__
    self.init()
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 2414, in init
    self.ftp.connect(self.host, self.port, self.timeout)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/ftplib.py", line 158, in connect
    self.sock = socket.create_connection((self.host, self.port), self.timeout,
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/socket.py", line 844, in create_connection
    raise err
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/socket.py", line 832, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/bin/gtt-get-ncbi-tax-data", line 123, in <module>
    main()
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/bin/gtt-get-ncbi-tax-data", line 41, in main
    get_NCBI_tax_data(NCBI_data_dir)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/bin/gtt-get-ncbi-tax-data", line 110, in get_NCBI_tax_data
    urllib.request.urlretrieve("ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz", taxdump_path)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 239, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 214, in urlopen
    return opener.open(url, data, timeout)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 517, in open
    response = self._open(req, data)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 534, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 494, in _call_chain
    result = func(*args)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 1581, in ftp_open
    raise exc.with_traceback(sys.exc_info()[2])
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 1563, in ftp_open
    fw = self.connect_ftp(user, passwd, host, port, dirs, req.timeout)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 1584, in connect_ftp
    return ftpwrapper(user, passwd, host, port, dirs, timeout,
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 2405, in __init__
    self.init()
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 2414, in init
    self.ftp.connect(self.host, self.port, self.timeout)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/ftplib.py", line 158, in connect
    self.sock = socket.create_connection((self.host, self.port), self.timeout,
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/socket.py", line 844, in create_connection
    raise err
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/socket.py", line 832, in create_connection
    sock.connect(sa)
urllib.error.URLError: <urlopen error ftp error: ConnectionRefusedError(111, 'Connection refused')>

--------------------------------- RUN INFO ---------------------------------

Input genome sources include:
  - NCBI accessions listed in GToTree-test-data/ncbi_accessions.txt (8 genomes)
  - Genbank files listed in GToTree-test-data/genbank_files.txt (2 genomes)
  - Fasta files listed in GToTree-test-data/fasta_files.txt (2 genomes)
  - Amino-acid files listed in GToTree-test-data/amino_acid_files.txt (2 genomes)

                         Total input genomes: 14

HMM source to be used:
  - Universal_Hug_et_al.hmm (16 targets)

Options set:
  - The output directory has been set to "GToTree-test-output/"
  - The file "GToTree-test-data/genome_to_id_map.tsv" will be used to modify labels of the specified genomes
  - GTDB taxonomic info will be added to labels where possible
  - NCBI taxonomic info will be added where possible when GTDB is not
  - Number of jobs to run during parallelizable steps has been set to 4
  - Pfams will be searched from: GToTree-test-data/pfam_targets.txt (2 targets)

** NOTICE **
Filtering by gene-length using the median length of a gene set (set with the -c flag) becomes less reliable with fewer genomes. With 14 total input genomes, if a lot of sequences are dropped, consider increasing the parameter and/or visually inspecting the alignments.

More info can be found here:
  github.com/AstrobioMike/GToTree/wiki/Things-to-consider

            Moving forward with "-c" set to 0.2 this run.

Downloading and parsing archaeal and bacterial metadata tables from GTDB (only needs to be done once)...

##############################################################################

Downloading HMMs for additional Pfam targets

##############################################################################

##############################################################################

Working on the genomes provided as NCBI accessions

##############################################################################

Traceback (most recent call last):
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/bin/gtt-parse-assembly-summary-file", line 31, in <module>
    with open(args.all_assemblies) as assemblies:
FileNotFoundError: [Errno 2] No such file or directory: '/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/share/gtotree/ncbi_assembly_summaries//ncbi-assembly-info.tsv'

** NOTICE **
8 accession(s) not successfully found at NCBI.

Reported in "GToTree-test-output/run_files/NCBI_accessions_not_found.txt".


 ******************************* UPDATE *******************************  
    Of the input genomes provided by NCBI accession:

      8 accession(s) not found at NCBI.
      Reported in "GToTree-test-output/run_files/NCBI_accessions_not_found.txt".

    0 of the total 8 input accessions had their genomes successfully
    downloaded and searched.
 ********************************************************************** 

##############################################################################

Working on the genomes provided as GenBank files

##############################################################################

       It is currently 05:43 PM; the process started at 05:43 PM.
           Current process runtime: 0 hours and 0 minutes.

 Genome: GCF_000012505.1_ASM1250v1_genomic
         Found 14 of the targeted 16 genes.
         Est. % comp: 87.50; Est. % redund: 0.00

 Genome: GCA_900473895.1_N32_genomic

** NOTICE **
This genbank file doesn't appear to have CDS annotations, so we are identifying coding sequences with prodigal.

Reported in "GToTree-test-output/run_files/Genbank_files_with_no_CDSs.txt".


         Found 14 of the targeted 16 genes.
         Est. % comp: 87.50; Est. % redund: 0.00

##############################################################################

Working on the genomes provided as fasta files

##############################################################################

       It is currently 05:43 PM; the process started at 05:43 PM.
           Current process runtime: 0 hours and 0 minutes.

 Genome: GCA_000009925.1_ASM992v1_genomic
         Found 14 of the targeted 16 genes.
         Est. % comp: 87.50; Est. % redund: 0.00

 Genome: GCA_000012825.1_ASM1282v1_genomic
         Found 14 of the targeted 16 genes.
         Est. % comp: 87.50; Est. % redund: 0.00

##############################################################################

Working on the genomes provided as amino acid files

##############################################################################

       It is currently 05:43 PM; the process started at 05:43 PM.
           Current process runtime: 0 hours and 0 minutes.

 Genome: GCF_001886455.1_ASM188645v1_protein
         Found 14 of the targeted 16 genes.
         Est. % comp: 87.50; Est. % redund: 0.00

 Genome: GCF_000020585.3_ASM2058v3_protein
         Found 14 of the targeted 16 genes.
         Est. % comp: 87.50; Est. % redund: 0.00

##############################################################################

Filtering genes by length

##############################################################################

 Keeping those with lengths within 20% of the median for the gene set.

       It is currently 05:43 PM; the process started at 05:43 PM.
           Current process runtime: 0 hours and 0 minutes.

** NOTICE **
2 gene(s) either had no hits in any genome, a hit in only one genome, or only multiple hits per genome... Just so ya know!!

    These included:

                 Ribosomal_L15e
                 Ribosomal_L18

Reported in "GToTree-test-output/run_files/Target-genes-not-found-or-retained.txt".

If interested, you can figure out which of those scenarios was the cause by checking out "GToTree-test-output/SCG_hit_counts.tsv".


Filtering Ribosomal_L2 sequences by length...

    Retained 6 sequences of the initial 6 (100.0%).

Filtering Ribosomal_L14 sequences by length...

    Retained 6 sequences of the initial 6 (100.0%).

Filtering Ribosomal_L16 sequences by length...

    Retained 6 sequences of the initial 6 (100.0%).

Filtering Ribosomal_L22 sequences by length...

    Retained 6 sequences of the initial 6 (100.0%).

Filtering ribosomal_L24 sequences by length...

    Retained 6 sequences of the initial 6 (100.0%).

Filtering Ribosomal_L3 sequences by length...

    Retained 6 sequences of the initial 6 (100.0%).

Filtering Ribosomal_L4 sequences by length...

    Retained 6 sequences of the initial 6 (100.0%).

Filtering Ribosomal_L5 sequences by length...

    Retained 6 sequences of the initial 6 (100.0%).

Filtering Ribosomal_S19 sequences by length...

    Retained 6 sequences of the initial 6 (100.0%).

Filtering Ribosomal_L6 sequences by length...

    Retained 6 sequences of the initial 6 (100.0%).

Filtering Ribosomal_S10 sequences by length...

    Retained 6 sequences of the initial 6 (100.0%).

Filtering Ribosomal_S17 sequences by length...

    Retained 6 sequences of the initial 6 (100.0%).

Filtering Ribosomal_S3_C sequences by length...

    Retained 6 sequences of the initial 6 (100.0%).

Filtering Ribosomal_S8 sequences by length...

    Retained 6 sequences of the initial 6 (100.0%).

##############################################################################

Filtering genomes with too few hits

##############################################################################

 Removing those with fewer than 50% of the total SCGs targeted.

       It is currently 05:44 PM; the process started at 05:43 PM.
           Current process runtime: 0 hours and 0 minutes.

         No genomes were removed for having too few hits :)

##############################################################################

Aligning, trimming, and inserting gap-sequences

##############################################################################

       It is currently 05:44 PM; the process started at 05:43 PM.
           Current process runtime: 0 hours and 0 minutes.

        Finished aligning and formatting gene-set Ribosomal_L16.


        Finished aligning and formatting gene-set Ribosomal_L14.


        Finished aligning and formatting gene-set Ribosomal_L2.


        Finished aligning and formatting gene-set Ribosomal_L22.


        Finished aligning and formatting gene-set Ribosomal_L5.


        Finished aligning and formatting gene-set ribosomal_L24.


        Finished aligning and formatting gene-set Ribosomal_L3.


        Finished aligning and formatting gene-set Ribosomal_L4.


        Finished aligning and formatting gene-set Ribosomal_S10.


        Finished aligning and formatting gene-set Ribosomal_L6.


        Finished aligning and formatting gene-set Ribosomal_S19.


        Finished aligning and formatting gene-set Ribosomal_S17.


        Finished aligning and formatting gene-set Ribosomal_S8.


        Finished aligning and formatting gene-set Ribosomal_S3_C.


##############################################################################

Catting all alignments together

##############################################################################

       It is currently 05:44 PM; the process started at 05:43 PM.
           Current process runtime: 0 hours and 1 minutes.

##############################################################################

Adding more informative headers

##############################################################################

17:44:15.263 [ERRO] taxonomy data not found, please download and uncompress ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz, and copy "names.dmp", "nodes.dmp", "delnodes.dmp", and "merged.dmp" to /beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/share/gtotree/ncbi_tax_info/
17:44:15.263 [ERRO] taxonomy data not found, please download and uncompress ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz, and copy "names.dmp", "nodes.dmp", "delnodes.dmp", and "merged.dmp" to /beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/share/gtotree/ncbi_tax_info/

** NOTICE **
Strain-level labels were requested in addition to using GTDB taxonomy where possible. This is just a note that there will be no strain-level labels added for those that had GTDB lineage info added.



##############################################################################

Parsing results of additional Pfam searches

##############################################################################

##############################################################################

Running FastTreeMP

##############################################################################

       It is currently 05:44 PM; the process started at 05:43 PM.
           Current process runtime: 0 hours and 1 minutes.

FastTree Version 2.1.11 Double precision (No SSE3), OpenMP (4 threads)
Alignment: GToTree-test-output/Aligned_SCGs_mod_names.faa
Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.80
ML Model: Jones-Taylor-Thorton, CAT approximation with 20 rate categories
Ignored unknown character X (seen 390 times)
Initial topology in 0.01 seconds
Refining topology: 10 rounds ME-NNIs, 2 rounds ME-SPRs, 5 rounds ML-NNIs
Total branch-length 1.093 after 0.03 sec
ML-NNI round 1: LogLk = -15470.209 NNIs 0 max delta 0.00 Time 0.14
      0.14 seconds: Site likelihoods with rate category 1 of 20
      0.25 seconds: Site likelihoods with rate category 19 of 20
Switched to using 20 rate categories (CAT approximation)
Rate categories were divided by 0.754 so that average rate = 1.0
CAT-based log-likelihoods may not be comparable across runs
Use -gamma for approximate but comparable Gamma(20) log-likelihoods
ML-NNI round 2: LogLk = -14749.153 NNIs 0 max delta 0.00 Time 0.33
Turning off heuristics for final round of ML NNIs (converged)
ML-NNI round 3: LogLk = -14749.091 NNIs 0 max delta 0.00 Time 0.39 (final)
      0.39 seconds: ML Lengths 1 of 4 splits
Optimize all lengths: LogLk = -14749.073 Time 0.45
Total time: 0.53 seconds Unique: 6/6 Bad splits: 0/3

#################################################################################

Done!!

#################################################################################

Overall, 6 genomes of the input 14 were retained (see notes below).

Tree written to:
    GToTree-test-output/GToTree-test-output.tre

Alignment written to:
    GToTree-test-output/Aligned_SCGs_mod_names.faa

Main genomes summary table written to:
    GToTree-test-output/Genomes_summary_info.tsv

Summary table with hits per target gene per genome written to:
    GToTree-test-output/SCG_hit_counts.tsv

Outputs from Pfam searching written to:
    GToTree-test-output/Pfam_search_results/

Partitions file (for downstream use with mixed-model treeing) written to:
    GToTree-test-output/run_files/Partitions.txt

Notes:

    8 accession(s) not successfully found at NCBI.
    2 gene(s) either had no hits or only multiple hits in each genome.

Reported along with additional informative run files in:
    GToTree-test-output/run_files/

Log file written to:
    GToTree-test-output/gtotree-runlog.txt

Programs used and their citations have been written to:
    GToTree-test-output/citations.txt

                                     Total process runtime: 0 hours and 1 minutes.
                                                  Happy Monday :)

Test completed! See here for how things should look: https://github.com/AstrobioMike/GToTree/wiki/Installation#test-run

You can clear out the test data and results by running: gtt-clean-after-test.sh

corkdagga commented 1 year ago

with python version 3.9 :)

AstrobioMike commented 1 year ago

well bummer! But still great you got the latest one setup 👍

Also got us one step further, where the test data is being downloaded properly at least

i will add that capability ASAP, but in the meantime, if you are in the same working directory, you could manually trigger the test after the above failure. Running gtt-test.sh is needed first, as that will grab the test data. Then, once it fails, you can run the command it uses (printed at the top of the log) with the -P option added, so it'd look like this:

GToTree -a GToTree-test-data/ncbi_accessions.txt \
               -g GToTree-test-data/genbank_files.txt \
               -f GToTree-test-data/fasta_files.txt \
               -A GToTree-test-data/amino_acid_files.txt \
               -m GToTree-test-data/genome_to_id_map.tsv \
               -H Universal -t -D -j 4 -o GToTree-test-output -F -P

And hopefully that'll work, though i also removed the pfam part as that might not have a non-ftp option available, i forget at the moment. But see if you can run that after gtt-test.sh fails.

corkdagga commented 1 year ago

Good morning,

This time I got the following results. It seems to have worked better but there are still several errors.

(gtotree-1.8.1) pada358b@tauruslogin6:/beegfs/ws/0/pada358b-conda$ GToTree -a GToTree-test-data/ncbi_accessions.txt -g GToTree-test-data/genbank_files.txt -f GToTree-test-data/fasta_files.txt -A GToTree-test-data/amino_acid_files.txt -m GToTree-test-data/genome_to_id_map.tsv -H Universal -t -D -j 4 -o GToTree-test-output -F -P

                              GToTree v1.8.1
                     (github.com/AstrobioMike/GToTree)

Traceback (most recent call last):
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/bin/gtt-get-ncbi-assembly-tables", line 177, in <module>
    main()
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/bin/gtt-get-ncbi-assembly-tables", line 36, in main
    data_present = check_if_data_present_and_less_than_4_weeks_old(NCBI_assembly_data_dir)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/bin/gtt-get-ncbi-assembly-tables", line 114, in check_if_data_present_and_less_than_4_weeks_old
    stored_date = date(int(stored_date_list[0]), int(stored_date_list[1]), int(stored_date_list[2]))
ValueError: invalid literal for int() with base 10: ''

Downloading required NCBI taxonomy data (only needs to be done once)...

Traceback (most recent call last):
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 1563, in ftp_open
    fw = self.connect_ftp(user, passwd, host, port, dirs, req.timeout)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 1584, in connect_ftp
    return ftpwrapper(user, passwd, host, port, dirs, timeout,
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 2405, in __init__
    self.init()
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 2414, in init
    self.ftp.connect(self.host, self.port, self.timeout)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/ftplib.py", line 158, in connect
    self.sock = socket.create_connection((self.host, self.port), self.timeout,
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/socket.py", line 844, in create_connection
    raise err
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/socket.py", line 832, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/bin/gtt-get-ncbi-tax-data", line 123, in <module>
    main()
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/bin/gtt-get-ncbi-tax-data", line 41, in main
    get_NCBI_tax_data(NCBI_data_dir)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/bin/gtt-get-ncbi-tax-data", line 110, in get_NCBI_tax_data
    urllib.request.urlretrieve("ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz", taxdump_path)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 239, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 214, in urlopen
    return opener.open(url, data, timeout)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 517, in open
    response = self._open(req, data)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 534, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 494, in _call_chain
    result = func(*args)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 1581, in ftp_open
    raise exc.with_traceback(sys.exc_info()[2])
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 1563, in ftp_open
    fw = self.connect_ftp(user, passwd, host, port, dirs, req.timeout)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 1584, in connect_ftp
    return ftpwrapper(user, passwd, host, port, dirs, timeout,
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 2405, in __init__
    self.init()
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/urllib/request.py", line 2414, in init
    self.ftp.connect(self.host, self.port, self.timeout)
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/ftplib.py", line 158, in connect
    self.sock = socket.create_connection((self.host, self.port), self.timeout,
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/socket.py", line 844, in create_connection
    raise err
  File "/beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/lib/python3.9/socket.py", line 832, in create_connection
    sock.connect(sa)
urllib.error.URLError: <urlopen error ftp error: ConnectionRefusedError(111, 'Connection refused')>

--------------------------------- RUN INFO ---------------------------------

Input genome sources include:
  - NCBI accessions listed in GToTree-test-data/ncbi_accessions.txt (8 genomes)
  - Genbank files listed in GToTree-test-data/genbank_files.txt (2 genomes)
  - Fasta files listed in GToTree-test-data/fasta_files.txt (2 genomes)
  - Amino-acid files listed in GToTree-test-data/amino_acid_files.txt (2 genomes)

                         Total input genomes: 14

HMM source to be used:
  - Universal_Hug_et_al (16 targets)

Options set:
  - The output directory has been set to "GToTree-test-output/"
  - The file "GToTree-test-data/genome_to_id_map.tsv" will be used to modify labels of the specified genomes
  - GTDB taxonomic info will be added to labels where possible
  - NCBI taxonomic info will be added where possible when GTDB is not
  - Number of jobs to run during parallelizable steps has been set to 4
  - Attempting to use http instead of ftp

** NOTICE **
Filtering by gene-length using the median length of a gene set (set with the -c flag) becomes less reliable with fewer genomes. With 14 total input genomes, if a lot of sequences are dropped, consider increasing the parameter and/or visually inspecting the alignments.

More info can be found here:
  github.com/AstrobioMike/GToTree/wiki/Things-to-consider

            Moving forward with "-c" set to 0.2 this run.

##############################################################################

Working on the genomes provided as NCBI accessions

##############################################################################

** NOTICE **
6 accession(s) not successfully found at NCBI.

Reported in "GToTree-test-output/run_files/NCBI_accessions_not_found.txt".


 Genome: GCA_000299365.1
         Found 15 of the targeted 16 genes.
         Est. % comp: 93.75; Est. % redund: 0.00

 Genome: GCA_000172635
         Found 14 of the targeted 16 genes.
         Est. % comp: 87.50; Est. % redund: 0.00

 ******************************* UPDATE *******************************  
    Of the input genomes provided by NCBI accession:

      6 accession(s) not found at NCBI.
      Reported in "GToTree-test-output/run_files/NCBI_accessions_not_found.txt".

    2 of the total 8 input accessions had their genomes successfully
    downloaded and searched.
 ********************************************************************** 

##############################################################################

Working on the genomes provided as GenBank files

##############################################################################

       It is currently 07:42 AM; the process started at 07:29 AM.
           Current process runtime: 0 hours and 13 minutes.

 Genome: GCF_000012505.1_ASM1250v1_genomic
         Found 14 of the targeted 16 genes.
         Est. % comp: 87.50; Est. % redund: 0.00

 Genome: GCA_900473895.1_N32_genomic

** NOTICE **
This genbank file doesn't appear to have CDS annotations, so we are identifying coding sequences with prodigal.

Reported in "GToTree-test-output/run_files/Genbank_files_with_no_CDSs.txt".


         Found 14 of the targeted 16 genes.
         Est. % comp: 87.50; Est. % redund: 0.00

##############################################################################

Working on the genomes provided as fasta files

##############################################################################

       It is currently 07:42 AM; the process started at 07:29 AM.
           Current process runtime: 0 hours and 13 minutes.

 Genome: GCA_000012825.1_ASM1282v1_genomic
         Found 14 of the targeted 16 genes.
         Est. % comp: 87.50; Est. % redund: 0.00

 Genome: GCA_000009925.1_ASM992v1_genomic
         Found 14 of the targeted 16 genes.
         Est. % comp: 87.50; Est. % redund: 0.00

##############################################################################

Working on the genomes provided as amino acid files

##############################################################################

       It is currently 07:43 AM; the process started at 07:29 AM.
           Current process runtime: 0 hours and 13 minutes.

 Genome: GCF_000020585.3_ASM2058v3_protein
         Found 14 of the targeted 16 genes.
         Est. % comp: 87.50; Est. % redund: 0.00

 Genome: GCF_001886455.1_ASM188645v1_protein
         Found 14 of the targeted 16 genes.
         Est. % comp: 87.50; Est. % redund: 0.00

##############################################################################

Filtering genes by length

##############################################################################

 Keeping those with lengths within 20% of the median for the gene set.

       It is currently 07:43 AM; the process started at 07:29 AM.
           Current process runtime: 0 hours and 13 minutes.

** NOTICE **
2 gene(s) either had no hits in any genome, a hit in only one genome, or only multiple hits per genome... Just so ya know!!

    These included:

                 Ribosomal_L15e
                 Ribosomal_L18

Reported in "GToTree-test-output/run_files/Target-genes-not-found-or-retained.txt".

If interested, you can figure out which of those scenarios was the cause by checking out "GToTree-test-output/SCG_hit_counts.tsv".


Filtering Ribosomal_L14 sequences by length...

    Retained 8 sequences of the initial 8 (100.0%).

Filtering Ribosomal_L16 sequences by length...

    Retained 8 sequences of the initial 8 (100.0%).

Filtering Ribosomal_L22 sequences by length...

    Retained 7 sequences of the initial 8 (87.5%).

Filtering Ribosomal_L2 sequences by length...

    Retained 8 sequences of the initial 8 (100.0%).

Filtering Ribosomal_L3 sequences by length...

    Retained 7 sequences of the initial 8 (87.5%).

Filtering ribosomal_L24 sequences by length...

    Retained 7 sequences of the initial 7 (100.0%).

Filtering Ribosomal_L5 sequences by length...

    Retained 8 sequences of the initial 8 (100.0%).

Filtering Ribosomal_L4 sequences by length...

    Retained 7 sequences of the initial 8 (87.5%).

Filtering Ribosomal_L6 sequences by length...

    Retained 8 sequences of the initial 8 (100.0%).

Filtering Ribosomal_S10 sequences by length...

    Retained 8 sequences of the initial 8 (100.0%).

Filtering Ribosomal_S19 sequences by length...

    Retained 7 sequences of the initial 8 (87.5%).

Filtering Ribosomal_S17 sequences by length...

    Retained 7 sequences of the initial 8 (87.5%).

Filtering Ribosomal_S8 sequences by length...

    Retained 8 sequences of the initial 8 (100.0%).

Filtering Ribosomal_S3_C sequences by length...

    Retained 8 sequences of the initial 8 (100.0%).

##############################################################################

Filtering genomes with too few hits

##############################################################################

 Removing those with fewer than 50% of the total SCGs targeted.

       It is currently 07:43 AM; the process started at 07:29 AM.
           Current process runtime: 0 hours and 14 minutes.

         No genomes were removed for having too few hits :)

##############################################################################

Aligning, trimming, and inserting gap-sequences

##############################################################################

       It is currently 07:43 AM; the process started at 07:29 AM.
           Current process runtime: 0 hours and 14 minutes.

        Finished aligning and formatting gene-set Ribosomal_L16.


        Finished aligning and formatting gene-set Ribosomal_L22.


        Finished aligning and formatting gene-set Ribosomal_L14.


        Finished aligning and formatting gene-set Ribosomal_L2.


        Finished aligning and formatting gene-set ribosomal_L24.


        Finished aligning and formatting gene-set Ribosomal_L5.


        Finished aligning and formatting gene-set Ribosomal_L3.


        Finished aligning and formatting gene-set Ribosomal_L4.


        Finished aligning and formatting gene-set Ribosomal_S10.


        Finished aligning and formatting gene-set Ribosomal_L6.


        Finished aligning and formatting gene-set Ribosomal_S17.


        Finished aligning and formatting gene-set Ribosomal_S19.


        Finished aligning and formatting gene-set Ribosomal_S8.


        Finished aligning and formatting gene-set Ribosomal_S3_C.


##############################################################################

Catting all alignments together

##############################################################################

       It is currently 07:43 AM; the process started at 07:29 AM.
           Current process runtime: 0 hours and 14 minutes.

##############################################################################

Adding more informative headers

##############################################################################

07:43:32.051 [ERRO] taxonomy data not found, please download and uncompress ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz, and copy "names.dmp", "nodes.dmp", "delnodes.dmp", and "merged.dmp" to /beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/share/gtotree/ncbi_tax_info/
07:43:32.051 [ERRO] taxonomy data not found, please download and uncompress ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz, and copy "names.dmp", "nodes.dmp", "delnodes.dmp", and "merged.dmp" to /beegfs/ws/0/pada358b-conda/envs/gtotree-1.8.1/share/gtotree/ncbi_tax_info/

** NOTICE **
Strain-level labels were requested in addition to using GTDB taxonomy where possible. This is just a note that there will be no strain-level labels added for those that had GTDB lineage info added.



##############################################################################

Running FastTreeMP

##############################################################################

       It is currently 07:43 AM; the process started at 07:29 AM.
           Current process runtime: 0 hours and 14 minutes.

FastTree Version 2.1.11 Double precision (No SSE3), OpenMP (4 threads)
Alignment: GToTree-test-output/Aligned_SCGs_mod_names.faa
Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.80
ML Model: Jones-Taylor-Thorton, CAT approximation with 20 rate categories
Ignored unknown character X (seen 520 times)
Initial topology in 0.01 seconds
Refining topology: 12 rounds ME-NNIs, 2 rounds ME-SPRs, 6 rounds ML-NNIs
Total branch-length 2.096 after 0.07 sec
0.15 seconds: ML NNI round 1 of 6, 1 of 6 splits
ML-NNI round 1: LogLk = -18733.547 NNIs 0 max delta 0.00 Time 0.26
0.26 seconds: Site likelihoods with rate category 1 of 20
0.36 seconds: Site likelihoods with rate category 13 of 20
Switched to using 20 rate categories (CAT approximation)
Rate categories were divided by 0.781 so that average rate = 1.0
CAT-based log-likelihoods may not be comparable across runs
Use -gamma for approximate but comparable Gamma(20) log-likelihoods
ML-NNI round 2: LogLk = -17866.440 NNIs 0 max delta 0.00 Time 0.54
Turning off heuristics for final round of ML NNIs (converged)
0.54 seconds: ML NNI round 3 of 6, 1 of 6 splits
ML-NNI round 3: LogLk = -17866.391 NNIs 0 max delta 0.00 Time 0.64 (final)
Optimize all lengths: LogLk = -17866.382 Time 0.72
Total time: 0.84 seconds Unique: 8/8 Bad splits: 0/5

#################################################################################

Done!!

#################################################################################

Overall, 8 genomes of the input 14 were retained (see notes below).

Tree written to:
    GToTree-test-output/GToTree-test-output.tre

Alignment written to:
    GToTree-test-output/Aligned_SCGs_mod_names.faa

Main genomes summary table written to:
    GToTree-test-output/Genomes_summary_info.tsv

Summary table with hits per target gene per genome written to:
    GToTree-test-output/SCG_hit_counts.tsv

Partitions file (for downstream use with mixed-model treeing) written to:
    GToTree-test-output/run_files/Partitions.txt

Notes:

    6 accession(s) not successfully found at NCBI.
    2 gene(s) either had no hits or only multiple hits in each genome.

Reported along with additional informative run files in:
    GToTree-test-output/run_files/

Log file written to:
    GToTree-test-output/gtotree-runlog.txt

Programs used and their citations have been written to:
    GToTree-test-output/citations.txt

                                     Total process runtime: 0 hours and 14 minutes.
                                                  Happy Tuesday :)

(gtotree-1.8.1) pada358b@tauruslogin6:/beegfs/ws/0/pada358b-conda$

AstrobioMike commented 1 year ago

Ok, this is helpful. I'll try to work on this tomorrow, thank you for your help in making things more robust. Sorry it's giving you trouble!
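For reference, the first traceback in the log above fails on `int('')`, which suggests the stored download date was read back as an empty string. Below is a minimal sketch of a more forgiving check, assuming the date is kept as comma-separated `YYYY,MM,DD` text. That storage format, and the helper names `parse_stored_date` and `is_fresh`, are hypothetical and not taken from GToTree's code.

```python
from datetime import date, timedelta
from typing import Optional


def parse_stored_date(text: str, sep: str = ",") -> Optional[date]:
    """Return the stored date, or None if the text is empty or malformed.

    Assumes (hypothetically) that the date was saved as 'YYYY,MM,DD'.
    """
    parts = [p.strip() for p in text.strip().split(sep)]
    if len(parts) != 3:
        return None
    try:
        year, month, day = (int(p) for p in parts)
        return date(year, month, day)
    except ValueError:
        # Covers non-numeric fields and impossible dates such as month 13.
        return None


def is_fresh(text: str, today: date, max_age_days: int = 28) -> bool:
    """True only if a valid stored date exists and is under max_age_days old."""
    stored = parse_stored_date(text)
    if stored is None:
        # Treat missing or corrupt date info as stale so the data is re-downloaded.
        return False
    return today - stored < timedelta(days=max_age_days)
```

With this shape, an empty or half-written date file leads to a fresh download attempt instead of a crash.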

corkdagga commented 1 year ago

No worries, thanks for being so quick with all your help!

Could the connection issues be a problem with my institution/HPC?

AstrobioMike commented 1 year ago

Maybe, but it might still just be an ftp problem. I implemented better handling of downloading things recently, and it looks like maybe I forgot to control for the HTTP flag when downloading the NCBI metadata: above, it looks like it's still trying the ftp address even though you ran it with -P, so it should be trying the http address. Ftp being a problem is a relatively rare scenario, and it isn't one I know how to set up a test for, ha. So this is super-helpful for me. I'll dig into things my tomorrow and get back to you :)
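One way to tell a blocked FTP port apart from a general network problem is to test outbound TCP connections directly. This is a rough sketch, not part of GToTree: `can_connect` and `report` are hypothetical helper names, and it assumes FTP control traffic uses port 21 and HTTPS uses port 443.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused connections, timeouts, unreachable networks, and DNS
        # failures are all OSError subclasses.
        return False


def report(host: str) -> dict:
    """Check the FTP and HTTPS ports on a single host."""
    return {
        "ftp (port 21)": can_connect(host, 21),
        "https (port 443)": can_connect(host, 443),
    }
```

Calling `report("ftp.ncbi.nlm.nih.gov")` on a login node would show whether only port 21 is refused. If so, that points to a site firewall rule rather than a problem on NCBI's end.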

AstrobioMike commented 1 year ago

Okie, i was mistaken. GToTree was appropriately trying to use the http version when trying to pull in the ncbi assembly summary tables. Seeing "ftp" in the address tried in the log above confused me, but that's still a part of the http address, as it starts with https://ftp....

So that part was fine, and suggests there might be a problem accessing the ncbi data either way on your cluster. But let's confirm that with the smallest chunk possible. Can you see if either of these download commands work on your cluster?

ftp way:

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt

http way:

wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt

if downloaded successfully, it should look something like this size-wise:

ls -l assembly_summary_refseq.txt
# 146570308

corkdagga commented 1 year ago

Hi,

Sorry for the delay. It seems like ftp is still a problem. Below is the output for your instructions.

pada358b@tauruslogin3:/beegfs/ws/0/pada358b-conda$ wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt
--2023-07-26 14:42:36-- ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt
           => ‘assembly_summary_refseq.txt’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.11, 130.14.250.13, 2607:f220:41e:250::7, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.11|:21... failed: Connection refused.
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.13|:21... failed: Connection refused.
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|2607:f220:41e:250::7|:21... failed: Network is unreachable.
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|2607:f220:41f:250::228|:21... failed: Network is unreachable.

pada358b@tauruslogin3:/beegfs/ws/0/pada358b-conda$ wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt
--2023-07-26 14:42:47-- https://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.13, 130.14.250.11, 2607:f220:41f:250::228, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 146717099 (140M) [text/plain]
Saving to: ‘assembly_summary_refseq.txt’

100%[=================================================================================================>] 146,717,099 28.5MB/s in 5.6s

2023-07-26 14:42:53 (24.9 MB/s) - ‘assembly_summary_refseq.txt’ saved [146717099/146717099]

pada358b@tauruslogin3:/beegfs/ws/0/pada358b-conda$ ls -l assembly_summary_refseq.txt
-rw-r--r-- 1 pada358b p_microbpath 146717099 Jul 26 12:34 assembly_summary_refseq.txt
pada358b@tauruslogin3:/beegfs/ws/0/pada358b-conda$

AstrobioMike commented 1 year ago

ah, great that the http way works though 👍

That just means there's still something i missed setting up properly to use http instead of ftp when directed to
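To illustrate the idea of switching protocols on request, here is a small sketch that rewrites an `ftp://` address to its `https://` equivalent when a flag is set. It relies on the observation earlier in this thread that NCBI serves the same paths at `https://ftp....`. The function `resolve_url` and its `use_http` parameter are made up for this example and are not GToTree's actual implementation.

```python
from urllib.parse import urlsplit, urlunsplit


def resolve_url(url: str, use_http: bool) -> str:
    """Swap an ftp:// scheme for https:// when use_http is set.

    Host and path are left untouched; any non-ftp URL is returned as-is.
    """
    parts = urlsplit(url)
    if use_http and parts.scheme == "ftp":
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)
```

Routing every download address through one helper like this would make it harder for a single download step to keep using ftp after the user has asked for http.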

Sorry for the delay on my end too.

I've updated the fixes i think were required, and they are live as of version 1.8.2. Can you try removing that conda environment, and making a new one, e.g. with:

conda create -n gtotree -c astrobiomike -c conda-forge -c bioconda -c defaults gtotree=1.8.2

conda activate gtotree

Then try the test like so:

gtt-test.sh http

If that works, then when running the actual program on your data, you just want to include the -P flag, which tells the main program to use http instead of ftp.

I'm sorry this has been such a pain for you!

I broke my knee a couple weeks ago and am finally getting into surgery tomorrow (thank goodness!). So if you don't hear back from me for a few days, that's why :)

corkdagga commented 1 year ago

Thanks again! I think everything worked fine, at least there were no errors that I could easily see. Below is the output.

Thanks for the help and speedy recovery.

(gtotree) pada358b@tauruslogin3:/beegfs/ws/0/pada358b-conda$ gtt-test.sh http

Downloading GToTree test data into the subdirectory GToTree-test-data/

Test data being pulled from here: https://zenodo.org/record/7860720#.ZEcWkexlA_8

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 7163k  100 7163k    0     0  1346k      0  0:00:05  0:00:05 --:--:-- 1677k

Running test as:
GToTree -a GToTree-test-data/ncbi_accessions.txt \
        -g GToTree-test-data/genbank_files.txt \
        -f GToTree-test-data/fasta_files.txt \
        -A GToTree-test-data/amino_acid_files.txt \
        -m GToTree-test-data/genome_to_id_map.tsv \
        -p GToTree-test-data/pfam_targets.txt \
        -H Universal -t -D -j 4 -o GToTree-test-output -F -P

The test run includes some things that shouldn't be found, so don't be alarmed when seeing those messages.

Starting run now:

                              GToTree v1.8.2
                     (github.com/AstrobioMike/GToTree)

Downloading NCBI assembly summaries (only done once, or updated after 4 weeks)...

Downloading required NCBI taxonomy data (only needs to be done once)...

--------------------------------- RUN INFO ---------------------------------

Input genome sources include:
  - NCBI accessions listed in GToTree-test-data/ncbi_accessions.txt (8 genomes)
  - Genbank files listed in GToTree-test-data/genbank_files.txt (2 genomes)
  - Fasta files listed in GToTree-test-data/fasta_files.txt (2 genomes)
  - Amino-acid files listed in GToTree-test-data/amino_acid_files.txt (2 genomes)

                         Total input genomes: 14

HMM source to be used:
  - Universal_Hug_et_al.hmm (16 targets)

Options set:
  - The output directory has been set to "GToTree-test-output/"
  - The file "GToTree-test-data/genome_to_id_map.tsv" will be used to modify labels of the specified genomes
  - GTDB taxonomic info will be added to labels where possible
  - NCBI taxonomic info will be added where possible when GTDB is not
  - Number of jobs to run during parallelizable steps has been set to 4
  - Attempting to use http instead of ftp
  - Pfams will be searched from: GToTree-test-data/pfam_targets.txt (2 targets)

** NOTICE **
Filtering by gene-length using the median length of a gene set (set with the -c flag) becomes less reliable with fewer genomes. With 14 total input genomes, if a lot of sequences are dropped, consider increasing the parameter and/or visually inspecting the alignments.

More info can be found here:
  github.com/AstrobioMike/GToTree/wiki/Things-to-consider

            Moving forward with "-c" set to 0.2 this run.

Downloading and parsing archaeal and bacterial metadata tables from GTDB (only needs to be done once)...

##############################################################################

Downloading HMMs for additional Pfam targets

##############################################################################

##############################################################################

Working on the genomes provided as NCBI accessions

##############################################################################

** NOTICE **
1 accession(s) not successfully found at NCBI.

Reported in "GToTree-test-output/run_files/NCBI_accessions_not_found.txt".


 Genome: GCA_003818365.1
         Found 0 of the targeted 16 genes.
         Est. % comp: 0.00; Est. % redund: 0.00

 Genome: GCA_000299365.1
         Found 15 of the targeted 16 genes.
         Est. % comp: 93.75; Est. % redund: 0.00

 Genome: GCA_000172635
         Found 14 of the targeted 16 genes.
         Est. % comp: 87.50; Est. % redund: 0.00

 Genome: GCF_000013045.1
         Found 14 of the targeted 16 genes.
         Est. % comp: 87.50; Est. % redund: 6.25

 Genome: GCF_000153765.1
         Found 14 of the targeted 16 genes.
         Est. % comp: 87.50; Est. % redund: 0.00

 Genome: GCF_000972705.1
         Found 14 of the targeted 16 genes.
         Est. % comp: 87.50; Est. % redund: 0.00

 Genome: GCF_900162675.1
         Found 14 of the targeted 16 genes.
         Est. % comp: 87.50; Est. % redund: 0.00

 ******************************* UPDATE *******************************  
    Of the input genomes provided by NCBI accession:

      1 accession(s) not found at NCBI.
      Reported in "GToTree-test-output/run_files/NCBI_accessions_not_found.txt".

    7 of the total 8 input accessions had their genomes successfully
    downloaded and searched.
 ********************************************************************** 

##############################################################################

Working on the genomes provided as GenBank files

##############################################################################

       It is currently 08:52 AM; the process started at 08:38 AM.
           Current process runtime: 0 hours and 13 minutes.

 Genome: GCF_000012505.1_ASM1250v1_genomic
         Found 14 of the targeted 16 genes.
         Est. % comp: 87.50; Est. % redund: 0.00

 Genome: GCA_900473895.1_N32_genomic

** NOTICE **
This genbank file doesn't appear to have CDS annotations, so we are identifying coding sequences with prodigal.

Reported in "GToTree-test-output/run_files/Genbank_files_with_no_CDSs.txt".


         Found 14 of the targeted 16 genes.
         Est. % comp: 87.50; Est. % redund: 0.00

##############################################################################

Working on the genomes provided as fasta files

##############################################################################

       It is currently 08:52 AM; the process started at 08:38 AM.
           Current process runtime: 0 hours and 13 minutes.

 Genome: GCA_000009925.1_ASM992v1_genomic
         Found 14 of the targeted 16 genes.
         Est. % comp: 87.50; Est. % redund: 0.00

 Genome: GCA_000012825.1_ASM1282v1_genomic
         Found 14 of the targeted 16 genes.
         Est. % comp: 87.50; Est. % redund: 0.00

##############################################################################

Working on the genomes provided as amino acid files

##############################################################################

       It is currently 08:53 AM; the process started at 08:38 AM.
           Current process runtime: 0 hours and 14 minutes.

 Genome: GCF_001886455.1_ASM188645v1_protein
         Found 14 of the targeted 16 genes.
         Est. % comp: 87.50; Est. % redund: 0.00

 Genome: GCF_000020585.3_ASM2058v3_protein
         Found 14 of the targeted 16 genes.
         Est. % comp: 87.50; Est. % redund: 0.00

##############################################################################

Filtering genes by length

##############################################################################

 Keeping those with lengths within 20% of the median for the gene set.

       It is currently 08:53 AM; the process started at 08:38 AM.
           Current process runtime: 0 hours and 14 minutes.

** NOTICE **
2 gene(s) either had no hits in any genome, a hit in only one genome, or only multiple hits per genome... Just so ya know!!

    These included:

                 Ribosomal_L15e
                 Ribosomal_L18

Reported in "GToTree-test-output/run_files/Target-genes-not-found-or-retained.txt".

If interested, you can figure out which of those scenarios was the cause by checking out "GToTree-test-output/SCG_hit_counts.tsv".


Filtering Ribosomal_L2 sequences by length...

    Retained 12 sequences of the initial 12 (100.0%).

Filtering Ribosomal_L16 sequences by length...

    Retained 12 sequences of the initial 12 (100.0%).

Filtering Ribosomal_L22 sequences by length...

    Retained 11 sequences of the initial 12 (91.67%).

Filtering Ribosomal_L14 sequences by length...

    Retained 12 sequences of the initial 12 (100.0%).

Filtering ribosomal_L24 sequences by length...

    Retained 11 sequences of the initial 11 (100.0%).

Filtering Ribosomal_L3 sequences by length...

    Retained 11 sequences of the initial 12 (91.67%).

Filtering Ribosomal_L5 sequences by length...

    Retained 12 sequences of the initial 12 (100.0%).

Filtering Ribosomal_L4 sequences by length...

    Retained 11 sequences of the initial 12 (91.67%).

Filtering Ribosomal_L6 sequences by length...

    Retained 12 sequences of the initial 12 (100.0%).

Filtering Ribosomal_S17 sequences by length...

    Retained 11 sequences of the initial 12 (91.67%).

Filtering Ribosomal_S19 sequences by length...

    Retained 11 sequences of the initial 12 (91.67%).

Filtering Ribosomal_S10 sequences by length...

    Retained 11 sequences of the initial 11 (100.0%).

Filtering Ribosomal_S8 sequences by length...

    Retained 12 sequences of the initial 12 (100.0%).

Filtering Ribosomal_S3_C sequences by length...

    Retained 12 sequences of the initial 12 (100.0%).

##############################################################################

Filtering genomes with too few hits

##############################################################################

 Removing those with fewer than 50% of the total SCGs targeted.

       It is currently 08:53 AM; the process started at 08:38 AM.
           Current process runtime: 0 hours and 14 minutes.

** NOTICE **
1 genome(s) removed from analysis due to having too few hits.

Reported in "GToTree-test-output/run_files/Genomes_removed_for_too_few_hits.tsv".


##############################################################################

Aligning, trimming, and inserting gap-sequences

##############################################################################

       It is currently 08:53 AM; the process started at 08:38 AM.
           Current process runtime: 0 hours and 14 minutes.

        Finished aligning and formatting gene-set Ribosomal_L14.


        Finished aligning and formatting gene-set Ribosomal_L22.


        Finished aligning and formatting gene-set Ribosomal_L16.


        Finished aligning and formatting gene-set Ribosomal_L2.


        Finished aligning and formatting gene-set ribosomal_L24.


        Finished aligning and formatting gene-set Ribosomal_L4.


        Finished aligning and formatting gene-set Ribosomal_L3.


        Finished aligning and formatting gene-set Ribosomal_L5.


        Finished aligning and formatting gene-set Ribosomal_L6.


        Finished aligning and formatting gene-set Ribosomal_S10.


        Finished aligning and formatting gene-set Ribosomal_S17.


        Finished aligning and formatting gene-set Ribosomal_S19.


        Finished aligning and formatting gene-set Ribosomal_S8.


        Finished aligning and formatting gene-set Ribosomal_S3_C.


##############################################################################

Catting all alignments together

##############################################################################

       It is currently 08:53 AM; the process started at 08:38 AM.
           Current process runtime: 0 hours and 14 minutes.

##############################################################################

Adding more informative headers

##############################################################################

** NOTICE **
Strain-level labels were requested in addition to using GTDB taxonomy where possible. This is just a note that there will be no strain-level labels added for those that had GTDB lineage info added.


** NOTICE **
1 accession(s) of the searched 9 were not successfully found in GTDB.

Reported in "GToTree-test-output/run_files/GTDB_accessions_not_found.tsv".


##############################################################################

Parsing results of additional Pfam searches

##############################################################################

##############################################################################

Running FastTreeMP

##############################################################################

       It is currently 08:53 AM; the process started at 08:38 AM.
           Current process runtime: 0 hours and 14 minutes.

FastTree Version 2.1.11 Double precision (No SSE3), OpenMP (4 threads)
Alignment: GToTree-test-output/Aligned_SCGs_mod_names.faa
Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.80
ML Model: Jones-Taylor-Thorton, CAT approximation with 20 rate categories
Ignored unknown character X (seen 780 times)
Initial topology in 0.02 seconds
Refining topology: 14 rounds ME-NNIs, 2 rounds ME-SPRs, 7 rounds ML-NNIs
0.10 seconds: SPR round 2 of 2, 1 of 22 nodes
Total branch-length 3.040 after 0.19 sec
0.33 seconds: ML NNI round 1 of 7, 1 of 10 splits
ML-NNI round 1: LogLk = -30657.663 NNIs 0 max delta 0.00 Time 0.54
0.55 seconds: Site likelihoods with rate category 1 of 20
0.65 seconds: Site likelihoods with rate category 8 of 20
0.76 seconds: Site likelihoods with rate category 15 of 20
Switched to using 20 rate categories (CAT approximation)
Rate categories were divided by 0.843 so that average rate = 1.0
CAT-based log-likelihoods may not be comparable across runs
Use -gamma for approximate but comparable Gamma(20) log-likelihoods
0.87 seconds: ML NNI round 2 of 7, 1 of 10 splits
ML-NNI round 2: LogLk = -28893.218 NNIs 0 max delta 0.00 Time 1.09
Turning off heuristics for final round of ML NNIs (converged)
1.08 seconds: ML NNI round 3 of 7, 1 of 10 splits
ML-NNI round 3: LogLk = -28893.161 NNIs 0 max delta 0.00 Time 1.30 (final)
1.29 seconds: ML Lengths 1 of 10 splits
Optimize all lengths: LogLk = -28893.159 Time 1.44
Total time: 1.67 seconds Unique: 12/12 Bad splits: 0/9

#################################################################################

Done!!

#################################################################################

Overall, 12 genomes of the input 14 were retained (see notes below).

Tree written to:
    GToTree-test-output/GToTree-test-output.tre

Alignment written to:
    GToTree-test-output/Aligned_SCGs_mod_names.faa

Main genomes summary table written to:
    GToTree-test-output/Genomes_summary_info.tsv

Summary table with hits per target gene per genome written to:
    GToTree-test-output/SCG_hit_counts.tsv

Outputs from Pfam searching written to:
    GToTree-test-output/Pfam_search_results/

Partitions file (for downstream use with mixed-model treeing) written to:
    GToTree-test-output/run_files/Partitions.txt

Notes:

    1 accession(s) not successfully found at NCBI.
    1 genome(s) removed due to having too few hits to the targeted SCGs.
    2 gene(s) either had no hits or only multiple hits in each genome.

Reported along with additional informative run files in:
    GToTree-test-output/run_files/

Log file written to:
    GToTree-test-output/gtotree-runlog.txt

Programs used and their citations have been written to:
    GToTree-test-output/citations.txt

                                     Total process runtime: 0 hours and 14 minutes.
                                                  Happy Thursday :)

Test completed! See here for how things should look: https://github.com/AstrobioMike/GToTree/wiki/Installation#test-run

You can clear out the test data and results by running: gtt-clean-after-test.sh

(gtotree) pada358b@tauruslogin3:/beegfs/ws/0/pada358b-conda$

AstrobioMike commented 1 year ago

Awesome, yep, that all looks good! Thanks for writing in and helping to smooth this out.

And thanks for the recovery wishes :)