antonisdim / haystac

Code repository for the HAYSTAC pipeline
MIT License

Issue downloading database on a cluster (Protocol not supported) #18

Closed BenjaminGuinet closed 9 months ago

BenjaminGuinet commented 11 months ago

Dear Haystac developers, I tried to use Haystac on a cluster, but I got the following error message in the bgzip step: "Protocol not supported". Do you have any idea what is going on?

Here is the command I used (the result is the same if I query Yersinia instead).

haystac database --mode build --query '"Clostridium"[Organism] AND "complete genome"[All Fields]' --output Clostridium_db --cores 20

Thanks for your time.

Here are the messages:


haystac database --mode build --query  '"Clostridium"[Organism] AND "complete genome"[All Fields]' --output Clostridium_db --cores 20
HAYSTAC v 0.4.10

Date: 2023-10-01 22:18:42.682037

Config parameters:

 mode: build
 db_output: /crex/proj/sprok/nobackup/GRNEDEL/Meta_project/Test_folder/Scripts/Clostridium_db
 query: "Clostridium"[Organism] AND "complete genome"[All Fields]
 bowtie2_scaling: 25.0
 bowtie2_threads_db: 4
 rank: species
 cores: 20
 mem: 128373

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 20
Rules claiming more threads will be scaled down.
Provided resources: entrez_api=3, mem_mb=128373
Job counts:
    count   jobs
    1   calculate_db_chunks
    1   entrez_db_list
    1   entrez_nuccore_query
    1   entrez_pick_sequences
    1   entrez_taxa_query
    1   index_all_accessions
    1   index_all_db_chunks
    1   randomise_db_order
    8

[Sun Oct  1 22:18:42 2023]
Job 5: Fetching sequence metadata from the NCBI Nucleotide database for the query.

[Sun Oct  1 22:18:50 2023]
Finished job 5.
1 of 8 steps (12%) done

[Sun Oct  1 22:18:50 2023]
Job 6: Querying the NCBI Taxonomy database and fetching taxonomic metadata.

[Sun Oct  1 22:18:54 2023]
Finished job 6.
2 of 8 steps (25%) done

[Sun Oct  1 22:18:54 2023]
Job 4: Selecting the longest sequence per taxon in the entrez query.
Downstream jobs will be updated after completion.

Updating job entrez_db_list.
[Sun Oct  1 22:18:55 2023]
Finished job 4.
3 of 68 steps (4%) done

[Sun Oct  1 22:18:55 2023]
Job 63: Downloading accession NZ_CP133264.1 for taxon Clostridium_sp._OS1-26.

[Sun Oct  1 22:18:55 2023]
Job 33: Downloading accession NZ_CP110856.1 for taxon Clostridium_kluyveri.

[Sun Oct  1 22:18:55 2023]
Job 48: Downloading accession NZ_CP029758.2 for taxon Clostridium_sp._AWRP.

Activating conda environment: /home/grendel/haystac/cache/conda/e18f9b8a5c397a3d990a9b7c0a4d94ce
Activating conda environment: /home/grendel/haystac/cache/conda/e18f9b8a5c397a3d990a9b7c0a4d94ce
Activating conda environment: /home/grendel/haystac/cache/conda/e18f9b8a5c397a3d990a9b7c0a4d94ce
Activating conda environment: /home/grendel/haystac/cache/conda/e18f9b8a5c397a3d990a9b7c0a4d94ce
Activating conda environment: /home/grendel/haystac/cache/conda/e18f9b8a5c397a3d990a9b7c0a4d94ce
Activating conda environment: /home/grendel/haystac/cache/conda/e18f9b8a5c397a3d990a9b7c0a4d94ce
[bgzip] Could not open ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/006/395/GCF_004006395.2_ASM400639v2/GCF_004006395.2_ASM400639v2_genomic.fna.gz: Protocol not supported
[bgzip] Could not open ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/030/915/445/GCF_030915445.1_ASM3091544v1/GCF_030915445.1_ASM3091544v1_genomic.fna.gz: Protocol not supported
[bgzip] Could not open ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/026/240/615/GCF_026240615.1_ASM2624061v1/GCF_026240615.1_ASM2624061v1_genomic.fna.gz: Protocol not supported
[bgzip] Could not open ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/006/395/GCF_004006395.2_ASM400639v2/GCF_004006395.2_ASM400639v2_genomic.fna.gz: Protocol not supported
haystac: error: Unable to download assembly ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/006/395/GCF_004006395.2_ASM400639v2/GCF_004006395.2_ASM400639v2_genomic.fna.gz
None
[bgzip] Could not open ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/026/240/615/GCF_026240615.1_ASM2624061v1/GCF_026240615.1_ASM2624061v1_genomic.fna.gz: Protocol not supported
[bgzip] Could not open ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/030/915/445/GCF_030915445.1_ASM3091544v1/GCF_030915445.1_ASM3091544v1_genomic.fna.gz: Protocol not supported
haystac: error: Unable to download assembly ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/026/240/615/GCF_026240615.1_ASM2624061v1/GCF_026240615.1_ASM2624061v1_genomic.fna.gz
None
haystac: error: Unable to download assembly ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/030/915/445/GCF_030915445.1_ASM3091544v1/GCF_030915445.1_ASM3091544v1_genomic.fna.gz
None
[Sun Oct  1 22:19:13 2023]
Error in rule entrez_download_sequence:
    jobid: 48
    output: /home/grendel/haystac/cache/ncbi/Clostridium_sp._AWRP/NZ_CP029758.2.fasta.gz
    conda-env: /home/grendel/haystac/cache/conda/e18f9b8a5c397a3d990a9b7c0a4d94ce

[Sun Oct  1 22:19:13 2023]
Error in rule entrez_download_sequence:
    jobid: 63
    output: /home/grendel/haystac/cache/ncbi/Clostridium_sp._OS1-26/NZ_CP133264.1.fasta.gz
    conda-env: /home/grendel/haystac/cache/conda/e18f9b8a5c397a3d990a9b7c0a4d94ce

[Sun Oct  1 22:19:13 2023]
Error in rule entrez_download_sequence:
    jobid: 33
    output: /home/grendel/haystac/cache/ncbi/Clostridium_kluyveri/NZ_CP110856.1.fasta.gz
    conda-env: /home/grendel/haystac/cache/conda/e18f9b8a5c397a3d990a9b7c0a4d94ce

RuleException:
CalledProcessError in line 66 of /home/grendel/.conda/envs/haystac/lib/python3.6/site-packages/haystac/workflow/rules/entrez.smk:
Command 'source /sw/apps/conda/latest/rackham_stage/bin/activate '/home/grendel/haystac/cache/conda/e18f9b8a5c397a3d990a9b7c0a4d94ce'; set -euo pipefail;  python /crex/proj/sprok/nobackup/GRNEDEL/Meta_project/Test_folder/Scripts/.snakemake/scripts/tmpbqdkrdmj.entrez_download_sequence.py' returned non-zero exit status 1.
  File "/home/grendel/.conda/envs/haystac/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 2352, in run_wrapper
  File "/home/grendel/.conda/envs/haystac/lib/python3.6/site-packages/haystac/workflow/rules/entrez.smk", line 66, in __rule_entrez_download_sequence
  File "/home/grendel/.conda/envs/haystac/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 569, in _callback
  File "/home/grendel/.conda/envs/haystac/lib/python3.6/concurrent/futures/thread.py", line 56, in run
  File "/home/grendel/.conda/envs/haystac/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 555, in cached_or_run
  File "/home/grendel/.conda/envs/haystac/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 2364, in run_wrapper
Removing output files of failed job entrez_download_sequence since they might be corrupted:
/home/grendel/haystac/cache/ncbi/Clostridium_sp._OS1-26/NZ_CP133264.1.fasta.gz
RuleException:
CalledProcessError in line 66 of /home/grendel/.conda/envs/haystac/lib/python3.6/site-packages/haystac/workflow/rules/entrez.smk:
Command 'source /sw/apps/conda/latest/rackham_stage/bin/activate '/home/grendel/haystac/cache/conda/e18f9b8a5c397a3d990a9b7c0a4d94ce'; set -euo pipefail;  python /crex/proj/sprok/nobackup/GRNEDEL/Meta_project/Test_folder/Scripts/.snakemake/scripts/tmpda3snlyw.entrez_download_sequence.py' returned non-zero exit status 1.
  File "/home/grendel/.conda/envs/haystac/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 2352, in run_wrapper
  File "/home/grendel/.conda/envs/haystac/lib/python3.6/site-packages/haystac/workflow/rules/entrez.smk", line 66, in __rule_entrez_download_sequence
  File "/home/grendel/.conda/envs/haystac/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 569, in _callback
  File "/home/grendel/.conda/envs/haystac/lib/python3.6/concurrent/futures/thread.py", line 56, in run
  File "/home/grendel/.conda/envs/haystac/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 555, in cached_or_run
  File "/home/grendel/.conda/envs/haystac/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 2364, in run_wrapper
RuleException:
CalledProcessError in line 66 of /home/grendel/.conda/envs/haystac/lib/python3.6/site-packages/haystac/workflow/rules/entrez.smk:
Command 'source /sw/apps/conda/latest/rackham_stage/bin/activate '/home/grendel/haystac/cache/conda/e18f9b8a5c397a3d990a9b7c0a4d94ce'; set -euo pipefail;  python /crex/proj/sprok/nobackup/GRNEDEL/Meta_project/Test_folder/Scripts/.snakemake/scripts/tmpdhgoi64n.entrez_download_sequence.py' returned non-zero exit status 1.
  File "/home/grendel/.conda/envs/haystac/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 2352, in run_wrapper
  File "/home/grendel/.conda/envs/haystac/lib/python3.6/site-packages/haystac/workflow/rules/entrez.smk", line 66, in __rule_entrez_download_sequence
  File "/home/grendel/.conda/envs/haystac/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 569, in _callback
  File "/home/grendel/.conda/envs/haystac/lib/python3.6/concurrent/futures/thread.py", line 56, in run
  File "/home/grendel/.conda/envs/haystac/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 555, in cached_or_run
  File "/home/grendel/.conda/envs/haystac/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 2364, in run_wrapper
Trying to restart job 63.

antonisdim commented 11 months ago

Hello Benjamin,

I hope you are doing great, and thank you for reporting this!

I will have a look at this issue this week and get back to you ASAP.

Thank you for your patience!

Best, Antony

antonisdim commented 11 months ago

Hello Benjamin,

I hope you are doing great, and thank you for your patience!

I ran a few tests on both an Ubuntu and a macOS machine using the command you provided, and haystac ran to completion each time, so unfortunately I was not able to reproduce your error.

Judging from the error message you provided, this looks like a bgzip/htslib build-related error. To quote the htslib developers from https://github.com/samtools/htslib/issues/1684:

"Protocol not supported" usually means htslib was built without libcurl. While this is possible to do, it generally causes more confusion downstream such as this, so it's best to get it building complete.

and from https://github.com/samtools/htslib/issues/1515

It looks like htslib has been compiled without finding a working copy of curl, so it has no access to fetching references via https. That'll be where the "Protocol not supported" line comes from.
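
One way to test this diagnosis is to look at the build details that samtools reports about its htslib. The snippet below is only a minimal sketch: it assumes your samtools is recent enough for `samtools --version` to print a "Features:" line containing a `libcurl=yes` or `libcurl=no` token (older releases may not print it, in which case the function returns None). The helper names `libcurl_enabled` and `check_samtools` are made up for this illustration and are not part of haystac.

```python
import re
import subprocess
from typing import Optional


def libcurl_enabled(version_text: str) -> Optional[bool]:
    """Return True/False from a 'libcurl=yes|no' token, or None if no token is present."""
    match = re.search(r"\blibcurl=(yes|no)\b", version_text)
    if match is None:
        return None  # Build details not printed (assumed: older samtools)
    return match.group(1) == "yes"


def check_samtools(executable: str = "samtools") -> Optional[bool]:
    """Run '<executable> --version' and inspect its output for libcurl support."""
    result = subprocess.run(
        [executable, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,  # Text mode; works on the Python 3.6 seen in the traceback
    )
    return libcurl_enabled(result.stdout)


if __name__ == "__main__":
    try:
        print("libcurl support:", check_samtools())
    except OSError as error:
        # samtools is missing from PATH or cannot be executed
        print("Could not run samtools:", error)
```

Run it inside the environment whose tools haystac actually uses. A result of False would be consistent with the "Protocol not supported" message, and None only means the build details were not reported.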

Would you happen to be running haystac on a Debian machine?

Two possible solutions would be:

  1. Use conda to install all the dependencies haystac needs into the environment you run the program from, then run haystac config --use-conda False so that haystac stops using the incomplete samtools build installed in your cache directory. You can use this YAML file for that: https://github.com/antonisdim/haystac_paper/blob/main/performance_tests/environment.yaml
  2. Activate the environment that lives in your haystac cache directory, '/home/grendel/haystac/cache/conda/e18f9b8a5c397a3d990a9b7c0a4d94ce', and do a full reinstallation of samtools with libcurl included.
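
If you go with the first option, a quick sanity check is to confirm which bgzip and samtools your shell now resolves to, since those should come from the environment you installed rather than from the haystac cache. The following is a rough sketch under assumptions: the cache path is copied from your log, the helper names `resolves_inside` and `describe_tool` are hypothetical, and the path comparison is purely lexical (it does not follow symlinks).

```python
import shutil
from pathlib import Path
from typing import Optional


def resolves_inside(tool_path: str, directory: str) -> bool:
    """Return True if tool_path lies inside directory (lexical check, no filesystem access)."""
    tool = Path(tool_path)
    root = Path(directory)
    return root == tool or root in tool.parents


def describe_tool(name: str, cache_dir: str) -> str:
    """Report where an executable is found on PATH and whether that is under cache_dir."""
    found: Optional[str] = shutil.which(name)
    if found is None:
        return f"{name}: not found on PATH"
    if resolves_inside(found, cache_dir):
        origin = "haystac cache environment"
    else:
        origin = "active environment"
    return f"{name}: {found} ({origin})"


if __name__ == "__main__":
    # Assumption: cache location taken from the log in this issue; adjust to your setup.
    cache = "/home/grendel/haystac/cache/conda"
    for tool in ("bgzip", "samtools"):
        print(describe_tool(tool, cache))
```

If either tool is reported as missing, the dependencies are probably not installed in the environment you are running from.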

I will keep looking into this in case a more explicit solution can be found. Of course, please let me know if either of the above works, or if any other issues arise.

I hope this helps!

Best, Antony

BenjaminGuinet commented 7 months ago

Hello, thanks for the reply. The first suggestion solved the issue. Thanks for your help. Benjamin