blobtoolkit / pipeline

[Archived] SnakeMake pipeline to run BlobTools on public assemblies
https://blobtoolkit.genomehubs.org
MIT License

Local pipeline fails without internet access #8

Closed kubu4 closed 3 years ago

kubu4 commented 3 years ago

I'm attempting to run the pipeline on a computing cluster at the Univ. of Washington. The computing nodes that are controlled via SLURM do not have internet access. As such, when running the pipeline, I get the following error:

CreateCondaEnvironmentException:
Could not create conda environment from /gscratch/srlab/programs/blobtoolkit/insdc-pipeline/rules/../envs/busco.yaml:

# >>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<

    Traceback (most recent call last):
      File "/gscratch/srlab/programs/anaconda3/lib/python3.7/site-packages/conda/exceptions.py", line 1079, in __call__
        return func(*args, **kwargs)
      File "/gscratch/srlab/programs/anaconda3/lib/python3.7/site-packages/conda_env/cli/main.py", line 80, in do_call
        exit_code = getattr(module, func_name)(args, parser)
      File "/gscratch/srlab/programs/anaconda3/lib/python3.7/site-packages/conda_env/cli/main_create.py", line 119, in execute
        result[installer_type] = installer.install(prefix, pkg_specs, args, env)
      File "/gscratch/srlab/programs/anaconda3/lib/python3.7/site-packages/mamba/mamba_env.py", line 45, in mamba_install
        pool, tuple(ordered_channels_dict.keys()), repos, prepend=False
      File "/gscratch/srlab/programs/anaconda3/lib/python3.7/site-packages/mamba/utils.py", line 99, in load_channels
        use_cache=use_cache,
      File "/gscratch/srlab/programs/anaconda3/lib/python3.7/site-packages/mamba/utils.py", line 74, in get_index
        is_downloaded = dlist.download(True)
    RuntimeError: Download error (28) Timeout was reached [https://conda.anaconda.org/conda-forge/noarch/repodata.json]
    Failed to connect to conda.anaconda.org port 443: Connection timed out

Is there a means by which to get around the need for internet access?

I see this seems to be related to setting up the BUSCO conda environment. Could I set up the environment manually and then run the pipeline? I suspect the pipeline will still attempt to connect to the internet, even if the environment has already been set up, but I'm not positive.

Are there other similar sticking points that will require internet access when running a local pipeline?

Finally, I suppose the real answer to all of this is to run this in a container (Docker/Singularity), but I'm also not entirely sure if those will also require internet access during the process.

rjchallis commented 3 years ago

Snakemake has a --conda-create-envs-only flag that is designed for this scenario:

Conda deployment also works well for offline or air-gapped environments. Running snakemake --use-conda --conda-create-envs-only will only install the required conda environments without running the full workflow. Subsequent runs with --use-conda will make use of the local environments without requiring internet access.

(from the snakemake docs)

So hopefully you will be able to run this on the head node to create the environments before submitting the pipeline.
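As a rough sketch of the two-phase approach described above, the helper below builds the Snakemake command line for each phase: once with --conda-create-envs-only on a node with internet access, and once without it for the offline run. The function name, the file names (Snakefile, config.yaml) and the core count are placeholders for illustration, not values taken from this pipeline. It assumes a Snakemake version that accepts --cores and --conda-prefix. Both phases should share the same working directory or the same --conda-prefix, since that is where Snakemake looks for the environments it has already built.

```python
import shlex


def snakemake_command(snakefile, configfile, cores=1, envs_only=False, conda_prefix=None):
    """Build a snakemake argument list for a conda-based run.

    Hypothetical helper: the paths and core count are supplied by the caller.
    """
    cmd = [
        "snakemake",
        "--snakefile", snakefile,
        "--configfile", configfile,
        "--cores", str(cores),
        "--use-conda",
    ]
    if conda_prefix:
        # Keep environments in one shared location so the offline run finds them.
        cmd += ["--conda-prefix", conda_prefix]
    if envs_only:
        # Phase 1: only create the conda environments, do not run the workflow.
        cmd.append("--conda-create-envs-only")
    return cmd


if __name__ == "__main__":
    # Phase 1 (head node, internet access): create the environments.
    setup = snakemake_command("Snakefile", "config.yaml", envs_only=True)
    # Phase 2 (compute node, no internet): run against the existing environments.
    run = snakemake_command("Snakefile", "config.yaml", cores=16)
    for cmd in (setup, run):
        print(" ".join(shlex.quote(part) for part in cmd))
```

The printed commands can be pasted into a shell on the head node and into the SLURM submission script respectively.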

The Docker image has all of the dependencies pre-installed (which makes it rather a large image) so shouldn't require any internet access.

kubu4 commented 3 years ago

Thanks! That got me past the conda environment installs!

However, now the job is dying because it's trying to download the assembly from NCBI, lineages from BUSCO, and databases from UniProt, despite this being a local assembly. I see in Issue #6 that the poster manually commented out all of the fetch commands in Snakefile_v2. Is that the solution to this? Also, where might I find that file? I can't find it in the .snakemake/ directory (which is in my working directory).

kubu4 commented 3 years ago

where might I find that file?

Found it:

blobtoolkit/insdc-pipeline/Snakefile

Have commented out the various fetch rules. Fingers crossed...

kubu4 commented 3 years ago

This seems to eliminate errors related to not having internet access.
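For anyone repeating this workaround, here is a minimal sketch of how the lines to comment out might be located. It scans Snakefile text for rule definitions and include statements whose name contains "fetch" and reports their line numbers. It assumes the fetch steps are named that way. The sample rule and file names are made up for illustration and are not copied from the pipeline. The script edits nothing, so the commenting out is still done by hand. Any files those rules would have downloaded need to already be in place locally, or Snakemake will complain about missing inputs.

```python
import re

# Matches "rule <name>:" or "include: <path>" at the start of a line.
PATTERN = re.compile(r"^\s*(?:rule\s+(\w+)\s*:|include:\s*(.+))")


def find_fetch_lines(snakefile_text, keyword="fetch"):
    """Return (line_number, line) pairs for rules/includes mentioning keyword."""
    hits = []
    for number, line in enumerate(snakefile_text.splitlines(), start=1):
        match = PATTERN.match(line)
        if match is None:
            continue
        target = match.group(1) or match.group(2)
        if keyword in target.lower():
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    # Hypothetical Snakefile contents, for demonstration only.
    sample = '''include: "rules/fetch_database.smk"
include: "rules/run_busco.smk"

rule fetch_assembly:
    output: "assembly.fasta"

rule all:
    input: "assembly.fasta"
'''
    for number, line in find_fetch_lines(sample):
        print(f"line {number}: {line}")
```

To check a real file, read it with open(path).read() and pass the text to find_fetch_lines.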