bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Problem installing genome data #3297

Closed: DolapoA closed this issue 4 years ago

DolapoA commented 4 years ago

Version info

To Reproduce (exact bcbio command you have used):

PIPEDIR="/SAN/colcc/lab-software/bcbio-pipeline"

python ${PIPEDIR}/bcbio_nextgen_install.py $PIPEDIR --tooldir ${PIPEDIR}/tools \
--datatarget vep \
--datatarget battenberg \
--datatarget gnomad \
--genomes hg38 \
--genomes hg19 \
--genomes GRCh37 \
--genomes mm9 \
--genomes mm10 \
--genomes phix \
--aligners bwa

Observed behavior (error message or bcbio output):

Traceback (most recent call last):
  File "/SAN/colcc/pillaylab-software/bcbio-pipeline/anaconda/bin/bcbio_nextgen.py", line 228, in <module>
    install.upgrade_bcbio(kwargs["args"])
  File "/SAN/colcc/pillaylab-software/bcbio-pipeline/anaconda/lib/python3.7/site-packages/bcbio/install.py", line 107, in upgrade_bcbio
  File "/SAN/colcc/pillaylab-software/bcbio-pipeline/anaconda/lib/python3.7/site-packages/bcbio/install.py", line 377, in upgrade_bcbio_data
  File "/home/dajayi/general_output/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 354, in install_data_local
    _prep_genomes(env, genomes, genome_indexes, ready_approaches, data_filedir)
  File "/home/dajayi/general_output/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 480, in _prep_genomes
    retrieve_fn(env, manager, gid, idx)
  File "/home/dajayi/general_output/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 875, in _install_with_ggd
    ggd.install_recipe(os.getcwd(), env.system_install, recipe_file, gid)
  File "/home/dajayi/general_output/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/ggd.py", line 30, in install_recipe
    recipe["recipe"]["full"]["recipe_type"], system_install)
  File "/home/dajayi/general_output/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/ggd.py", line 62, in _run_recipe
    subprocess.check_output(["bash", run_file])
  File "/SAN/colcc/pillaylab-software/bcbio-pipeline/anaconda/lib/python3.7/subprocess.py", line 411, in check_output
  File "/SAN/colcc/pillaylab-software/bcbio-pipeline/anaconda/lib/python3.7/subprocess.py", line 512, in run
subprocess.CalledProcessError: Command '['bash', '/SAN/colcc/pillaylab-software/bcbio-pipeline/genomes/Hsapiens/hg19/txtmp/ggd-run.sh']' returned non-zero exit status 4.
Checking required dependencies
Installing isolated base python installation
Installing mamba
Installing conda-build
Installing bcbio-nextgen
Installing data and third party dependencies
Traceback (most recent call last):
  File "/SAN/colcc/pillaylab-software/bcbio-pipeline/bcbio_nextgen_install.py", line 290, in <module>
    main(parser.parse_args(), sys.argv[1:])
  File "/SAN/colcc/pillaylab-software/bcbio-pipeline/bcbio_nextgen_install.py", line 51, in main
    subprocess.check_call([bcbio, "upgrade"] + _clean_args(sys_argv, args))
  File "/usr/lib64/python2.7/subprocess.py", line 542, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/SAN/colcc/pillaylab-software/bcbio-pipeline/anaconda/bin/bcbio_nextgen.py', 'upgrade', '--tooldir', '/SAN/colcc/pillaylab-software/bcbio-pipeline/tools', '--datatarget', 'vep', '--datatarget', 'battenberg', '--datatarget', 'gnomad', '--genomes', 'hg38', '--genomes', 'hg19', '--genomes', 'GRCh37', '--genomes', 'mm9', '--genomes', 'mm10', '--genomes', 'phix', '--aligners', 'bwa', '--data']' returned non-zero exit status 1

Expected behavior: Completed installation, including battenberg, vep and gnomad.

Log files (please attach, 10MB max): bcbio_pipeline_installation.txt

naumenko-sa commented 4 years ago

Hi @DolapoA !

This would be a huge installation. You could try installing with --nodata first and then installing the datasets one by one with bcbio_nextgen.py upgrade -u skip (see https://bcbio-nextgen.readthedocs.io/en/latest/contents/installation.html).
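A staged install along those lines could be sketched roughly as follows. This is a dry run that only prints the commands. The genome and data-target names come from the original command, pairing the data targets with GRCh37 is just an illustrative assumption, and `print_staged_commands` is a hypothetical helper name.

```shell
# Dry run: print the staged upgrade commands instead of executing them.
# Assumption: bcbio_nextgen.py is on PATH once the --nodata install has finished.
print_staged_commands() {
    # One genome per invocation, so a failure points at a single dataset.
    for genome in hg38 hg19 GRCh37 mm9 mm10 phix; do
        echo "bcbio_nextgen.py upgrade -u skip --genomes ${genome} --aligners bwa"
    done
    # Data targets afterwards; pairing them with GRCh37 is only an example.
    for target in vep battenberg gnomad; do
        echo "bcbio_nextgen.py upgrade -u skip --genomes GRCh37 --datatarget ${target}"
    done
}

print_staged_commands
```

Running the printed lines one at a time makes it easier to see which genome or data target triggers a failing ggd-run.sh.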

S

DolapoA commented 4 years ago

Thanks Naumenko, I will try this and get back to you.

D

DolapoA commented 4 years ago

After running the following standard install with nodata:

python ${PIPEDIR}/bcbio_nextgen_install.py $PIPEDIR --tooldir ${PIPEDIR}/tools \
--nodata

I encountered the following error:

# >>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<

    Traceback (most recent call last):
      File "/SAN/colcc/lab-software/bcbio-pipeline/anaconda/lib/python3.6/site-packages/conda/exceptions.py", line 1079, in __call__
        return func(*args, **kwargs)
      File "/SAN/colcc/lab-software/bcbio-pipeline/anaconda/lib/python3.6/site-packages/conda/cli/main.py", line 84, in _main
        exit_code = do_call(args, p)
      File "/SAN/colcc/lab-software/bcbio-pipeline/anaconda/lib/python3.6/site-packages/conda/cli/conda_argparse.py", line 82, in do_call
        return getattr(module, func_name)(args, parser)
      File "/SAN/colcc/lab-software/bcbio-pipeline/anaconda/lib/python3.6/site-packages/conda/cli/main_install.py", line 20, in execute
        install(args, parser, 'install')
      File "/SAN/colcc/lab-software/bcbio-pipeline/anaconda/lib/python3.6/site-packages/conda/cli/install.py", line 265, in install
        should_retry_solve=(_should_retry_unfrozen or repodata_fn != repodata_fns[-1]),
      File "/SAN/colcc/lab-software/bcbio-pipeline/anaconda/lib/python3.6/site-packages/conda/core/solve.py", line 117, in solve_for_transaction
        should_retry_solve)
      File "/SAN/colcc/lab-software/bcbio-pipeline/anaconda/lib/python3.6/site-packages/conda/core/solve.py", line 158, in solve_for_diff
        force_remove, should_retry_solve)
      File "/SAN/colcc/lab-software/bcbio-pipeline/anaconda/lib/python3.6/site-packages/conda/core/solve.py", line 262, in solve_final_state
        ssc = self._collect_all_metadata(ssc)
      File "/SAN/colcc/lab-software/bcbio-pipeline/anaconda/lib/python3.6/site-packages/conda/common/io.py", line 88, in decorated
        return f(*args, **kwds)
      File "/SAN/colcc/lab-software/bcbio-pipeline/anaconda/lib/python3.6/site-packages/conda/core/solve.py", line 415, in _collect_all_metadata
        index, r = self._prepare(prepared_specs)
      File "/SAN/colcc/lab-software/bcbio-pipeline/anaconda/lib/python3.6/site-packages/conda/core/solve.py", line 1011, in _prepare
        self.subdirs, prepared_specs, self._repodata_fn)
      File "/SAN/colcc/lab-software/bcbio-pipeline/anaconda/lib/python3.6/site-packages/conda/core/index.py", line 228, in get_reduced_index
        repodata_fn=repodata_fn)
      File "/SAN/colcc/lab-software/bcbio-pipeline/anaconda/lib/python3.6/site-packages/conda/core/subdir_data.py", line 105, in query_all
        result = tuple(concat(executor.map(subdir_query, channel_urls)))
      File "/SAN/colcc/lab-software/bcbio-pipeline/anaconda/lib/python3.6/concurrent/futures/_base.py", line 575, in map
        fs = [self.submit(fn, *args) for args in zip(*iterables)]
      File "/SAN/colcc/lab-software/bcbio-pipeline/anaconda/lib/python3.6/site-packages/conda/common/io.py", line 560, in submit
        self._adjust_thread_count()
      File "/SAN/colcc/lab-software/bcbio-pipeline/anaconda/lib/python3.6/concurrent/futures/thread.py", line 142, in _adjust_thread_count
        t.start()
      File "/SAN/colcc/lab-software/bcbio-pipeline/anaconda/lib/python3.6/threading.py", line 846, in start
        _start_new_thread(self._bootstrap, ())
    RuntimeError: can't start new thread

There was also a subprocess error at the end of the log:

Upload successful.
Checking required dependencies
Installing isolated base python installation
Installing mamba
Installing conda-build
Installing bcbio-nextgen
Traceback (most recent call last):
  File "/SAN/colcc/pillaylab-software/bcbio-pipeline/bcbio_nextgen_install.py", line 290, in <module>
    main(parser.parse_args(), sys.argv[1:])
  File "/SAN/colcc/pillaylab-software/bcbio-pipeline/bcbio_nextgen_install.py", line 46, in main
    bcbio = install_conda_pkgs(anaconda, args)
  File "/SAN/colcc/pillaylab-software/bcbio-pipeline/bcbio_nextgen_install.py", line 106, in install_conda_pkgs
    "--file", os.path.basename(REMOTES["requirements"])], env=env)
  File "/usr/lib64/python2.7/subprocess.py", line 542, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/SAN/colcc/pillaylab-software/bcbio-pipeline/anaconda/bin/conda', 'install', '--yes', '--file', 'requirements-conda.txt']' returned non-zero exit status 1

But I'm guessing this always shows up when there's a prior error.

Uploaded log file bcbio_pipeline_installation2.txt

DolapoA commented 4 years ago

I've decided to start the installation from scratch, in as simple a way as possible, bit by bit. I will keep you updated on the progress.

D.

DolapoA commented 4 years ago

Simple run command:

PIPEDIR="/SAN/colcc/lab-software/bcbio-pipeline"
python ${PIPEDIR}/bcbio_nextgen_install.py $PIPEDIR --tooldir ${PIPEDIR}/tools \
--nodata

Error encountered:

# >>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<

    Traceback (most recent call last):
      File "/SAN/colcc/lab-software/bcbio-pipeline/anaconda/lib/python3.7/site-packages/conda/exceptions.py", line 1079, in __call__
        return func(*args, **kwargs)
      File "/SAN/colcc/lab-software/bcbio-pipeline/anaconda/lib/python3.7/site-packages/mamba/mamba.py", line 809, in exception_converter
        raise e
      File "/SAN/colcc/lab-software/bcbio-pipeline/anaconda/lib/python3.7/site-packages/mamba/mamba.py", line 803, in exception_converter
        exit_code = _wrapped_main(*args, **kwargs)
      File "/SAN/colcc/lab-software/bcbio-pipeline/anaconda/lib/python3.7/site-packages/mamba/mamba.py", line 769, in _wrapped_main
        exit_code = do_call(args, p)
      File "/SAN/colcc/lab-software/bcbio-pipeline/anaconda/lib/python3.7/site-packages/mamba/mamba.py", line 659, in do_call
        exit_code = install(args, parser, 'install')
      File "/SAN/colcc/lab-software/bcbio-pipeline/anaconda/lib/python3.7/site-packages/mamba/mamba.py", line 529, in install
        downloaded = transaction.prompt(PackageCacheData.first_writable().pkgs_dir, repos)
    RuntimeError: Resource temporarily unavailable

`$ /SAN/colcc/lab-software/bcbio-pipeline/anaconda/bin/mamba install --yes --only-deps bcbio-nextgen`

Mamba appears to be implicated.

Installation log: bcbio_pipeline_installation3.txt

naumenko-sa commented 4 years ago

Hi @DolapoA !

I can only think that there is a connection error: https://github.com/TheSnakePit/mamba/blob/master/mamba/mamba.py#L532

From your log I see that you may be running the bcbio installation as an SGE job:

SGE_STDERR_PATH=/home/dajayi/general_output/bcbio_pipeline_installation.o2261298
           SGE_STDIN_PATH=/dev/null
          SGE_STDOUT_PATH=/home/dajayi/general_output/bcbio_pipeline_installation.o2261298

If the job runs on a compute node, that node might have limited internet access, or a proxy server might be required. Could you try installing from a login or transfer node, or ask your sysadmins about the connection?
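A quick way to see which situation applies is to print the scheduler and proxy context from the same shell that runs the installer. The sketch below assumes that SGE exports JOB_ID and SGE_STDOUT_PATH inside jobs (the latter is visible in the log above) and that a proxy, if any, is configured through the usual https_proxy / HTTPS_PROXY variables. `describe_network_context` is a hypothetical helper name.

```shell
# Report whether this shell looks like an SGE batch job and whether a proxy is set.
# Assumption: SGE exports JOB_ID and SGE_STDOUT_PATH inside jobs.
describe_network_context() {
    if [ -n "${JOB_ID:-}" ] || [ -n "${SGE_STDOUT_PATH:-}" ]; then
        echo "scheduler: running inside an SGE job on $(uname -n)"
    else
        echo "scheduler: not inside an SGE job"
    fi
    # Assumption: the proxy, if any, is configured via the conventional variables.
    if [ -n "${https_proxy:-}${HTTPS_PROXY:-}" ]; then
        echo "proxy: configured"
    else
        echo "proxy: none set"
    fi
}

describe_network_context
```

If this reports an SGE job with no proxy set, a compute node without outbound access is a plausible explanation, and the login node is worth trying.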

Sergey

DolapoA commented 4 years ago

Hi @naumenko-sa,

The simple installation worked well. However, when trying to add data I came across an error, and I'm not sure what the cause is.

Script: bcbio_nextgen.py upgrade -u skip --datatarget gnomad --genomes GRCh37

Error:

[E::bcf_hdr_parse_line] Could not parse the header line: "##contig=<I"
[W::bcf_hdr_parse] Could not parse header line: ##contig=<I
[E::bcf_hdr_parse] Could not parse the header, sample line not found
Failed to open -: could not parse header
Failed to open -: unknown file type
[bcf_ordered_reader.cpp:49 BCFOrderedReader] Not a VCF/BCF file: -
[E:bcf_synced_reader.cpp:87 BCFSyncedReader] - not a VCF or BCF file
Traceback (most recent call last):
  File "/SAN/colcc/pillaylab-software/bcbio-pipeline/tools/bin/bcbio_nextgen.py", line 228, in <module>
    install.upgrade_bcbio(kwargs["args"])
  File "/SAN/colcc/pillaylab-software/bcbio-pipeline/anaconda/lib/python3.6/site-packages/bcbio/install.py", line 107, in upgrade_bcbio
    upgrade_bcbio_data(args, REMOTES)
  File "/SAN/colcc/pillaylab-software/bcbio-pipeline/anaconda/lib/python3.6/site-packages/bcbio/install.py", line 377, in upgrade_bcbio_data
    args.cores, ["ggd", "s3", "raw"])
  File "/home/dajayi/scripts/bcbio_pipeline/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 354, in install_data_local
    _prep_genomes(env, genomes, genome_indexes, ready_approaches, data_filedir)
  File "/home/dajayi/scripts/bcbio_pipeline/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 480, in _prep_genomes
    retrieve_fn(env, manager, gid, idx)
  File "/home/dajayi/scripts/bcbio_pipeline/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 875, in _install_with_ggd
    ggd.install_recipe(os.getcwd(), env.system_install, recipe_file, gid)
  File "/home/dajayi/scripts/bcbio_pipeline/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/ggd.py", line 30, in install_recipe
    recipe["recipe"]["full"]["recipe_type"], system_install)
  File "/home/dajayi/scripts/bcbio_pipeline/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/ggd.py", line 62, in _run_recipe
    subprocess.check_output(["bash", run_file])
  File "/SAN/colcc/pillaylab-software/bcbio-pipeline/anaconda/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/SAN/colcc/pillaylab-software/bcbio-pipeline/anaconda/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['bash', '/SAN/colcc/pillaylab-software/bcbio-pipeline/genomes/Hsapiens/GRCh37/txtmp/ggd-run.sh']' returned non-zero exit status 1.

Dolapo.

naumenko-sa commented 4 years ago

Hi @DolapoA!

Sorry about the delay!

It is a valid concern - we had to update our grch37 gnomad recipe.

I've updated it (https://github.com/chapmanb/cloudbiolinux/pull/361). Could you please help to test it? Some hints are here: https://github.com/chapmanb/cloudbiolinux/blob/master/doc/hacking.md#testing-a-ggd-recipe

Sergey

IvantheDugtrio commented 4 years ago

Hi @naumenko-sa

Running the new gnomad recipe, I get an error in ggd-run.sh, line 14: vcf_prefix: unbound variable. Looking at how ggd-run.sh works, I think you meant to use url_prefix instead of vcf_prefix for line 14.
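For anyone who has not met this failure mode, it can be reproduced in isolation. The sketch below is not the real recipe: only the two variable names come from the error message, while the URL and file name are placeholders. It assumes the generated script runs with `set -u`, which is what produces an "unbound variable" abort.

```shell
# Reproduce the failure mode in isolation. The variable names come from the
# error above; the URL and file name are placeholders, not the real recipe.
demo_script='
set -eu
url_prefix="https://example.org/gnomad"
echo "${vcf_prefix}/chr1.vcf.gz"   # typo: vcf_prefix was never assigned
'
if sh -c "$demo_script" 2>/dev/null; then
    echo "script succeeded"
else
    echo "script aborted: unbound variable"
fi

# With the intended name the same line expands normally.
fixed_script='
set -eu
url_prefix="https://example.org/gnomad"
echo "${url_prefix}/chr1.vcf.gz"
'
sh -c "$fixed_script"
```

The first script exits with a non-zero status as soon as the unset name is expanded, which is what surfaces upstream as a CalledProcessError from ggd-run.sh.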

Thanks, Ivan

DolapoA commented 4 years ago

Hi @naumenko-sa

I encountered the same error as @IvantheDugtrio. As he mentioned, I think that typo could be part of the problem.

Regards, Dolapo.

pfpjs commented 4 years ago

Dear @DolapoA,

The latest cloudbiolinux includes a fix for that specific issue.

Cheers, -- Paulo

DolapoA commented 4 years ago

The error I mentioned on 11/08/20 seems to be specifically related to downloading the GRCh37 genome as opposed to gnomad. Or are you saying they're the same?

roryk commented 4 years ago

Hi @DolapoA,

Yup, the problem was in grabbing gnomAD as part of the GRCh37 genome installation. You should be all set now, feel free to reopen if this didn't end up fixing your issue though. Thanks so much!

DolapoA commented 4 years ago

The command I ran:

bcbio_nextgen.py upgrade -u skip --genomes GRCh37

I modified the gnomad recipe as instructed under "In bcbio the alternative instruction is to", using the latest gnomad script. However, I got this error when I tried to run the upgrade command above. Please bear in mind that I ran it previously with --datatarget gnomad and it seemed to complete the gnomad part, which is why I've left that option out:

Upgrading bcbio
Upgrading bcbio-nextgen data files
List of genomes to get (from the config file at '{'genomes': [{'dbkey': 'GRCh37', 'name': 'Human (GRCh37)', 'indexes': ['seq', 'twobit'], 'annotations': ['GA4GH_problem_regions', 'capture_regions', 'MIG', 'prioritize', 'dbsnp', 'hapmap', '1000g_omni_snps', 'ACMG56_genes', '1000g_snps', 'mills_indels', 'clinvar', 'cosmic', 'ancestral', 'qsignature', 'genesplicer', 'effects_transcripts', 'varpon', 'vcfanno', 'viral', 'transcripts', 'RADAR', 'fusion-blacklist', 'mirbase'], 'validation': ['giab-NA12878', 'giab-NA24385', 'giab-NA24631', 'dream-syn3', 'dream-syn4', 'giab-NA12878-NA24385-somatic', 'giab-NA24143', 'giab-NA24149', 'giab-NA24694', 'giab-NA24695']}], 'genome_indexes': ['rtg'], 'install_liftover': False, 'install_uniref': False}'): Human (GRCh37)
Running GGD recipe: GRCh37 srnaseq 20180710
2020-09-30 12:30:31 URL: ftp://mirbase.org/pub/mirbase/20/genomes/hsa.gff3 [519390] -> "hsahg19.gff3" [1]

gzip: refGene.txt.gz: decompression OK, trailing garbage ignored
Traceback (most recent call last):
  File "/SAN/colcc/pillaylab-software/bcbio-pipeline/tools/bin/bcbio_nextgen.py", line 228, in <module>
    install.upgrade_bcbio(kwargs["args"])
  File "/SAN/colcc/pillaylab-software/bcbio-pipeline/anaconda/lib/python3.6/site-packages/bcbio/install.py", line 107, in upgrade_bcbio
    upgrade_bcbio_data(args, REMOTES)
  File "/SAN/colcc/pillaylab-software/bcbio-pipeline/anaconda/lib/python3.6/site-packages/bcbio/install.py", line 377, in upgrade_bcbio_data
    args.cores, ["ggd", "s3", "raw"])
  File "/home/dajayi/scripts/bcbio_pipeline/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 354, in install_data_local
    _prep_genomes(env, genomes, genome_indexes, ready_approaches, data_filedir)
  File "/home/dajayi/scripts/bcbio_pipeline/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 480, in _prep_genomes
    retrieve_fn(env, manager, gid, idx)
  File "/home/dajayi/scripts/bcbio_pipeline/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 875, in _install_with_ggd
    ggd.install_recipe(os.getcwd(), env.system_install, recipe_file, gid)
  File "/home/dajayi/scripts/bcbio_pipeline/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/ggd.py", line 30, in install_recipe
    recipe["recipe"]["full"]["recipe_type"], system_install)
  File "/home/dajayi/scripts/bcbio_pipeline/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/ggd.py", line 62, in _run_recipe
    subprocess.check_output(["bash", run_file])
  File "/SAN/colcc/pillaylab-software/bcbio-pipeline/anaconda/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/SAN/colcc/pillaylab-software/bcbio-pipeline/anaconda/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['bash', '/SAN/colcc/pillaylab-software/bcbio-pipeline/genomes/Hsapiens/GRCh37/txtmp/ggd-run.sh']' returned non-zero exit status 2.

roryk commented 4 years ago

Thanks, looks like the mirbase installation script isn't working correctly, looking at it now.

roryk commented 4 years ago

Hi @DolapoA,

I think this is due to zcat not being the same as gunzip -c on your system. We haven't run into this before, but it looks like that is a thing on some UNIX systems (see https://en.wikibooks.org/wiki/Guide_to_Unix/Commands/File_Compression#zcat). If you nuke the tmpbcbio-install directory where you were running the upgrades you should get an updated recipe that fixes this. Unfortunately some of the data you might have installed already might be corrupted, sorry about that. To be safe I'd nuke the install and start over.
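To illustrate the portability point, here is a small self-contained sketch on a throwaway file. It does not touch any bcbio data and is not the recipe itself. It assumes a standard gzip is installed; `gzip -dc` is equivalent to `gunzip -c` and does not depend on how the local zcat behaves.

```shell
# Self-contained demo on a throwaway file; no bcbio data is touched.
workdir="$(mktemp -d)"
printf 'chr1\t100\t200\n' > "${workdir}/example.txt"
gzip -c "${workdir}/example.txt" > "${workdir}/example.txt.gz"

# gzip -dc (equivalent to gunzip -c) always expects gzip input, whereas a
# traditional zcat on some Unix systems expects compress(1) ".Z" files.
decompressed="$(gzip -dc "${workdir}/example.txt.gz")"
echo "${decompressed}"

# gzip -t checks integrity without writing output; handy for spotting
# truncated or otherwise damaged downloads.
if gzip -t "${workdir}/example.txt.gz"; then
    echo "archive OK"
fi
rm -rf "${workdir}"
```

Running `gzip -t` on files such as refGene.txt.gz in the txtmp directory may also help to tell a bad download apart from a decompression-tool mismatch.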

DolapoA commented 4 years ago

Thanks I'll try that.

roryk commented 4 years ago

Let me know if this doesn't fix it. The other reason why this might not be working is that wget is not able to download the files. If that is the case, I think I have a fix for that as well, and it won't require you to re-install.

naumenko-sa commented 4 years ago

Sorry, I've reverted it: there is no gzcat on the 3 Linux systems I checked (CentOS, CentOS, Fedora). I have only found it on macOS.

naumenko-sa commented 4 years ago

I have just run this recipe and it runs OK: https://github.com/chapmanb/cloudbiolinux/blob/master/ggd-recipes/hg19/mirbase.yaml

Could you please try to re-run it? Maybe it was a mirbase server issue?

roryk commented 4 years ago

Thank you!