bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

bcbio-nextgen on new computing cluster will not align #3262

Closed drlaurenwasson closed 4 years ago

drlaurenwasson commented 4 years ago

Version info

To Reproduce

#SBATCH -p general
#SBATCH --job-name=bcbionextgen
#SBATCH -c 1
#SBATCH -t 12:00:00
#SBATCH --mem-per-cpu=10G
#SBATCH -e bcbionextgen.err

bcbio_nextgen.py ../config/submission.yaml -n 64 -t ipython -s slurm -q general '-rW=100:00' --retries 3 --timeout 5000

details:

Observed behavior

2020-06-10 10:04:51.649 [IPClusterStart] Loaded config file: /nas/longleaf/home/lwaldron/RNA-seq/submission/work/log/ipython/ipengine_config.py
2020-06-10 10:04:51.649 [IPClusterStart] Looking for ipengine_config in /nas/longleaf/home/lwaldron/RNA-seq/submission/work
2020-06-10 10:04:51.650 [IPClusterStart] Attempting to load config file: ipcluster_6efe6e16_b020_4507_8783_9feb2389b159_config.py
2020-06-10 10:04:51.651 [IPClusterStart] Looking for ipcluster_config in /etc/ipython
2020-06-10 10:04:51.651 [IPClusterStart] Looking for ipcluster_config in /usr/local/etc/ipython
2020-06-10 10:04:51.651 [IPClusterStart] Looking for ipcluster_config in /nas/longleaf/apps/bcbio-nextgen/1.2.0/venv/anaconda/etc/ipython
2020-06-10 10:04:51.651 [IPClusterStart] Looking for ipcluster_config in /nas/longleaf/home/lwaldron/RNA-seq/submission/work/log/ipython
2020-06-10 10:04:51.652 [IPClusterStart] Loaded config file: /nas/longleaf/home/lwaldron/RNA-seq/submission/work/log/ipython/ipcluster_config.py
2020-06-10 10:04:51.652 [IPClusterStart] Looking for ipcluster_config in /nas/longleaf/home/lwaldron/RNA-seq/submission/work

Expected behavior

Hello, my job has been stalled at this spot for 4 hours. If I look at the job scheduler I see this:

61199325 bcbionext+ general rc_fconlo+ 1 RUNNING 0:0
61199325.ba+ batch rc_fconlo+ 1 RUNNING 0:0
61199325.ex+ extern rc_fconlo+ 1 RUNNING 0:0

There is a SLURM_controller file which when opened looks like this:

#!/bin/sh
#SBATCH -p general
#SBATCH -J bcbio-c
#SBATCH -o bcbio-ipcontroller.out.%A_%a
#SBATCH -e bcbio-ipcontroller.err.%A_%a
#SBATCH -t 01-00:00:00
#SBATCH --cpus-per-task=1
#SBATCH -A rc_fconlon_pi
#SBATCH --mem=4000
#SBATCH --W=100:00

export IPYTHONDIR=/nas/longleaf/home/lwaldron/RNA-seq/submission/work/log/ipython
/nas/longleaf/apps/bcbio-nextgen/1.2.0/venv/anaconda/bin/python -E -c 'import resource; cur_proc, max_proc = resource.getrlimit(resource.RLIMIT_NPROC); target_proc = min(max_proc, 10240) if max_proc > 0 else 10240; resource.setrlimit(resource.RLIMIT_NPROC, (max(cur_proc, target_proc), max_proc)); cur_hdls, max_hdls = resource.getrlimit(resource.RLIMIT_NOFILE); target_hdls = min(max_hdls, 10240) if max_hdls > 0 else 10240; resource.setrlimit(resource.RLIMIT_NOFILE, (max(cur_hdls, target_hdls), max_hdls)); from cluster_helper.cluster import VMFixIPControllerApp; VMFixIPControllerApp.launch_instance()' --ip=* --log-to-file --profile-dir="/nas/longleaf/home/lwaldron/RNA-seq/submission/work/log/ipython" --cluster-id="6efe6e16-b020-4507-8783-9feb2389b159" --nodb --hwm=1 --scheme=leastload --HeartMonitor.max_heartmonitor_misses=720 --HeartMonitor.period=5000

Log files Please attach (10MB max): bcbio-nextgen.log, bcbio-nextgen-commands.log, and bcbio-nextgen-debug.log.

Additional context This is the first time I am trying bcbio-nextgen on the UNC computing cluster. I brought it over from HMS, where I learned how to use it. The module was installed by UNC ITS, and I did get this email when I asked them to install it:

"The command in the installation doc for downloading genome data didnt work. None of the options specified in the doc are valid options for that command. Not sure if the user already has the necessary data. If they need this, they should probably contact the authors of this software to see how to get it"

naumenko-sa commented 4 years ago

Hello Lauren @drlaurenwasson !

Yes, bcbio needs to install data: https://bcbio-nextgen.readthedocs.io/en/latest/contents/installation.html#install-data

In your case it could be achieved by running (with bcbio and bcbio tools in PATH):

# install reference genome for mm10
bcbio_nextgen.py -u skip --genomes mm10 --aligners bwa --cores 10
# install RNA-annotation for mm10
bcbio_nextgen.py -u skip --genomes mm10 --datatarget rnaseq
# build STAR index for mm10
bcbio_nextgen.py -u skip --genomes mm10 --aligners star --cores 10
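After commands like these finish, a quick directory check can show whether the mm10 data actually landed on disk. The sketch below is only illustrative: `GENOME_DIR`, the `/bcbio_path` placeholder, and the subdirectory names (`seq`, `bwa`, `star`, `rnaseq`) are assumptions about the genome layout, not something confirmed in this thread, so compare them against your own installation before relying on the result.

```shell
# Assumed location of the mm10 data; /bcbio_path is a placeholder, not a real path.
GENOME_DIR="${GENOME_DIR:-/bcbio_path/genomes/Mmusculus/mm10}"

# Report which of the named subdirectories exist under a base directory.
# Returns 0 if all are present, 1 if any is missing.
check_genome_dirs() {
    base=$1
    shift
    missing=0
    for sub in "$@"; do
        if [ -d "$base/$sub" ]; then
            echo "found:   $base/$sub"
        else
            echo "missing: $base/$sub"
            missing=1
        fi
    done
    return "$missing"
}

# Subdirectory names are assumptions for illustration.
check_genome_dirs "$GENOME_DIR" seq bwa star rnaseq \
    || echo "Some mm10 data directories were not found under $GENOME_DIR"
```

If anything is reported missing, the matching install command above is the one to rerun or to hand to the cluster admins.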

If the admins could follow up here with a particular installation error they see, we may try to resolve it.

Sergey

drlaurenwasson commented 4 years ago

Thank you for your prompt reply @naumenko-sa

I sent them an email and got this response: "A couple of thoughts.

• Is this something you need to install/download once (and then you'll always have it already installed/downloaded)? If so, you can add something to your path for just one session (so not in your .bashrc to load every time): export PATH=:$PATH
• Some users do modify their .bashrc. Keep in mind that loading and removing modules edits your PATH variable. So if you set your PATH variable in your .bashrc you run the real risk that the next time you save a module to your environment or remove one, the module process will overlay, bungle or not be able to make the necessary changes to your PATH. Python and R, for instance, avoid this by using a different variable for local package installs (that can then be saved without risk of interfering with module maintenance): export PYTHONPATH=$HOME/mypython/lib:$PYTHONPATH

It doesn't sound like this is one of those."

I wonder if I could install the data in my personal directory (keeping in mind I only have 50GB) after loading the module, without having to go through them if necessary? I pointed them to this GitHub page to troubleshoot the error, but we shall see.

I'm also a relative newbie in Python. Can you explain what I would need to do to get "with bcbio and bcbio tools in PATH"?

Thank you for your help

naumenko-sa commented 4 years ago

You can't really analyze NGS data with 50GB. Some clusters provide extended disk space for users outside of /home/user. Maybe you need to apply for that. Once you have space, you can install bcbio on your own. mm10 files require 20-50G depending on what exactly you are installing as data targets.
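Before installing into a personal or scratch directory, it may help to check how much space is free there. This is a minimal sketch: `TARGET_DIR` and `REQUIRED_GB` are illustrative names, and the 50 threshold is simply the upper end of the 20-50G estimate above. Note that `df` reports free space on the filesystem, not a per-user quota; quota tools are site-specific, so ask your cluster support which one applies.

```shell
# Directory you plan to install into; defaults to $HOME for illustration only.
TARGET_DIR="${TARGET_DIR:-${HOME:-.}}"
# Upper end of the 20-50G estimate for mm10 data.
REQUIRED_GB=50

# Print the free space, in whole GB, on the filesystem holding a directory.
# df -P gives one line per filesystem; -k reports 1024-byte blocks.
free_gb() {
    df -Pk "$1" | awk 'NR==2 {print int($4 / 1024 / 1024)}'
}

avail=$(free_gb "$TARGET_DIR")
if [ "$avail" -ge "$REQUIRED_GB" ]; then
    echo "OK: ${avail}G free under $TARGET_DIR"
else
    echo "Too small: ${avail}G free under $TARGET_DIR, want at least ${REQUIRED_GB}G"
fi
```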

In a cluster setting, a shared bcbio installation makes the most sense: bcbio installs hundreds of bioinformatics packages via conda, plus most of the databases required for germline, somatic, and RNA-seq NGS analyses and many others (see the user stories supported in the docs). Many users could benefit from that one big shared bcbio instance.

By default, the bcbio installation script modifies ~/.bashrc (the PATH variable). When installing with

python bcbio_nextgen_install.py [bcbio_path] --tooldir=[bcbio_tools_path] --nodata --isolate

you have to add two directories to your PATH (in ~/.bashrc):

export PATH=/bcbio_path/anaconda/bin:/bcbio_path/tools/bin:$PATH

See more here: https://bcbio-nextgen.readthedocs.io/en/latest/contents/installation.html#installation-parameters
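As a rough sketch of what "bcbio and bcbio tools in PATH" means in practice, the snippet below prepends the two directories for the current session only and then checks whether the shell can find `bcbio_nextgen.py`. `BCBIO_PATH` and the helper `path_prepend` are illustrative names, and `/bcbio_path` is the same placeholder used above; substitute the real installation directory.

```shell
# Placeholder install location; replace with the real bcbio directory.
BCBIO_PATH="${BCBIO_PATH:-/bcbio_path}"

# Prepend a directory to PATH only if it is not already listed.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present, leave PATH unchanged
        *) PATH="$1:$PATH" ;;
    esac
}

# Prepend tools/bin first so anaconda/bin ends up at the front,
# matching the order in the export line above.
path_prepend "$BCBIO_PATH/tools/bin"
path_prepend "$BCBIO_PATH/anaconda/bin"
export PATH

# Report whether the bcbio entry point is now visible to the shell.
if command -v bcbio_nextgen.py >/dev/null 2>&1; then
    echo "bcbio_nextgen.py found at: $(command -v bcbio_nextgen.py)"
else
    echo "bcbio_nextgen.py not found on PATH"
fi
```

Run this way, the change lasts only for the current shell session, which sidesteps the concern about module loading rewriting a PATH set in ~/.bashrc.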

SN