tomsing1 closed this issue 7 years ago
Thomas; Sorry about the problems, and thanks for the detailed report. I'm not sure exactly what is going wrong but the best way to debug is to try and submit the batch script outside of bcbio:
sbatch ./SLURM_controller7ebeddb7-8cb1-489a-9a9b-a84358a7ed35
This is failing for some reason but unfortunately ipython swallows the error message. Hopefully that will give you useful output that helps us identify the issue. Thanks much.
Thanks a ton for your instantaneous reply. I will submit the script and report back!
When I submitted the job via
sbatch ./SLURM_controller7ebeddb7-8cb1-489a-9a9b-a84358a7ed35
sbatch: error: Batch job submission failed: Requested node configuration is not available
it was clear that the SLURM script requested more memory than is available. It includes the following line:
#SBATCH --mem=4000
but each c3.large instance provides only 3.75 GB of memory. When I adjusted it to #SBATCH --mem=3000, the job was submitted without problems.
Manually editing the script or switching to larger EC2 instances fixes the issue; thanks a lot for pointing out how to troubleshoot!
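For anyone hitting the same mismatch, the memory line can also be patched in place before resubmitting. A sketch using a stand-in script name (the real file is the generated SLURM_controller* script in the work directory):

```shell
# Stand-in for the generated SLURM_controller* script
script=slurm_controller_example.sh
printf '#!/bin/bash\n#SBATCH --mem=4000\n' > "$script"

# c3.large instances expose ~3.75 GB, so drop the request below that
sed 's/--mem=4000/--mem=3000/' "$script" > "${script}.fixed"
grep -- '--mem' "${script}.fixed"
```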
One more question: Are the system requirements (e.g. RAM) for the different pipelines or tools documented somewhere? Or perhaps you have recommendations as to which instance type(s) to use for real datasets?
Thomas; Glad that helped debug the initial issue and get past that. I'll follow up on #159 since it only really helps if you get unstuck and can actually process things.
Regarding resource usage, this is hard to give a ballpark on without more details about what you're trying to run. We typically do not run clusters on AWS for smaller numbers of samples, since you can get pretty high-scale machines with balanced CPU/memory using the m4 series (m4.4xlarge = 16 cores, m4.10xlarge = 40 cores, m4.16xlarge = 64 cores). This saves the overhead of dealing with SLURM and a shared filesystem, and also lets you stop/start as needed and use spot instances more easily.
This is not as automated but we have documentation and ansible scripts to help set it up:
https://github.com/chapmanb/bcbio-nextgen/tree/master/scripts/ansible
Happy to help more with that if that seems like a more cost-effective approach for your work.
Thanks a lot, especially for the pointer to the ansible scripts! I hadn't been aware of this simplified approach yet.
I am following the instructions on Mac OS X, but I am running into an error (see below). (Please note that I am using a fresh conda environment with ansible 1.9.4.)
# create conda environment with python 2.7
conda create --name ansible python=2 ansible boto
source activate ansible
# provide ansible host file to avoid the following error:
# ERROR: Unable to find an inventory file, specify one with -i ?
echo "localhost ansible_connection=local ansible_python_interpreter=python" > ansible_hosts
export ANSIBLE_INVENTORY=ansible_hosts
# create encrypted volume
# --encrypted makes the volume actually encrypted; --output text keeps $VolumeId unquoted
VolumeId=$(aws ec2 create-volume --size 300 --availability-zone us-west-2a --encrypted --query VolumeId --output text)
aws ec2 create-tags --resources "${VolumeId}" --tags Key=Name,Value=bcbio-rnaseq
# create project_vars.yaml
cat <<- EOF > project_vars.yaml
instance_type: t2.small
spot_price: null
image_id: ami-436c6573
vpc_subnet: ****redacted*****
volume: ${VolumeId}
security_group: bcbio_cluster_sg
keypair: ****redacted*****
iam_role: bcbio_full_s3_access
region: us-west-2
EOF
# download ansible playbook
wget \
https://raw.githubusercontent.com/chapmanb/bcbio-nextgen/master/scripts/ansible/launch_aws.yaml
# execute ansible playbook
ansible-playbook -vvv launch_aws.yaml
PLAY [localhost] **************************************************************
TASK: [include_vars project_vars.yaml] ****************************************
ok: [localhost] => {"ansible_facts": {"iam_role": "bcbio_full_s3_access", "image_id": "ami-436c6573", "instance_type": "t2.small", "keypair": "sandmann-public-key", "region": "us-west-2", "security_group": "bcbio_cluster_sg", "spot_price": null, "volume": "vol-0730b69ae8ac1e081", "vpc_subnet": "subnet-3f499a5b"}}
TASK: [Launch EC2 instance] ***************************************************
<127.0.0.1> REMOTE_MODULE ec2 spot_price='' state=present instance_type=t2.small keypair=sandmann-public-key vpc_subnet_id=subnet-3f499a5b image=ami-436c6573 instance_profile_name=bcbio_full_s3_access group=bcbio_cluster_sg
<127.0.0.1> EXEC ['/bin/sh', '-c', 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1482106309.06-113073174003213 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1482106309.06-113073174003213 && echo $HOME/.ansible/tmp/ansible-tmp-1482106309.06-113073174003213']
<127.0.0.1> PUT /var/folders/rq/q19k6q511wqglz6kt7jl_t_r0000gp/T/tmp4je7wr TO /Users/sandmann/.ansible/tmp/ansible-tmp-1482106309.06-113073174003213/ec2
<127.0.0.1> EXEC ['/bin/sh', '-c', u'LANG=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 python /Users/sandmann/.ansible/tmp/ansible-tmp-1482106309.06-113073174003213/ec2; rm -rf /Users/sandmann/.ansible/tmp/ansible-tmp-1482106309.06-113073174003213/ >/dev/null 2>&1']
failed: [localhost -> 127.0.0.1] => {"failed": true}
msg: Either region or ec2_url must be specified
FATAL: all hosts have already failed -- aborting
PLAY RECAP ********************************************************************
to retry, use: --limit @/Users/sandmann/launch_aws.yaml.retry
localhost : ok=1 changed=0 unreachable=0 failed=1
conda list
# packages in environment at /Users/sandmann/anaconda/envs/ansible:
#
ansible 1.9.4 py27_0 bioconda
boto 2.43.0 py27_0
cffi 1.9.1 py27_0
cryptography 1.6 py27_0
enum34 1.1.6 py27_0
httplib2 0.9.2 py27_0 bioconda
idna 2.1 py27_0
ipaddress 1.0.17 py27_0
jinja2 2.8 py27_1
markupsafe 0.23 py27_2
openssl 1.0.2j 0
paramiko 2.0.2 py27_0
pip 9.0.1 py27_1
pyasn1 0.1.9 py27_0
pycparser 2.17 py27_0
pycrypto 2.6.1 py27_4
python 2.7.12 1
pyyaml 3.12 py27_0
readline 6.2 2
setuptools 27.2.0 py27_0
six 1.10.0 py27_0
sqlite 3.13.0 0
tk 8.5.18 0
wheel 0.29.0 py27_0
yaml 0.1.6 0
zlib 1.2.8 3
P.S.: Defining AWS_DEFAULT_REGION doesn't help:
export AWS_DEFAULT_REGION=us-west-2
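For what it's worth, ansible's older boto-based AWS modules fall back to EC2_REGION / AWS_REGION rather than AWS_DEFAULT_REGION (which only the aws CLI itself reads), so exporting those may also work. This is an assumption about the module's environment fallbacks, not something verified here:

```shell
# Assumption: ansible's boto-based ec2 module reads EC2_REGION / AWS_REGION,
# while AWS_DEFAULT_REGION is only honored by the aws CLI itself.
export EC2_REGION=us-west-2
export AWS_REGION=us-west-2
echo "$EC2_REGION"
```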
Thomas;
Apologies about the issues with the initial test. I pushed a fix to the ansible script to pass region to the ansible ec2 instance creation, so hopefully it will work cleanly with your project_vars.yaml if you grab the latest and restart.
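For reference, the shape of that fix is presumably just threading region into the launch task, along these lines (a sketch reconstructed from the module arguments shown in the verbose output above, not the exact committed change):

```yaml
- name: Launch EC2 instance
  local_action:
    module: ec2
    image: "{{ image_id }}"
    instance_type: "{{ instance_type }}"
    keypair: "{{ keypair }}"
    group: "{{ security_group }}"
    vpc_subnet_id: "{{ vpc_subnet }}"
    instance_profile_name: "{{ iam_role }}"
    spot_price: "{{ spot_price }}"
    region: "{{ region }}"
    state: present
```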
Please let us know if you run into any other problems. The ansible scripts are still a work in progress so happy for feedback on pain points and issues.
Great! Thanks for updating the ansible playbook so quickly. I think the region also needs to be included in the Attach working volume step, e.g. by expanding the section like this:
- name: Attach working volume
local_action:
module: ec2_vol
instance: "{{ item.id }}"
id: "{{ volume }}"
device_name: /dev/xvdf
state: present
region: "{{ region }}"
with_items: "{{ ec2.instances }}"
With this modification, the instance is started and the volume is added.
The GATHERING FACTS step still prompts me about an unknown host, though:
The authenticity of host ' (::1)' can't be established.
ECDSA key fingerprint is SHA256:7vRNkf7oEygLf++IWAYpLuhOECfjACY/5t4+GgAuUrI.
Are you sure you want to continue connecting (yes/no)
I checked the $HOME/.ssh/known_hosts file, and the IP is listed there, as expected. Any idea why I am still prompted for an interactive answer?
Thanks again for looking into this! Please let me know what would be useful for you, eg if I can test things out for you.
Thomas; Thanks again for the detailed report and sorry about the continued stumbling blocks. I pushed a fix that I believe will resolve this by setting the SSH configuration options for ansible on the launched hosts rather than trying to update local ssh directly. If you update from the latest GitHub version I hope it'll work cleanly this time. Please let me know if you have any other issues at all.
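As a purely local workaround while testing (not the fix described above), ansible's own host key checking can also be switched off, which skips the interactive prompt entirely; this relaxes SSH security, so it is only reasonable for short-lived, throwaway EC2 instances:

```shell
# Workaround: skip the interactive "authenticity of host" prompt by disabling
# ansible's host key checking. Only sensible for short-lived, throwaway hosts.
export ANSIBLE_HOST_KEY_CHECKING=False
echo "$ANSIBLE_HOST_KEY_CHECKING"
```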
I have successfully followed the instructions and created a cluster (1 head node + 2 compute nodes, each a c3.large instance) in the us-west-1 zone on AWS.
I can successfully log into the head node and start a small RNA-seq workflow on the node itself. But when I try to submit the same workflow to the worker nodes, it seems that the ipython controller fails. I can see the submission job and a bcbio-c job in the queue, but the latter fails immediately. Here is the content of the SLURM_controller7ebeddb7-8cb1-489a-9a9b-a84358a7ed35 file from the work directory, along with the SLURM log file and the bcbio_submit.sh file. The log/ipython/log/ipcluster-18cc7e8c-3b67-476d-906e-06098a524f2c-4351.log file contains an "ERROR | Controller start failed" error message. Finally, here is the list of packages available to conda on the head node, in case that is helpful.
Any idea what might be going on?
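In case it helps others hitting the same failure, pulling the error lines out of the ipcluster log is a quick first check. A sketch with a stand-in log file (substitute the real log/ipython/log/ipcluster-*.log path):

```shell
# Stand-in log file; use the real log/ipython/log/ipcluster-*.log path instead
log=ipcluster_example.log
printf 'INFO | starting controller\nERROR | Controller start failed\n' > "$log"

# Show error lines with their line numbers
grep -n 'ERROR' "$log"
```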