tomsing1 closed this issue 7 years ago
Thomas; Sorry about the problems, and thanks for the detailed report. I'm not sure exactly what is going wrong but the best way to debug is to try and submit the batch script outside of bcbio:
sbatch ./SLURM_controller7ebeddb7-8cb1-489a-9a9b-a84358a7ed35
This is failing for some reason but unfortunately ipython swallows the error message. Hopefully that will give you useful output that helps us identify the issue. Thanks much.
Thanks a ton for your instantaneous reply. I will submit the script and report back!
When I submitted the job via
sbatch ./SLURM_controller7ebeddb7-8cb1-489a-9a9b-a84358a7ed35
sbatch: error: Batch job submission failed: Requested node configuration is not available
it was clear that the SLURM script requested more memory than is available. It includes the following line:
#SBATCH --mem=4000
but each c3.large instance provides only 3.75 GB of memory. When I adjusted it to #SBATCH --mem=3000, the job was submitted without problems.
Manually editing the script or switching to larger EC2 instances fixes the issue; thanks a lot for pointing out how to troubleshoot!
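For anyone hitting the same mismatch, the memory line can also be patched in place before resubmitting. A sketch using a stand-in script name (the real file is the generated SLURM_controller* script in the work directory):

```shell
# Stand-in for the generated SLURM_controller* script
script=slurm_controller_example.sh
printf '#!/bin/bash\n#SBATCH --mem=4000\n' > "$script"

# c3.large instances expose ~3.75 GB, so drop the request below that
sed 's/--mem=4000/--mem=3000/' "$script" > "${script}.fixed"
grep -- '--mem' "${script}.fixed"
```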
One more question: Are the system requirements (e.g. RAM) for the different pipelines or tools documented somewhere? Or perhaps you have recommendations as to which instance type(s) to use for real datasets?
Thomas; Glad that helped debug the initial issue and get past that. I'll follow up on #159 since it only really helps if you get unstuck and can actually process things.
Regarding resource usage, this is hard to give a ballpark on without more details about what you're trying to run. We typically do not run clusters on AWS for smaller numbers of samples, since you can get pretty high-scale machines with balanced CPU/memory using the m4 series (m4.4xlarge = 16 cores, m4.10xlarge = 40 cores, m4.16xlarge = 64 cores). This saves the overhead of dealing with SLURM and a shared filesystem, and also lets you stop/start as needed and use spot instances more easily.
This is not as automated but we have documentation and ansible scripts to help set it up:
https://github.com/chapmanb/bcbio-nextgen/tree/master/scripts/ansible
Happy to help more with that if that seems like a more cost-effective approach for your work.
Thanks a lot, especially for the pointer to the ansible scripts! I hadn't been aware of this simplified approach yet.
I am following the instructions on Mac OS X, but I am running into an error (see below). (Please note that I am using a fresh conda environment with ansible 1.9.4.)
# create conda environment with python 2.7
conda create --name ansible python=2 ansible boto
source activate ansible
# provide ansible host file to avoid the following error:
# ERROR: Unable to find an inventory file, specify one with -i ?
echo "localhost ansible_connection=local ansible_python_interpreter=python" > ansible_hosts
export ANSIBLE_INVENTORY=ansible_hosts
# create encrypted volume
# --encrypted makes the volume actually encrypted; --output text keeps $VolumeId unquoted
VolumeId=$(aws ec2 create-volume --size 300 --availability-zone us-west-2a --encrypted --query VolumeId --output text)
aws ec2 create-tags --resources "${VolumeId}" --tags Key=Name,Value=bcbio-rnaseq
# create project_vars.yaml
cat <<- EOF > project_vars.yaml
instance_type: t2.small
spot_price: null
image_id: ami-436c6573
vpc_subnet: ****redacted*****
volume: ${VolumeId}
security_group: bcbio_cluster_sg
keypair: ****redacted*****
iam_role: bcbio_full_s3_access
region: us-west-2
EOF
# download ansible playbook
wget \
https://raw.githubusercontent.com/chapmanb/bcbio-nextgen/master/scripts/ansible/launch_aws.yaml
# execute ansible playbook
ansible-playbook -vvv launch_aws.yaml
PLAY [localhost] **************************************************************
TASK: [include_vars project_vars.yaml] ****************************************
ok: [localhost] => {"ansible_facts": {"iam_role": "bcbio_full_s3_access", "image_id": "ami-436c6573", "instance_type": "t2.small", "keypair": "sandmann-public-key", "region": "us-west-2", "security_group": "bcbio_cluster_sg", "spot_price": null, "volume": "vol-0730b69ae8ac1e081", "vpc_subnet": "subnet-3f499a5b"}}
TASK: [Launch EC2 instance] ***************************************************
<127.0.0.1> REMOTE_MODULE ec2 spot_price='' state=present instance_type=t2.small keypair=sandmann-public-key vpc_subnet_id=subnet-3f499a5b image=ami-436c6573 instance_profile_name=bcbio_full_s3_access group=bcbio_cluster_sg
<127.0.0.1> EXEC ['/bin/sh', '-c', 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1482106309.06-113073174003213 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1482106309.06-113073174003213 && echo $HOME/.ansible/tmp/ansible-tmp-1482106309.06-113073174003213']
<127.0.0.1> PUT /var/folders/rq/q19k6q511wqglz6kt7jl_t_r0000gp/T/tmp4je7wr TO /Users/sandmann/.ansible/tmp/ansible-tmp-1482106309.06-113073174003213/ec2
<127.0.0.1> EXEC ['/bin/sh', '-c', u'LANG=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 python /Users/sandmann/.ansible/tmp/ansible-tmp-1482106309.06-113073174003213/ec2; rm -rf /Users/sandmann/.ansible/tmp/ansible-tmp-1482106309.06-113073174003213/ >/dev/null 2>&1']
failed: [localhost -> 127.0.0.1] => {"failed": true}
msg: Either region or ec2_url must be specified
FATAL: all hosts have already failed -- aborting
PLAY RECAP ********************************************************************
to retry, use: --limit @/Users/sandmann/launch_aws.yaml.retry
localhost : ok=1 changed=0 unreachable=0 failed=1
conda list
# packages in environment at /Users/sandmann/anaconda/envs/ansible:
#
ansible 1.9.4 py27_0 bioconda
boto 2.43.0 py27_0
cffi 1.9.1 py27_0
cryptography 1.6 py27_0
enum34 1.1.6 py27_0
httplib2 0.9.2 py27_0 bioconda
idna 2.1 py27_0
ipaddress 1.0.17 py27_0
jinja2 2.8 py27_1
markupsafe 0.23 py27_2
openssl 1.0.2j 0
paramiko 2.0.2 py27_0
pip 9.0.1 py27_1
pyasn1 0.1.9 py27_0
pycparser 2.17 py27_0
pycrypto 2.6.1 py27_4
python 2.7.12 1
pyyaml 3.12 py27_0
readline 6.2 2
setuptools 27.2.0 py27_0
six 1.10.0 py27_0
sqlite 3.13.0 0
tk 8.5.18 0
wheel 0.29.0 py27_0
yaml 0.1.6 0
zlib 1.2.8 3
P.S.: Defining AWS_DEFAULT_REGION doesn't help:
export AWS_DEFAULT_REGION=us-west-2
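For what it's worth, ansible's older boto-based AWS modules fall back to EC2_REGION / AWS_REGION rather than AWS_DEFAULT_REGION (which only the aws CLI itself reads), so exporting those may also work. This is an assumption about the module's environment fallbacks, not something verified here:

```shell
# Assumption: ansible's boto-based ec2 module reads EC2_REGION / AWS_REGION,
# while AWS_DEFAULT_REGION is only honored by the aws CLI itself.
export EC2_REGION=us-west-2
export AWS_REGION=us-west-2
echo "$EC2_REGION"
```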
Thomas;
Apologies about the issues with the initial test. I pushed a fix to the ansible script to pass region to the ansible ec2 instance creation, so hopefully it will work cleanly with your project_vars.yaml if you grab the latest and restart.
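For reference, the shape of that fix is presumably just threading region into the launch task, along these lines (a sketch reconstructed from the module arguments shown in the verbose output above, not the exact committed change):

```yaml
- name: Launch EC2 instance
  local_action:
    module: ec2
    image: "{{ image_id }}"
    instance_type: "{{ instance_type }}"
    keypair: "{{ keypair }}"
    group: "{{ security_group }}"
    vpc_subnet_id: "{{ vpc_subnet }}"
    instance_profile_name: "{{ iam_role }}"
    spot_price: "{{ spot_price }}"
    region: "{{ region }}"
    state: present
```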
Please let us know if you run into any other problems. The ansible scripts are still a work in progress so happy for feedback on pain points and issues.
Great! Thanks for updating the ansible playbook so quickly. I think the region also needs to be included in the Attach working volume step, e.g. by expanding the section like this:
- name: Attach working volume
local_action:
module: ec2_vol
instance: "{{ item.id }}"
id: "{{ volume }}"
device_name: /dev/xvdf
state: present
region: "{{ region }}"
with_items: "{{ ec2.instances }}"
With this modification, the instance is started and the volume is added.
The GATHERING FACTS step still prompts me about an unknown host, though:
The authenticity of host ' (::1)' can't be established.
ECDSA key fingerprint is SHA256:7vRNkf7oEygLf++IWAYpLuhOECfjACY/5t4+GgAuUrI.
Are you sure you want to continue connecting (yes/no)
I checked the $HOME/.ssh/known_hosts file, and the IP is listed there, as expected. Any idea why I am still prompted for an interactive answer?
Thanks again for looking into this! Please let me know what would be useful for you, eg if I can test things out for you.
Thomas; Thanks again for the detailed report and sorry about the continued stumbling blocks. I pushed a fix that I believe will resolve this by setting the SSH configuration options for ansible on the launched hosts rather than trying to update local ssh directly. If you update from the latest GitHub version I hope it'll work cleanly this time. Please let me know if you have any other issues at all.
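As a purely local workaround while testing (not the fix described above), ansible's own host key checking can also be switched off, which skips the interactive prompt entirely; this relaxes SSH security, so it is only reasonable for short-lived, throwaway EC2 instances:

```shell
# Workaround: skip the interactive "authenticity of host" prompt by disabling
# ansible's host key checking. Only sensible for short-lived, throwaway hosts.
export ANSIBLE_HOST_KEY_CHECKING=False
echo "$ANSIBLE_HOST_KEY_CHECKING"
```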
I have successfully followed the instructions and created a cluster (1 head node + 2 compute nodes, each a c3.large instance) in the us-west-1 zone on AWS.
I can successfully log into the head node and start a small RNA-seq workflow on the node itself. But when I try to submit the same workflow to the worker nodes, it seems that the ipython controller fails. I can see the submission job and a bcbio-c job in the queue, but the latter fails immediately. Here is the content of the SLURM_controller7ebeddb7-8cb1-489a-9a9b-a84358a7ed35 file from the work directory, along with the SLURM log file and the bcbio_submit.sh file. The log/ipython/log/ipcluster-18cc7e8c-3b67-476d-906e-06098a524f2c-4351.log file contains an "ERROR | Controller start failed" error message. Finally, here is the list of packages available to conda on the head node, in case that is helpful.
Any idea what might be going on?
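In case it helps others hitting the same failure, pulling the error lines out of the ipcluster log is a quick first check. A sketch with a stand-in log file (substitute the real log/ipython/log/ipcluster-*.log path):

```shell
# Stand-in log file; use the real log/ipython/log/ipcluster-*.log path instead
log=ipcluster_example.log
printf 'INFO | starting controller\nERROR | Controller start failed\n' > "$log"

# Show error lines with their line numbers
grep -n 'ERROR' "$log"
```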