Based on the current jupyterhub_config.py, and the batchspawner docs, here's what I think we should use:
c.JupyterHub.spawner_class = 'batchspawner.TorqueSpawner'
c.Spawner.args = ['-s', '/opt/cylc-web/dist/']
c.Spawner.cmd = ['cylc-singleuser']
c.Spawner.http_timeout = 120
# BatchSpawnerBase configuration
c.BatchSpawnerBase.req_nprocs = '1'
c.BatchSpawnerBase.req_runtime = '12:00:00'
c.BatchSpawnerBase.req_memory = '1gb'
# TorqueSpawner configuration
c.TorqueSpawner.batch_script = '''#!/bin/sh
#PBS -l walltime={runtime}
#PBS -l nodes=1:ppn={nprocs}
#PBS -l mem={memory}
#PBS -N cylc-singleuser
#PBS -v {keepvars}
{cmd}
'''
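For reference, the {runtime}, {nprocs}, {memory}, {keepvars} and {cmd} placeholders above get substituted with the req_* values and the spawn command before the script is submitted. A simplified sketch of that substitution, assuming plain str.format-style templating (not batchspawner's actual implementation; the keepvars list is abbreviated):
# Simplified sketch: fill in the batch_script placeholders the way the
# config above implies, then print the script that would be piped to qsub.
batch_script = '''#!/bin/sh
#PBS -l walltime={runtime}
#PBS -l nodes=1:ppn={nprocs}
#PBS -l mem={memory}
#PBS -N cylc-singleuser
#PBS -v {keepvars}
{cmd}
'''

script = batch_script.format(
    runtime='12:00:00',
    nprocs='1',
    memory='1gb',
    keepvars='JUPYTERHUB_API_TOKEN,JUPYTERHUB_API_URL',  # abbreviated list
    cmd='cylc-singleuser --ip="0.0.0.0" --port=51851 -s /opt/cylc-web/dist/',
)
print(script)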
Trying to avoid having to create new containers for Python3, as the current containers have Python 2 only.
For the cylc image, based on Ubuntu, it was simpler: apt install python3 python3-pip. Python 3.6 should be good enough; we are not running Cylc.
Now, for the pbs container, which is CentOS based, I had to follow https://tecadmin.net/install-python-3-7-on-centos/ and install Python 3.7 manually.
Then on cylc, pip3 install jupyterhub, followed by pip3 install batchspawner. batchspawner installed in the blink of an eye, as its dependencies were already met by jupyterhub.
NB: after this, running jupyterhub failed. We also needed configurable-http-proxy, installed via npm. Thankfully it is an Ubuntu-based box, so it was not hard to get working.
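A tiny sanity check (a throwaway snippet, not part of the configuration) to confirm the proxy binary is on the PATH before starting jupyterhub again:
# Throwaway check: is the proxy binary JupyterHub needs on the PATH?
import shutil

proxy = shutil.which('configurable-http-proxy')
if proxy is None:
    raise SystemExit('configurable-http-proxy not found; try: npm install -g configurable-http-proxy')
print('configurable-http-proxy found at', proxy)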
Running everything as root in Docker is bad practice, but I'm ignoring that issue for now.
root@cylc:/tmp/test# pwd
/tmp/test
root@cylc:/tmp/test# vim jupyterhub_config.py
root@cylc:/tmp/test# cat jupyterhub_config.py
c.JupyterHub.spawner_class = 'batchspawner.TorqueSpawner'
c.Spawner.args = ['-s', '/tmp/cylc-dist/']
c.Spawner.cmd = ['cylc-singleuser']
c.Spawner.http_timeout = 120
# BatchSpawnerBase configuration
c.BatchSpawnerBase.req_host = 'pbs'
c.BatchSpawnerBase.req_nprocs = '1'
c.BatchSpawnerBase.req_runtime = '12:00:00'
c.BatchSpawnerBase.req_memory = '1gb'
# TorqueSpawner configuration
c.TorqueSpawner.batch_script = '''#!/bin/sh
#PBS -l walltime={runtime}
#PBS -l nodes=1:ppn={nprocs}
#PBS -l mem={memory}
#PBS -N cylc-singleuser
#PBS -v {keepvars}
{cmd}
'''
root@cylc:/tmp/test#
Had an issue with pip3.7 missing SSL:
pip is configured with locations that require TLS/SSL, however the ssl module in Python is not available
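A quick way to confirm whether a given interpreter was built with the ssl module (a throwaway diagnostic, nothing more):
# Throwaway diagnostic: does this Python build have the ssl module pip needs?
try:
    import ssl
except ImportError as exc:
    print('ssl module missing:', exc)
else:
    print('ssl OK:', ssl.OPENSSL_VERSION)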
Tried other articles explaining how to fix this, without much luck, so I just grabbed Anaconda Python 3.7 :grin: Then a simple pip install . downloaded all the dependencies and installed cylc-singleuser.
Running npm (and an old npm at that) against cylc-web on an old CentOS 6 box is madness. Tried installing nvm, but the box also has an old curl and an even older git.
Gave up and built a version of cylc-web, attached to this comment. As a side note, our current cylc-web build is 4.6 MB. Might need to look into where all these megabytes in the final project files come from and whether the size can be reduced.
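A quick way to see where those megabytes go (a throwaway sketch; the dist path below is just an example):
# Throwaway sketch: list the largest files under a built cylc-web dist directory.
import os

dist = '/tmp/cylc-dist'  # example path
sizes = []
for root, _dirs, files in os.walk(dist):
    for name in files:
        path = os.path.join(root, name)
        sizes.append((os.path.getsize(path), path))
for size, path in sorted(sizes, reverse=True)[:20]:
    print(f'{size / 1024:10.1f} KiB  {path}')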
bash-4.1# cd ..
bash-4.1# ls
Anaconda3-2019.03-Linux-x86_64.sh cylc-dist cylc-uiserver cylc-web-built.tar.gz pymp-rv_iv1ei sshd-stderr---supervisor-Fe3XEg.log sshd-stdout---supervisor-Bdi1hg.log trqauthd-unix
bash-4.1# cylc-singleuser
usage: cylc-singleuser [-h] [-p PORT] -s STATIC
cylc-singleuser: error: the following arguments are required: -s/--static
bash-4.1# cylc-singleuser -p 8888 -s cylc-dist
Ok, so in theory we are good to go. Updated one line in jupyterhub_config.py:
c.Spawner.args = ['-s', '/tmp/cylc-dist/']
First try:
[W 2019-04-10 04:14:13.117 JupyterHub auth:642] Failed to open PAM session for testuser: [PAM Error 14] Cannot make/remove an entry for the specified session
[W 2019-04-10 04:14:13.117 JupyterHub auth:643] Disabling PAM sessions from now on.
[I 2019-04-10 04:14:13.128 JupyterHub batchspawner:188] Spawner submitting job using sudo -E -u testuser qsub
[I 2019-04-10 04:14:13.128 JupyterHub batchspawner:189] Spawner submitted script:
#!/bin/sh
#PBS -l walltime=12:00:00
#PBS -l nodes=1:ppn=1
#PBS -l mem=1gb
#PBS -N cylc-singleuser
#PBS -v JUPYTERHUB_SERVICE_PREFIX,JUPYTERHUB_API_URL,PATH,JUPYTERHUB_HOST,JUPYTERHUB_CLIENT_ID,JUPYTERHUB_API_TOKEN,JUPYTERHUB_OAUTH_CALLBACK_URL,JUPYTERHUB_BASE_URL,JPY_API_TOKEN,JUPYTERHUB_USER
cylc-singleuser --ip="0.0.0.0" --port=51851 -s /tmp/cylc-dist/
[E 2019-04-10 04:14:13.135 JupyterHub user:477] Unhandled error starting testuser's server: /bin/sh: 1: sudo: not found
[D 2019-04-10 04:14:13.148 JupyterHub user:578] Deleting oauth client jupyterhub-user-testuser
[E 2019-04-10 04:14:13.166 JupyterHub web:1788] Uncaught exception GET /hub/user/testuser/ (::ffff:172.20.0.1)
HTTPServerRequest(protocol='http', host='localhost:8000', method='GET', uri='/hub/user/testuser/', version='HTTP/1.1', remote_ip='::ffff:172.20.0.1')
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tornado/web.py", line 1699, in _execute
result = await result
File "/usr/local/lib/python3.5/dist-packages/jupyterhub/handlers/base.py", line 1062, in get
await self.spawn_single_user(user)
File "/usr/local/lib/python3.5/dist-packages/jupyterhub/handlers/base.py", line 715, in spawn_single_user
timedelta(seconds=self.slow_spawn_timeout), finish_spawn_future
File "/usr/lib/python3.5/asyncio/futures.py", line 361, in __iter__
yield self # This tells Task to wait for completion.
File "/usr/lib/python3.5/asyncio/tasks.py", line 296, in _wakeup
future.result()
File "/usr/lib/python3.5/asyncio/futures.py", line 274, in result
raise self._exception
File "/usr/lib/python3.5/asyncio/tasks.py", line 239, in _step
result = coro.send(None)
File "/usr/local/lib/python3.5/dist-packages/jupyterhub/handlers/base.py", line 636, in finish_user_spawn
await spawn_future
File "/usr/local/lib/python3.5/dist-packages/jupyterhub/user.py", line 489, in spawn
raise e
File "/usr/local/lib/python3.5/dist-packages/jupyterhub/user.py", line 409, in spawn
url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
File "/usr/lib/python3.5/asyncio/futures.py", line 361, in __iter__
yield self # This tells Task to wait for completion.
File "/usr/lib/python3.5/asyncio/tasks.py", line 296, in _wakeup
future.result()
File "/usr/lib/python3.5/asyncio/futures.py", line 274, in result
raise self._exception
File "/usr/local/lib/python3.5/dist-packages/batchspawner/batchspawner.py", line 303, in start
job = yield self.submit_batch_script()
File "/usr/lib/python3.5/asyncio/futures.py", line 274, in result
raise self._exception
File "/usr/local/lib/python3.5/dist-packages/batchspawner/batchspawner.py", line 190, in submit_batch_script
out = yield run_command(cmd, input=script, env=self.get_env())
File "/usr/lib/python3.5/asyncio/futures.py", line 274, in result
raise self._exception
File "/usr/local/lib/python3.5/dist-packages/batchspawner/batchspawner.py", line 59, in run_command
raise RuntimeError(eout)
RuntimeError: /bin/sh: 1: sudo: not found
[D 2019-04-10 04:14:13.170 JupyterHub base:890] No template for 500
[E 2019-04-10 04:14:13.218 JupyterHub log:150] {
"X-Forwarded-Host": "localhost:8000",
"Accept-Encoding": "gzip, deflate",
"Referer": "http://localhost:8000/hub/login?next=",
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0",
"X-Forwarded-Proto": "http",
"Accept-Language": "en-US,en;q=0.5",
"Cookie": "jupyterhub-hub-login=\"2|1:0|10:1554869652|20:jupyterhub-hub-login|44:MzkyZGIxYTE1MmZlNDJiZjlkZDE4OGYxZjk5Y2UxMmY=|96284af39e5f764fdce4ff1abc9db7766614e101f5de4b2fae292ee2ecb5d99a\"; jenkins-timestamper-offset=-46800000; username-localhost-8888=\"2|1:0|10:1554238296|23:username-localhost-8888|44:NTNlNWQzMTNiZmIyNDBjZGEwYjlhOTQ5NzI0ZmZmMTM=|12dff019f0af888bfbb21fcc22edd2077fb3dd7afb3191d2b24f612a67ea0f78\"; _xsrf=2|024a1abb|1e0a6ff7e87d6aaa6505cc8f4620856d|1552952779; jupyterhub-session-id=1a884a7c9f4044bd8d57b9d73415c37c",
"Upgrade-Insecure-Requests": "1",
"Connection": "close",
"X-Forwarded-Port": "8000",
"Host": "localhost:8000",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"X-Forwarded-For": "::ffff:172.20.0.1"
}
[E 2019-04-10 04:14:13.218 JupyterHub log:158] 500 GET /hub/user/testuser/ (testuser@::ffff:172.20.0.1) 197.42ms
Second try, after apt install sudo -y:
[E 2019-04-10 04:15:44.980 JupyterHub user:477] Unhandled error starting testuser's server: sudo: qsub: command not found
Third try, after apt install -y torque-client, then editing /var/spool/torque/server_name and changing the server name to pbs.
Got a bit closer, but still no success.
Spawn failed: pbs_iff: cannot read reply from pbs_server No Permission. qsub: cannot connect to server pbs (errno=15007) Unauthorized Request
The problem is that the cluster was built with Cylc submitting jobs by connecting via SSH and then running qsub on the remote pbs node.
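For context, that SSH-then-qsub model looks roughly like the sketch below (a generic illustration only, not Cylc's or batchspawner's actual code; the host name and script are placeholders):
# Generic sketch of the 'SSH into the PBS node, then qsub' submission model.
import subprocess

script = '''#!/bin/sh
#PBS -N example-job
echo hello from the batch job
'''

# Pipe the batch script over SSH to qsub on the remote pbs node.
result = subprocess.run(
    ['ssh', 'pbs', 'qsub'],
    input=script,
    text=True,
    capture_output=True,
)
print(result.stdout or result.stderr)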
Logs from server_logs in Torque PBS:
04/10/2019 05:44:22;0080;PBS_Server.1017;Req;dis_request_read;conflicting version numbers, 1 detected, 2 expected
04/10/2019 05:44:39;0001;PBS_Server.1011;Svr;PBS_Server;LOG_ERROR::Unknown node (15064) in process_host_name_part, host docker not found
04/10/2019 05:45:12;0001;PBS_Server.1011;Svr;PBS_Server;LOG_ERROR::Unknown node (15064) in process_host_name_part, host docker not found
Setting up the PBS client on the Ubuntu node was a bit trickier than I expected. The PBS server was a CentOS box with the adaptivecomputing/torque 5.0.0 installation, while the Ubuntu torque-client package was version 2.x, from PBS Torque I think. The server_logs were showing the client version and the requests failing due to lack of permission. I wasn't sure if there was an issue with the client version, or if I had forgotten to add operators/managers, allowed nodes, etc.
So I went with a different approach: a single CentOS 6 box with the PBS server, which basically meant following the instructions and running docker run -h docker.example.com -p 10022:22 -p 8000:8000 -i -t --name torque --privileged agaveapi/torque bash (the extra part is the -p 8000, so that the JupyterHub port is exposed locally).
Before running the supervisord daemon, also needed:
yum install wget sudo
yum update curl
yum reinstall cracklib-dicts
wget $NVM_DOWNLOAD
nvm install 8.0.0
npm install -g configurable-http-proxy
passwd
passwd testuser
ln -s /usr/local/bin/qsub /usr/bin/qsub (then same for qstat and qdel)
Installed the Cylc packages locally, and then we had everything ready for tests.
UI submitting the PBS job in background via spawner:
Message displaying the command/job setting used:
Our jupyterhub_config.py contains the timeout in seconds (c.Spawner.http_timeout = 120). Here's what happens when the timeout expires:
Couldn't find anything in the PBS server logs. So I looked at the current queues and nodes, and noticed that the nodes were all down:
Fixed after I noticed the -h docker.example.com, which did not match the Docker host. Added docker.example.com with number of processors = 1 to the nodes file in the server_priv folder, restarted supervisord, and then got the following error:
That's because we use argparse.parse_args(), which fails if extra parameters are passed. Fixed by calling .parse_known_args, which returns a tuple (known, unknown). This change was in the cylc_singleuser.py file, and I did not have to re-install anything as the pip installation was editable.
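A minimal illustration of that change (the option names here are just for the example, not necessarily the exact cylc_singleuser.py code):
# parse_args() vs parse_known_args() when the spawner passes extra options.
import argparse

parser = argparse.ArgumentParser(prog='cylc-singleuser')
parser.add_argument('-p', '--port', type=int, default=8888)
parser.add_argument('-s', '--static', required=True)

argv = ['-s', '/tmp/cylc-dist/', '--ip=0.0.0.0', '--port=41453']

# parser.parse_args(argv) would exit with 'unrecognized arguments: --ip=0.0.0.0'.
known, unknown = parser.parse_known_args(argv)
print(known.static)   # /tmp/cylc-dist/
print(known.port)     # 41453
print(unknown)        # ['--ip=0.0.0.0']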
Et voila!
It is also possible to see that the job is still running in PBS:
And the processes as displayed with ps and qstat for reference:
bash-4.1# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 13472 4700 pts/0 Ss 00:35 0:00 bash
root 613 0.0 0.0 66688 5804 pts/0 S 00:51 0:00 /usr/sbin/sshd -D
root 2220 0.0 0.0 11492 2652 pts/1 Ss 01:04 0:00 /bin/bash
root 3498 0.0 0.2 99604 15208 pts/0 S 01:12 0:00 /usr/bin/python /usr/bin/supervisord
root 3501 0.5 1.1 114120 66920 pts/0 SLl 01:12 0:02 /usr/local/sbin/pbs_mom -D -L /var/log/supervisor/pbsmom.log
root 3504 0.1 0.3 774340 23448 pts/0 Sl 01:12 0:00 /usr/local/sbin/pbs_server -D -L /var/log/supervisor/pbsserver.log
root 3505 0.0 0.0 59612 5644 pts/0 S 01:12 0:00 /usr/local/sbin/trqauthd -D
root 3506 0.0 0.1 63036 7064 ? Ss 01:12 0:00 /usr/local/sbin/pbs_sched -p /var/log/supervisor/pbssched.log -L /var/log/supervisor/pbssched.log
root 3643 1.4 0.7 680072 45976 pts/0 Sl+ 01:18 0:00 /usr/local/bin/python3.7 /usr/local/python37/bin/jupyterhub
root 3648 0.5 0.7 1141456 42880 ? Ssl 01:18 0:00 node /.nvm/versions/node/v8.0.0/bin/configurable-http-proxy --ip --port 8000 --api-ip 127.0.0.1 --api-port 8001 --error-target http://127.0.0.1:8
testuser 3662 0.0 0.0 108220 3040 ? Ss 01:18 0:00 -bash
testuser 3673 0.0 0.0 106116 2552 ? S 01:18 0:00 /bin/sh /var/spool/torque/mom_priv/jobs/9.docker.example.com.SC
testuser 3674 1.7 0.5 222260 31220 ? S 01:18 0:01 /usr/local/bin/python3.7 /usr/local/python37/bin/cylc-singleuser --ip=0.0.0.0 --port=41453 -s /tmp/cylc-dist/
root 3695 0.0 0.0 13380 1864 pts/1 R+ 01:19 0:00 ps aux
bash-4.1#
bash-4.1# qstat
Job ID Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
7.docker torque.submit testuser 0 C debug
9.docker cylc-singleuser testuser 00:00:00 R debug
:+1: Very cool setup. I am sure it will find a lot of use at the various partner sites that we'll have to support.
Documentation for the spawner: https://github.com/jupyterhub/batchspawner
We can use these Docker containers to test Cylc + PBS: https://github.com/kinow/cylc-docker/tree/master/pbs.
But we will have to make a few modifications:
- both containers (pbs and cylc) have Python 3
- jupyterhub and batchspawner installed via pip in the cylc container
- cylc-singleuser in the pbs node
- cylc-uiserver installed via pip install -e . in pbs, so that cylc-singleuser is available
- cylc-web assets in the pbs node at /opt/cylc-web/dist/
cylc-singleuser is our UI server, running from the PBS node. It starts the Tornado server and serves both the HTTP GET/POST requests for the REST API (and GraphQL in the future) and the HTTP requests for resources such as images, CSS, HTML, etc. A copy of the files must reside in this node as well.
Good to have these for the test too: