cylc / cylc-uiserver

A Jupyter Server extension that serves the cylc-ui web application for monitoring and controlling Cylc workflows.
https://cylc.org
GNU General Public License v3.0

Test JupyterHub batchspawner with Cylc 8 prototype #15

Closed kinow closed 5 years ago

kinow commented 5 years ago

Documentation for the spawner: https://github.com/jupyterhub/batchspawner

We can use these Docker containers to test Cylc + PBS: https://github.com/kinow/cylc-docker/tree/master/pbs.

We will have to make a few modifications, though.

Good to have these for the test too:

kinow commented 5 years ago

Based on the current jupyterhub_config.py, and the batchspawner docs, here's what I think we should use:

c.JupyterHub.spawner_class = 'batchspawner.TorqueSpawner'
c.Spawner.args = ['-s', '/opt/cylc-web/dist/']
c.Spawner.cmd = ['cylc-singleuser']
c.Spawner.http_timeout = 120

# BatchSpawnerBase configuration
c.BatchSpawnerBase.req_nprocs = '1'
c.BatchSpawnerBase.req_runtime = '12:00:00'
c.BatchSpawnerBase.req_memory = '1gb'
# TorqueSpawner configuration
c.TorqueSpawner.batch_script = '''#!/bin/sh
#PBS -l walltime={runtime}
#PBS -l nodes=1:ppn={nprocs}
#PBS -l mem={memory}
#PBS -N cylc-singleuser
#PBS -v {keepvars}
{cmd}
'''
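For reference, a minimal sketch of how the {placeholders} in batch_script end up as concrete values in the submitted job script. This assumes plain str.format-style substitution, which may not match what batchspawner does internally in every version. The render_batch_script helper and the shortened keepvars value are made up for illustration.

```python
# Sketch only: mimic how the {placeholders} in batch_script are substituted.
# Assumes str.format-style templating; this is not batchspawner's own code.
batch_script = '''#!/bin/sh
#PBS -l walltime={runtime}
#PBS -l nodes=1:ppn={nprocs}
#PBS -l mem={memory}
#PBS -N cylc-singleuser
#PBS -v {keepvars}
{cmd}
'''


def render_batch_script(template, **subvars):
    """Fill a job script template with the given values (hypothetical helper)."""
    return template.format(**subvars)


rendered = render_batch_script(
    batch_script,
    runtime='12:00:00',
    nprocs='1',
    memory='1gb',
    keepvars='PATH,JUPYTERHUB_API_TOKEN',  # illustrative subset only
    cmd='cylc-singleuser -s /opt/cylc-web/dist/',
)
print(rendered)
```

Printing the rendered script before starting the hub is a cheap way to check that the #PBS directives look like what qsub expects.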
kinow commented 5 years ago

Make sure both containers (pbs and cylc) have Python 3

Trying to avoid having to create new containers for Python3, as the current containers have Python 2 only.

For the cylc image, based on Ubuntu, it was simpler: apt install python3 python3-pip. Python 3.6 should be good enough, as we are not running Cylc.

Now, for the pbs image, which is CentOS based, I had to follow https://tecadmin.net/install-python-3-7-on-centos/ and install Python 3.7 manually.

kinow commented 5 years ago

Include jupyterhub and batchspawner installed via pip in cylc container

Then on cylc, pip3 install jupyterhub, followed by pip3 install batchspawner. The batchspawner was installed in a blink of an eye, due to its dependencies being already met by jupyterhub.

NB: after this, running jupyterhub failed. We also needed configurable-http-proxy installed via npm. Thankfully it is an Ubuntu-based box, so it was not so hard to get it working.

kinow commented 5 years ago

Create configuration file to load cylc-singleuser in the pbs node

Running Docker as root is bad practice, but ignoring this issue for now.

root@cylc:/tmp/test# pwd
/tmp/test
root@cylc:/tmp/test# vim jupyterhub_config.py
root@cylc:/tmp/test# cat jupyterhub_config.py 
c.JupyterHub.spawner_class = 'batchspawner.TorqueSpawner'
c.Spawner.args = ['-s', '/tmp/cylc-dist/']
c.Spawner.cmd = ['cylc-singleuser']
c.Spawner.http_timeout = 120

# BatchSpawnerBase configuration
c.BatchSpawnerBase.req_host = 'pbs'
c.BatchSpawnerBase.req_nprocs = '1'
c.BatchSpawnerBase.req_runtime = '12:00:00'
c.BatchSpawnerBase.req_memory = '1gb'
# TorqueSpawner configuration
c.TorqueSpawner.batch_script = '''#!/bin/sh
#PBS -l walltime={runtime}
#PBS -l nodes=1:ppn={nprocs}
#PBS -l mem={memory}
#PBS -N cylc-singleuser
#PBS -v {keepvars}
{cmd}
'''
root@cylc:/tmp/test#
kinow commented 5 years ago

Have cylc-uiserver installed via pip install -e . in pbs, so that cylc-singleuser is available

Had an issue with pip3.7 missing SSL.

pip is configured with locations that require TLS/SSL, however the ssl module in Python is not available

Tried other articles explaining how to fix this, without much luck. So just grabbed Anaconda Python 3.7 :grin: Then just pip install . and it downloaded all dependencies and installed cylc-singleuser.

kinow commented 5 years ago

Have cylc-web assets in pbs node at /opt/cylc-web/dist/.

Running npm (and an old npm at that) against cylc-web on an old CentOS 6 box is madness. Tried installing nvm, but the box also has an old curl and an older git.

Gave up and built a version of cylc-web, attached to this comment. As a side note, our current cylc-web build is 4.6 MB. We might need to look into what accounts for all these megabytes in the final project files, whether it can be reduced, etc.

cylc-web-built.tar.gz

bash-4.1# cd ..
bash-4.1# ls
Anaconda3-2019.03-Linux-x86_64.sh  cylc-dist  cylc-uiserver  cylc-web-built.tar.gz  pymp-rv_iv1ei  sshd-stderr---supervisor-Fe3XEg.log  sshd-stdout---supervisor-Bdi1hg.log  trqauthd-unix
bash-4.1# cylc-singleuser
usage: cylc-singleuser [-h] [-p PORT] -s STATIC
cylc-singleuser: error: the following arguments are required: -s/--static
bash-4.1# cylc-singleuser -p 8888 -s cylc-dist
kinow commented 5 years ago

Ok, so in theory we are good to go. Updated one line in jupyterhub_config.py:

c.Spawner.args = ['-s', '/tmp/cylc-dist/']
kinow commented 5 years ago

First try:

[W 2019-04-10 04:14:13.117 JupyterHub auth:642] Failed to open PAM session for testuser: [PAM Error 14] Cannot make/remove an entry for the specified session
[W 2019-04-10 04:14:13.117 JupyterHub auth:643] Disabling PAM sessions from now on.
[I 2019-04-10 04:14:13.128 JupyterHub batchspawner:188] Spawner submitting job using sudo -E -u testuser qsub
[I 2019-04-10 04:14:13.128 JupyterHub batchspawner:189] Spawner submitted script:
    #!/bin/sh
    #PBS -l walltime=12:00:00
    #PBS -l nodes=1:ppn=1
    #PBS -l mem=1gb
    #PBS -N cylc-singleuser
    #PBS -v JUPYTERHUB_SERVICE_PREFIX,JUPYTERHUB_API_URL,PATH,JUPYTERHUB_HOST,JUPYTERHUB_CLIENT_ID,JUPYTERHUB_API_TOKEN,JUPYTERHUB_OAUTH_CALLBACK_URL,JUPYTERHUB_BASE_URL,JPY_API_TOKEN,JUPYTERHUB_USER
    cylc-singleuser --ip="0.0.0.0" --port=51851 -s /tmp/cylc-dist/

[E 2019-04-10 04:14:13.135 JupyterHub user:477] Unhandled error starting testuser's server: /bin/sh: 1: sudo: not found
[D 2019-04-10 04:14:13.148 JupyterHub user:578] Deleting oauth client jupyterhub-user-testuser
[E 2019-04-10 04:14:13.166 JupyterHub web:1788] Uncaught exception GET /hub/user/testuser/ (::ffff:172.20.0.1)
    HTTPServerRequest(protocol='http', host='localhost:8000', method='GET', uri='/hub/user/testuser/', version='HTTP/1.1', remote_ip='::ffff:172.20.0.1')
    Traceback (most recent call last):
      File "/usr/local/lib/python3.5/dist-packages/tornado/web.py", line 1699, in _execute
        result = await result
      File "/usr/local/lib/python3.5/dist-packages/jupyterhub/handlers/base.py", line 1062, in get
        await self.spawn_single_user(user)
      File "/usr/local/lib/python3.5/dist-packages/jupyterhub/handlers/base.py", line 715, in spawn_single_user
        timedelta(seconds=self.slow_spawn_timeout), finish_spawn_future
      File "/usr/lib/python3.5/asyncio/futures.py", line 361, in __iter__
        yield self  # This tells Task to wait for completion.
      File "/usr/lib/python3.5/asyncio/tasks.py", line 296, in _wakeup
        future.result()
      File "/usr/lib/python3.5/asyncio/futures.py", line 274, in result
        raise self._exception
      File "/usr/lib/python3.5/asyncio/tasks.py", line 239, in _step
        result = coro.send(None)
      File "/usr/local/lib/python3.5/dist-packages/jupyterhub/handlers/base.py", line 636, in finish_user_spawn
        await spawn_future
      File "/usr/local/lib/python3.5/dist-packages/jupyterhub/user.py", line 489, in spawn
        raise e
      File "/usr/local/lib/python3.5/dist-packages/jupyterhub/user.py", line 409, in spawn
        url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
      File "/usr/lib/python3.5/asyncio/futures.py", line 361, in __iter__
        yield self  # This tells Task to wait for completion.
      File "/usr/lib/python3.5/asyncio/tasks.py", line 296, in _wakeup
        future.result()
      File "/usr/lib/python3.5/asyncio/futures.py", line 274, in result
        raise self._exception
      File "/usr/local/lib/python3.5/dist-packages/batchspawner/batchspawner.py", line 303, in start
        job = yield self.submit_batch_script()
      File "/usr/lib/python3.5/asyncio/futures.py", line 274, in result
        raise self._exception
      File "/usr/local/lib/python3.5/dist-packages/batchspawner/batchspawner.py", line 190, in submit_batch_script
        out = yield run_command(cmd, input=script, env=self.get_env())
      File "/usr/lib/python3.5/asyncio/futures.py", line 274, in result
        raise self._exception
      File "/usr/local/lib/python3.5/dist-packages/batchspawner/batchspawner.py", line 59, in run_command
        raise RuntimeError(eout)
    RuntimeError: /bin/sh: 1: sudo: not found

[D 2019-04-10 04:14:13.170 JupyterHub base:890] No template for 500
[E 2019-04-10 04:14:13.218 JupyterHub log:150] {
      "X-Forwarded-Host": "localhost:8000",
      "Accept-Encoding": "gzip, deflate",
      "Referer": "http://localhost:8000/hub/login?next=",
      "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0",
      "X-Forwarded-Proto": "http",
      "Accept-Language": "en-US,en;q=0.5",
      "Cookie": "jupyterhub-hub-login=\"2|1:0|10:1554869652|20:jupyterhub-hub-login|44:MzkyZGIxYTE1MmZlNDJiZjlkZDE4OGYxZjk5Y2UxMmY=|96284af39e5f764fdce4ff1abc9db7766614e101f5de4b2fae292ee2ecb5d99a\"; jenkins-timestamper-offset=-46800000; username-localhost-8888=\"2|1:0|10:1554238296|23:username-localhost-8888|44:NTNlNWQzMTNiZmIyNDBjZGEwYjlhOTQ5NzI0ZmZmMTM=|12dff019f0af888bfbb21fcc22edd2077fb3dd7afb3191d2b24f612a67ea0f78\"; _xsrf=2|024a1abb|1e0a6ff7e87d6aaa6505cc8f4620856d|1552952779; jupyterhub-session-id=1a884a7c9f4044bd8d57b9d73415c37c",
      "Upgrade-Insecure-Requests": "1",
      "Connection": "close",
      "X-Forwarded-Port": "8000",
      "Host": "localhost:8000",
      "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
      "X-Forwarded-For": "::ffff:172.20.0.1"
    }
[E 2019-04-10 04:14:13.218 JupyterHub log:158] 500 GET /hub/user/testuser/ (testuser@::ffff:172.20.0.1) 197.42ms
kinow commented 5 years ago

Second try after apt install sudo -y

[E 2019-04-10 04:15:44.980 JupyterHub user:477] Unhandled error starting testuser's server: sudo: qsub: command not found
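Both failures so far were missing executables on the hub host, so a small preflight check could catch them before starting JupyterHub. This is only a sketch: missing_commands is a made-up helper, and the command list is an assumption based on the "sudo -E -u testuser qsub" line in the log plus qstat and qdel.

```python
import shutil


def missing_commands(commands):
    """Return the commands that cannot be found on PATH (hypothetical helper)."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


# Assumed list: sudo and qsub appear in the spawner log; qstat and qdel are
# expected to be needed for polling and cancelling jobs.
required = ['sudo', 'qsub', 'qstat', 'qdel']
missing = missing_commands(required)
if missing:
    print('Missing on the hub host: ' + ', '.join(missing))
else:
    print('All batch commands found.')
```

Note that this checks the PATH of the user running the script, which may differ from the PATH seen under sudo.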
kinow commented 5 years ago

Third try, after apt install -y torque-client and then editing /var/spool/torque/server_name to change the server name to pbs.

Got a bit closer, but still no success.

Screenshot_2019-04-10_16-20-20

Spawn failed: pbs_iff: cannot read reply from pbs_server No Permission. qsub: cannot connect to server pbs (errno=15007) Unauthorized Request

The problem is that the cluster was built for Cylc to submit jobs by connecting via SSH and then running qsub on the remote pbs node.

kinow commented 5 years ago

Logs from server_logs in Torque PBS:

04/10/2019 05:44:22;0080;PBS_Server.1017;Req;dis_request_read;conflicting version numbers, 1 detected, 2 expected
04/10/2019 05:44:39;0001;PBS_Server.1011;Svr;PBS_Server;LOG_ERROR::Unknown node  (15064) in process_host_name_part, host docker not found
04/10/2019 05:45:12;0001;PBS_Server.1011;Svr;PBS_Server;LOG_ERROR::Unknown node  (15064) in process_host_name_part, host docker not found
kinow commented 5 years ago

Setting up the PBS client on the Ubuntu node was a bit trickier than I expected. The PBS server was a CentOS box with the adaptivecomputing/torque installation, version 5.0.0, while the Ubuntu torque-client package was version 2.x of PBS Torque, I think. The server_logs were showing the client version, and requests were failing due to lack of permission. I wasn't sure whether the issue was the client version, or whether I had forgotten to add operators/managers, allowed nodes, etc.

So I went with a different approach: a single CentOS 6 box with the PBS server, which basically meant following the instructions and running docker run -h docker.example.com -p 10022:22 -p 8000:8000 -i -t --name torque --privileged agaveapi/torque bash (the extra part is the -p 8000:8000, so that the JupyterHub port is exposed locally).

Before running the supervisord daemon, also needed:

yum install wget sudo
yum update curl
yum reinstall cracklib-dicts

wget $NVM_DOWNLOAD
nvm install 8.0.0
npm install -g configurable-http-proxy

passwd
passwd testuser

ln -s /usr/local/bin/qsub /usr/bin/qsub (then same for qstat and qdel)

Installed the Cylc packages locally, and then we had everything ready for tests.

UI submitting the PBS job in background via spawner:

Screenshot_2019-04-11_11-54-49

Message displaying the command/job setting used:

Screenshot_2019-04-11_11-55-01

Our jupyterhub_config.py contains the timeout in seconds. Here's what happens when the timeout expires:

timeout-output

Couldn't find anything in the PBS server logs. So looked at the current queues and nodes, and noticed that the nodes were all off:

no-queues

Fixed after I noticed the -h docker.example.com, which doesn't match the docker host. I added docker.example.com with number of processors=1 to the nodes file in the server_priv folder, restarted supervisord, and then got the following error:

args_error

That's because we use argparse's ArgumentParser.parse_args(), which fails if extra parameters are passed, such as the --ip option visible in the submitted command. Fixed by calling .parse_known_args() instead, which returns a tuple (known, unknown). This change was in the cylc_singleuser.py file, and I did not have to re-install anything as the pip installation was editable.
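A minimal sketch of the difference, not the actual cylc_singleuser.py code. The parser is reconstructed from the usage message earlier in this thread ([-h] [-p PORT] -s STATIC); the default port and the build_parser name are assumptions.

```python
import argparse


def build_parser():
    """Parser reconstructed from the usage message; defaults are assumed."""
    parser = argparse.ArgumentParser(prog='cylc-singleuser')
    parser.add_argument('-p', '--port', type=int, default=8888)
    parser.add_argument('-s', '--static', required=True)
    return parser


# Arguments as passed by the spawner, including the --ip option that the
# parser does not define.
argv = ['--ip=0.0.0.0', '--port=41453', '-s', '/tmp/cylc-dist/']

# parse_args(argv) would exit with an "unrecognized arguments" error here.
# parse_known_args() returns the parsed namespace plus the leftover arguments.
known, unknown = build_parser().parse_known_args(argv)
print(known.port, known.static)
print(unknown)
```

The leftover list can simply be ignored, or logged so that it stays visible which options the spawner adds.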

Et voila!

Screenshot_2019-04-11_12-20-30

It is also possible to see that the job is still running in PBS:

Screenshot_2019-04-11_13-14-45

kinow commented 5 years ago

And the processes as displayed with ps and qstat for reference:

bash-4.1# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  13472  4700 pts/0    Ss   00:35   0:00 bash
root       613  0.0  0.0  66688  5804 pts/0    S    00:51   0:00 /usr/sbin/sshd -D
root      2220  0.0  0.0  11492  2652 pts/1    Ss   01:04   0:00 /bin/bash
root      3498  0.0  0.2  99604 15208 pts/0    S    01:12   0:00 /usr/bin/python /usr/bin/supervisord
root      3501  0.5  1.1 114120 66920 pts/0    SLl  01:12   0:02 /usr/local/sbin/pbs_mom -D -L /var/log/supervisor/pbsmom.log
root      3504  0.1  0.3 774340 23448 pts/0    Sl   01:12   0:00 /usr/local/sbin/pbs_server -D -L /var/log/supervisor/pbsserver.log
root      3505  0.0  0.0  59612  5644 pts/0    S    01:12   0:00 /usr/local/sbin/trqauthd -D
root      3506  0.0  0.1  63036  7064 ?        Ss   01:12   0:00 /usr/local/sbin/pbs_sched -p /var/log/supervisor/pbssched.log -L /var/log/supervisor/pbssched.log
root      3643  1.4  0.7 680072 45976 pts/0    Sl+  01:18   0:00 /usr/local/bin/python3.7 /usr/local/python37/bin/jupyterhub
root      3648  0.5  0.7 1141456 42880 ?       Ssl  01:18   0:00 node /.nvm/versions/node/v8.0.0/bin/configurable-http-proxy --ip  --port 8000 --api-ip 127.0.0.1 --api-port 8001 --error-target http://127.0.0.1:8
testuser  3662  0.0  0.0 108220  3040 ?        Ss   01:18   0:00 -bash
testuser  3673  0.0  0.0 106116  2552 ?        S    01:18   0:00 /bin/sh /var/spool/torque/mom_priv/jobs/9.docker.example.com.SC
testuser  3674  1.7  0.5 222260 31220 ?        S    01:18   0:01 /usr/local/bin/python3.7 /usr/local/python37/bin/cylc-singleuser --ip=0.0.0.0 --port=41453 -s /tmp/cylc-dist/
root      3695  0.0  0.0  13380  1864 pts/1    R+   01:19   0:00 ps aux
bash-4.1#

bash-4.1# qstat
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
7.docker                   torque.submit    testuser               0 C debug          
9.docker                   cylc-singleuser  testuser        00:00:00 R debug
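
If we ever want to check the job state from a script rather than by eye, the qstat table above can be parsed with a few lines of Python. This is a rough sketch: parse_qstat is a made-up helper, and it assumes the default column layout shown here and job names without whitespace.

```python
QSTAT_OUTPUT = """\
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
7.docker                   torque.submit    testuser               0 C debug
9.docker                   cylc-singleuser  testuser        00:00:00 R debug
"""

FIELDS = ('job_id', 'name', 'user', 'time_use', 'state', 'queue')


def parse_qstat(text):
    """Turn the default qstat table into a list of dicts (hypothetical helper)."""
    jobs = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, the header row, and the dashed separator.
        if len(parts) != len(FIELDS) or line.startswith(('Job ID', '-')):
            continue
        jobs.append(dict(zip(FIELDS, parts)))
    return jobs


jobs = parse_qstat(QSTAT_OUTPUT)
# State R means running, C means completed.
running = [job for job in jobs if job['state'] == 'R']
print(running)
```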
matthewrmshin commented 5 years ago

:+1: Very cool setup. I am sure it will find a lot of uses at the various partner sites that we'll have to support.