dask / dask-ec2

Start a cluster in EC2 for dask.distributed

SSL: WRONG_VERSION_NUMBER + Ubuntu 16 #98

Open lyterk opened 7 years ago

lyterk commented 7 years ago

Re-open of #38

I've been hacking away at this issue without much success. AWS now has a deep learning AMI for Ubuntu 16 that would save us a whole bunch of time, so I've been trying to figure out how to make this work. I'd be happy to open a pull request once I get things working, but I could use some direction.

What about Ubuntu 16 is different in how it handles certs that causes this?

What different configurations should I try that would make the problem more tractable?

Stack trace:

SSLError                                  Traceback (most recent call last)
/home/ubuntu/dask-ec2/dask_ec2/cluster.py in get_pepper_client(self)
     54                 self._pepper = libpepper.Pepper(url, ignore_ssl_errors=True)
---> 55                 self._pepper.login('saltdev', 'saltdev', 'pam')
     56             except Exception:

/home/ubuntu/dask-ec2/dask_ec2/libpepper.py in login(self, username, password, eauth)
    286                                         'password': password,
--> 287                                         'eauth': eauth}).get('return', [{}])[0]
    288 

/home/ubuntu/dask-ec2/dask_ec2/libpepper.py in req(self, path, data)
    130                 # con.verify_mode = ssl.CERT_NONE
--> 131                 f = urlopen(req, context=con)
    132             else:

/home/ubuntu/anaconda3/lib/python3.6/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    222         opener = _opener
--> 223     return opener.open(url, data, timeout)
    224 

/home/ubuntu/anaconda3/lib/python3.6/urllib/request.py in open(self, fullurl, data, timeout)
    525 
--> 526         response = self._open(req, data)
    527 

/home/ubuntu/anaconda3/lib/python3.6/urllib/request.py in _open(self, req, data)
    543         result = self._call_chain(self.handle_open, protocol, protocol +
--> 544                                   '_open', req)
    545         if result:

/home/ubuntu/anaconda3/lib/python3.6/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
    503             func = getattr(handler, meth_name)
--> 504             result = func(*args)
    505             if result is not None:

/home/ubuntu/anaconda3/lib/python3.6/urllib/request.py in https_open(self, req)
   1360             return self.do_open(http.client.HTTPSConnection, req,
-> 1361                 context=self._context, check_hostname=self._check_hostname)
   1362 

/home/ubuntu/anaconda3/lib/python3.6/urllib/request.py in do_open(self, http_class, req, **http_conn_args)
   1320                 raise URLError(err)
-> 1321             r = h.getresponse()
   1322         except:

/home/ubuntu/anaconda3/lib/python3.6/http/client.py in getresponse(self)
   1330             try:
-> 1331                 response.begin()
   1332             except ConnectionError:

/home/ubuntu/anaconda3/lib/python3.6/http/client.py in begin(self)
    296         while True:
--> 297             version, status, reason = self._read_status()
    298             if status != CONTINUE:

/home/ubuntu/anaconda3/lib/python3.6/http/client.py in _read_status(self)
    257     def _read_status(self):
--> 258         line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
    259         if len(line) > _MAXLINE:

/home/ubuntu/anaconda3/lib/python3.6/socket.py in readinto(self, b)
    585             try:
--> 586                 return self._sock.recv_into(b)
    587             except timeout:

/home/ubuntu/anaconda3/lib/python3.6/ssl.py in recv_into(self, buffer, nbytes, flags)
   1001                   self.__class__)
-> 1002             return self.read(nbytes, buffer)
   1003         else:

/home/ubuntu/anaconda3/lib/python3.6/ssl.py in read(self, len, buffer)
    864         try:
--> 865             return self._sslobj.read(len, buffer)
    866         except SSLError as x:

/home/ubuntu/anaconda3/lib/python3.6/ssl.py in read(self, len, buffer)
    624         if buffer is not None:
--> 625             v = self._sslobj.read(len, buffer)
    626         else:

SSLError: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:2178)
danielfrg commented 7 years ago

Maybe just need to update PyOpenSSL here: https://github.com/dask/dask-ec2/blob/24102d404696148cbd8a1e084614dac7276d047e/dask_ec2/salt.py#L190

Or provide valid certs here: https://github.com/dask/dask-ec2/blob/516b83d479066b4b510650b64c0b1864b43e4a6f/dask_ec2/templates/rest_cherrypy.conf#L3-L4

dmacd commented 7 years ago

Also just hit this. Our group is standardized on Ubuntu 16, so going back to 14 is not a real option. It's unclear to me what the issue really is or how we could work around it. Any suggestions?

lyterk commented 7 years ago

So far, I've tried:

This is proving to be a larger problem for us because decent AMI packages for e.g. Tensorflow, CUDA are standardizing around 16.04, and 14.04 is increasingly problematically stale. Also, configuring those manually is quite time-intensive.

lyterk commented 6 years ago

Update: Been digging around with the configuration of saltstack and trying to make any ssl-validated request work from localhost on the child node. I've been swapping in the requests library.

from requests.adapters import HTTPAdapter
from requests.packages.urllib3.poolmanager import PoolManager
import requests
import ssl

# First attempt: let the library negotiate the TLS version.
class MyAdapter(HTTPAdapter):
    # https://lukasa.co.uk/2013/01/Choosing_SSL_Version_In_Requests/
    def init_poolmanager(self, connections, maxsize, block=False):
        self.poolmanager = PoolManager(num_pools=connections,
                                       maxsize=maxsize,
                                       block=block,
                                       ssl_version=ssl.PROTOCOL_TLS)

# Second attempt (redefines MyAdapter): pin TLS 1.2 and pass the local cert files.
class MyAdapter(HTTPAdapter):
    def init_poolmanager(self, connections, maxsize, block=False):
        self.poolmanager = PoolManager(num_pools=connections,
                                       maxsize=maxsize,
                                       block=block,
                                       cert_file="/etc/pki/tls/certs/localhost.key",
                                       ca_certs="/etc/pki/tls/certs/localhost.crt",
                                       cert_reqs="CERT_REQUIRED",
                                       ssl_version=ssl.PROTOCOL_TLSv1_2)

# Route every https:// request in this session through the adapter.
s = requests.Session()
s.mount("https://", MyAdapter())
url = "https://localhost:8000/login"
headers = {
    'Accept': 'application/json',
    'Content-Type': 'application/json',
    'X-Requested-With': 'XMLHttpRequest',
}
req = s.get(url, headers=headers, verify="/etc/pki/tls/certs/localhost.crt", auth=("saltdev", "saltdev"))

Still returns SSLError: [SSL: WRONG_VERSION_NUMBER].

OpenSSL investigations:

> openssl s_client -connect localhost:8000

Returns, among other things:

New, TLSv1/SSLv3, Cipher is ECDHE-RSA-AES256-GCM-SHA384

pitrou commented 6 years ago

Sorry, but what is "localhost:8000" here and how is it related to EC2 or Amazon?

lionfish0 commented 6 years ago

So after a lot of digging, etc, I realised that this issue is probably the basis of the problem:

To check whether this is indeed the problem, on the server (on AWS) I uninstalled salt, downgraded cherrypy to version 3.2.3 and then reinstalled salt* (then rebooted for good measure):

sudo apt-get remove salt-api
sudo pip uninstall cherrypy
sudo pip install cherrypy==3.2.3
sudo apt-get install salt-api

I could test this using the openssl command:

openssl s_client -connect 54.194.146.93:8000 -debug

previous output:

read from 0x17bcdb0 [0x17e7993] (5 bytes => 5 (0x5))
0000 - 48 54 54 50 2f                                    HTTP/
write to 0x17bcdb0 [0x17ebee3] (31 bytes => 31 (0x1F))
0000 - 15 03 03 00 1a e5 a0 62-98 dd 8a e6 6f 02 b8 08   .......b....o...
0010 - 6b 9d eb a2 bf 8b ff aa-88 ec 0d dd 77 97 94      k...........w..
140689769182872:error:1408F10B:SSL routines:SSL3_GET_RECORD:wrong version number:s3_pkt.c:365:
write to 0x17bcdb0 [0x17ebee3] (31 bytes => 31 (0x1F))
0000 - 15 03 03 00 1a e5 a0 62-98 dd 8a e6 70 1b 91 02   .......b....p...
0010 - 35 c3 43 89 bb bd d7 e9-d8 41 c4 48 08 32 47      5.C......A.H.2G

output with change on server:

read from 0xc0cdb0 [0xc37993] (5 bytes => 0 (0x0))
read:errno=0
write to 0xc0cdb0 [0xc3bee3] (31 bytes => 31 (0x1F))
0000 - 15 03 03 00 1a 32 b1 57-5e ee 5e 4b 0a 2e 2d ec   .....2.W^.^K..-.
0010 - a6 ca a5 eb c9 e9 ce 10-f5 f8 a5 d2 2b 07 66      ............+.f

I guess that means it's working?
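One way to read the two dumps above: in the "previous output" the first 5 bytes the client read were the ASCII text "HTTP/", i.e. a plaintext HTTP reply arrived where a TLS record header was expected. A TLS record header carries the protocol version in bytes 2 and 3, so "HT" "TP" decodes to a nonsense version, which would explain "wrong version number". In the second dump the plaintext reply is gone, although a 0-byte read on its own does not confirm a completed handshake. Below is a minimal Python 3 sketch of that classification; `classify_record_header` is a hypothetical helper written for this thread, not part of OpenSSL, salt or dask-ec2.

```python
# TLS record content types (first byte of every TLS record header).
TLS_CONTENT_TYPES = {
    0x14: "change_cipher_spec",
    0x15: "alert",
    0x16: "handshake",
    0x17: "application_data",
}


def classify_record_header(header):
    """Classify the first 5 bytes a peer sent on a supposedly-TLS connection."""
    if len(header) < 5:
        # Fewer than 5 bytes (including an empty read) is not a full header.
        return "incomplete"
    if header.startswith(b"HTTP/"):
        # A plaintext HTTP reply where a TLS record was expected.
        return "plaintext-http"
    content_type, major = header[0], header[1]
    # All SSL 3.0 / TLS 1.x records use major version 0x03.
    if content_type in TLS_CONTENT_TYPES and major == 0x03:
        return "tls-" + TLS_CONTENT_TYPES[content_type]
    return "unknown"


if __name__ == "__main__":
    # Byte values copied from the dumps above.
    print(classify_record_header(bytes.fromhex("485454502f")))  # what the client read before
    print(classify_record_header(bytes.fromhex("150303001a")))  # what the client wrote
    print(classify_record_header(b""))                          # the 0-byte read after the change
```

The three calls print `plaintext-http`, `tls-alert` and `incomplete` respectively.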

I copied a code snippet from the libpepper.py file in dask_ec2, to reproduce the error:

import ssl
from urllib.request import HTTPHandler, Request, urlopen, install_opener, build_opener
from urllib.error import HTTPError, URLError
import urllib.parse as urlparse
con = ssl.SSLContext(ssl.PROTOCOL_SSLv23)
req = Request('https://54.194.146.93:8000/login')
urlopen(req,context=con)

Before, this would produce the SSLV3_ALERT_HANDSHAKE_FAILURE or wrong version number errors, but now it doesn't fail: <Response [200]>
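For anyone re-running the snippet above on a newer Python: `ssl.PROTOCOL_SSLv23` is a deprecated alias of `ssl.PROTOCOL_TLS`, and a bare `ssl.SSLContext(...)` built that way does not verify certificates by default. A rough equivalent that makes the same choices explicit might look like the sketch below. It assumes Python 3.7+ (for `minimum_version`) and a server that supports TLS 1.2; `make_unverified_context` is a hypothetical name, not something in libpepper.

```python
import ssl


def make_unverified_context():
    """Build a client-side TLS context that skips certificate checks.

    Only suitable for testing against a self-signed endpoint.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # check_hostname must be disabled before verify_mode can be set to CERT_NONE.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    # Refuse anything older than TLS 1.2 (assumption: the server supports it).
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


if __name__ == "__main__":
    ctx = make_unverified_context()
    print(ctx.verify_mode, ctx.check_hostname, ctx.minimum_version)
```

It would be passed the same way as before, e.g. `urlopen(req, context=make_unverified_context())`.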

The only problem is how to have the ubuntu install etc with the older version of cherrypy. I think for myself I'll build a xenial image on AWS with this corrected? Hopefully I can point dask at that? If my understanding of the above is correct and I'm right in my conclusions and fix, maybe building an appropriate image in each region is the way to go? (at least until salt is fixed).

Hopefully the above is useful - sorry if I'm wrong! Hopefully it's useful anyway :)

*warning: I don't know if there are security related bugs in cherrypy that I could be reintroducing here?

edit: I altered dask_ec2/salt.py, and just told it to pip install the 3.2.3 version of cherrypy...

    @retry(retries=3, wait=0)
    def __install_salt_rest_api():
        cmd = "pip install cherrypy==3.2.3"
        ret = master.exec_command(cmd, sudo=True)
        if ret["exit_code"] != 0:
            raise Exception(ret["stderr"].decode('utf-8'))

I think this now works with ubuntu 16.04, without any other changes.

It could do with some testing from other people - e.g. on different versions of ubuntu or using different images, etc.
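The `@retry(retries=3, wait=0)` decorator in the snippet above comes from dask_ec2 itself. For readers who haven't looked at it, here is a sketch of what a decorator with that signature could do. This is an assumption for illustration, not dask_ec2's actual implementation: in particular it treats `retries` as the total number of attempts and assumes `retries >= 1`.

```python
import functools
import time


def retry(retries=3, wait=0):
    """Call the wrapped function up to `retries` times, sleeping `wait` seconds between tries."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_error = None
            for attempt in range(retries):
                try:
                    return func(*args, **kwargs)
                except Exception as err:
                    last_error = err
                    # Only sleep if another attempt is still to come.
                    if attempt < retries - 1:
                        time.sleep(wait)
            # Every attempt failed: surface the last exception.
            raise last_error
        return wrapper
    return decorator


if __name__ == "__main__":
    calls = []

    @retry(retries=3, wait=0)
    def flaky_install():
        # Stand-in for master.exec_command(...): fails twice, then succeeds.
        calls.append(1)
        if len(calls) < 3:
            raise Exception("simulated non-zero exit code")
        return "installed"

    print(flaky_install(), "after", len(calls), "attempts")
```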

lionfish0 commented 6 years ago

I've documented my installation procedure etc here, if it's useful!

jpoullet2000 commented 6 years ago

@lionfish0, thanks for your fixes. I still have some issues with your procedure (see below):

Installing scheduler
+---------+----------------------+-----------------+
| Node ID | # Successful actions | # Failed action |
+=========+======================+=================+
| node-0  | 19                   | 4               |
+---------+----------------------+-----------------+
Failed states for 'node-0'
  file | dask-scheduler.conf | /etc/supervisor/conf.d//dask-scheduler.conf | managed: One or more requisite failed: dask.distributed.correct_perms
  file | correct_perms | /opt/anaconda/ | directory: An exception occurred in this state: Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/salt/state.py", line 1878, in call
    **cdata['kwargs'])
  File "/usr/lib/python2.7/dist-packages/salt/loader.py", line 1823, in wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/salt/states/file.py", line 3098, in directory
    full, ret, user, group, file_mode, None, follow_symlinks)
  File "/usr/lib/python2.7/dist-packages/salt/modules/file.py", line 4397, in check_perms
    perms['lattrs'] = ''.join(lsattr(name).get('name', ''))
  File "/usr/lib/python2.7/dist-packages/salt/modules/file.py", line 552, in lsattr
    raise SaltInvocationError("File or directory does not exist.")
SaltInvocationError: File or directory does not exist.

  cmd | dask-scheduler-update-supervisor | /usr/bin/supervisorctl -c /etc/supervisor/supervisord.conf update && sleep 2 | wait: One or more requisite failed: dask.distributed.scheduler.dask-scheduler.conf
  supervisord | dask-scheduler-running | dask-scheduler | running: One or more requisite failed: dask.distributed.scheduler.dask-scheduler-update-supervisor, dask.distributed.correct_perms, dask.distributed.scheduler.dask-scheduler.conf
lionfish0 commented 6 years ago

I've started having the same problem too - I think something else has been updated which has caused the above new error.

As it says on the dask-ec2 readme, this project's now deprecated - and so I didn't try fixing the new bug. I tried for a while using kubernetes, but it's quite a pain to set up (not well documented yet maybe) and is serious overkill for what I want. So instead...

I've written a replacement for dask-ec2, which I've called daskec2lite (https://github.com/lionfish0/daskec2lite).

It needs a little bit more work but is nearly finished - I'll hopefully have some time later in the year to get it to a more 'release' state, but feel free to use it (it currently just makes spot instances, and there's probably other limitations, but hopefully it'll be useful to you). Feel free to add issues/feature-requests or pull requests.

mrocklin commented 6 years ago

If you think that daskec2lite is a good replacement for dask-ec2, I recommend making it more visible, first by raising an issue that asks people to investigate it, and then perhaps with a PR to the README.

lionfish0 commented 6 years ago

I wasn't sure if my failure to use kubernetes etc was just my own incompetence, but I was in a hurry and I needed something - so quickly cobbled together daskec2lite. I'm not sure if it's the best path for people to go down (presumably something that is more cross-cloud-platform would be better), and it needs a little bit more work before advising lots of people to use it. Maybe depending on feedback from a few users I'll see if it's worth finishing and supporting properly... @jpoullet2000 if you do try it - please let me know what works/doesn't.

Thanks @mrocklin, if I go ahead with it as a proper project, I'll make a PR to your README in late June (by then I'll have fixed bugs etc). Great work with dask etc, btw. Thanks!

jpoullet2000 commented 6 years ago

Thx. I'll have a look and let you know.

jpoullet2000 commented 6 years ago

After a quick test, here is the error I get:

(dasklite) jbp@jbp-XPS-L521X:~$ daskec2lite --pathtokeyfile ~/.ssh/datascience.pem --keyname datascience --username ubuntu --numinstances 2 --instancetype c4.2xlarge --region eu-west-1 --imageid ami-c8b51fb1 --wpi 2 --sgid sg-c18336bc --spotprice 3
Traceback (most recent call last):
  File "/home/jbp/miniconda3/envs/dasklite/bin/daskec2lite", line 11, in <module>
    sys.exit(main())
  File "/home/jbp/miniconda3/envs/dasklite/lib/python3.6/site-packages/daskec2lite/daskec2lite.py", line 180, in main
    imageid=args.imageid,keyname=args.keyname,spotprice=args.spotprice,region_name=args.region_name)  
  File "/home/jbp/miniconda3/envs/dasklite/lib/python3.6/site-packages/daskec2lite/daskec2lite.py", line 28, in start_cluster
    'SecurityGroupIds': [ sgid ]
  File "/home/jbp/miniconda3/envs/dasklite/lib/python3.6/site-packages/botocore/client.py", line 314, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/jbp/miniconda3/envs/dasklite/lib/python3.6/site-packages/botocore/client.py", line 612, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (InvalidGroup.NotFound) when calling the RequestSpotInstances operation: The security group 'sg-9146afe9' does not exist in VPC 'vpc-a72c1ec0'
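Reading the command line and the error together: the command passes `--sgid sg-c18336bc`, while the error names `sg-9146afe9`. That suggests, though the trace alone doesn't prove it, that the group id from the command line was not the one sent to AWS. Below is a small sketch of that comparison. `group_in_error` and `requested_group_was_used` are hypothetical helpers for illustration, not part of daskec2lite or botocore, and the regex only assumes the message wording shown in the trace.

```python
import re

# Matches the wording of the InvalidGroup.NotFound message in the trace above.
ERROR_PATTERN = re.compile(
    r"The security group '(?P<sgid>sg-[0-9a-f]+)' "
    r"does not exist in VPC '(?P<vpc>vpc-[0-9a-f]+)'"
)


def group_in_error(message):
    """Extract (security group id, VPC id) from the error message, or None if it doesn't match."""
    match = ERROR_PATTERN.search(message)
    if match is None:
        return None
    return match.group("sgid"), match.group("vpc")


def requested_group_was_used(requested_sgid, message):
    """True if the group named in the error is the one passed on the command line."""
    found = group_in_error(message)
    return found is not None and found[0] == requested_sgid


if __name__ == "__main__":
    message = (
        "An error occurred (InvalidGroup.NotFound) when calling the "
        "RequestSpotInstances operation: The security group 'sg-9146afe9' "
        "does not exist in VPC 'vpc-a72c1ec0'"
    )
    print(group_in_error(message))
    print(requested_group_was_used("sg-c18336bc", message))
```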
lionfish0 commented 6 years ago

As this is for a different project, I've copied the issue over, thanks @jpoullet2000!

lionfish0 commented 6 years ago

@jpoullet2000 by the way, the bug you describe should now be fixed in daskec2lite.