Azure / batch-shipyard

Simplify HPC and Batch workloads on Azure
MIT License
Failed to create GPU pool for NV vm #244

Closed by jeffw-wherethebitsroam 6 years ago

jeffw-wherethebitsroam commented 6 years ago

Problem Description

Adding a new pool with GPU instances (Standard_nv6) fails.

Batch Shipyard Version

shipyard.py, version 3.6.0b1

Steps to Reproduce

Try to create a pool with the pool config below

Expected Results

Successfully create a pool

Actual Results

Pool creation fails. Batch Shipyard tries to download the NVIDIA driver from gpudrivers.file.core.windows.net, and that hostname does not resolve.

Confirm agreement with License for Customer Use of NVIDIA Software @ http://www.nvidia.com/content/DriverDownload-March2009/licence.php?lang=us [y/n]: y
2018-10-31 08:22:07.103 INFO - NVIDIA Software License accepted
2018-10-31 08:22:07.104 DEBUG - downloading NVIDIA driver to nvidia-driver-grid.run
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/urllib3/connection.py", line 171, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "/usr/lib/python3.6/site-packages/urllib3/util/connection.py", line 56, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/lib/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name does not resolve

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 343, in _make_request
    self._validate_conn(conn)
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 849, in _validate_conn
    conn.connect()
  File "/usr/lib/python3.6/site-packages/urllib3/connection.py", line 314, in connect
    conn = self._new_conn()
  File "/usr/lib/python3.6/site-packages/urllib3/connection.py", line 180, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7f93fcb7cc88>: Failed to establish a new connection: [Errno -2] Name does not resolve

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/requests/adapters.py", line 445, in send
    timeout=timeout
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3.6/site-packages/urllib3/util/retry.py", line 398, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='gpudrivers.file.core.windows.net', port=443): Max retries exceeded with url: /nvinstance/Linux/NVIDIA-Linux-x86_64-390.75-grid.run?st=2018-04-03T01%3A34%3A00Z&se=2019-04-04T01%3A34%3A00Z&sp=rl&sv=2017-04-17&sr=s&sig=l3%2FQLZdtT5NL6BQTSOL5KsW%2FiKJK1Ly5iIi2PXpoaDU%3D (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f93fcb7cc88>: Failed to establish a new connection: [Errno -2] Name does not resolve',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/batch-shipyard/shipyard.py", line 2812, in <module>
    cli()
  File "/usr/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/decorators.py", line 64, in new_func
    return ctx.invoke(f, obj, *args[1:], **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/opt/batch-shipyard/shipyard.py", line 1453, in pool_add
    ctx.table_client, ctx.config)
  File "/opt/batch-shipyard/convoy/fleet.py", line 3150, in action_pool_add
    batch_client, blob_client, config
  File "/opt/batch-shipyard/convoy/fleet.py", line 1791, in _add_pool
    batch_client, blob_client, config)
  File "/opt/batch-shipyard/convoy/fleet.py", line 1331, in _construct_pool_object
    config, pool_settings.vm_size)
  File "/opt/batch-shipyard/convoy/fleet.py", line 407, in _setup_nvidia_driver_package
    _download_file('NVIDIA driver', pkg, _NVIDIA_DRIVER[gpu_type])
  File "/opt/batch-shipyard/convoy/fleet.py", line 370, in _download_file
    response = requests.get(dldict['url'], stream=True)
  File "/usr/lib/python3.6/site-packages/requests/api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/lib/python3.6/site-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python3.6/site-packages/requests/sessions.py", line 512, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3.6/site-packages/requests/sessions.py", line 622, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3.6/site-packages/requests/adapters.py", line 513, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='gpudrivers.file.core.windows.net', port=443): Max retries exceeded with url: /nvinstance/Linux/NVIDIA-Linux-x86_64-390.75-grid.run?st=2018-04-03T01%3A34%3A00Z&se=2019-04-04T01%3A34%3A00Z&sp=rl&sv=2017-04-17&sr=s&sig=l3%2FQLZdtT5NL6BQTSOL5KsW%2FiKJK1Ly5iIi2PXpoaDU%3D (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f93fcb7cc88>: Failed to establish a new connection: [Errno -2] Name does not resolve',))
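
The innermost error in the traceback above is a DNS failure (`socket.gaierror`), which `requests` eventually re-raises as a `ConnectionError`. A minimal sketch for checking whether a URL's host resolves before attempting a download, using only the standard library. The helper name `host_resolves` is hypothetical and is not part of Batch Shipyard:

```python
import socket
from urllib.parse import urlsplit


def host_resolves(url: str) -> bool:
    """Return True if the host in `url` can be resolved (hypothetical helper)."""
    parts = urlsplit(url)
    host = parts.hostname
    if host is None:
        # No host component at all, e.g. a bare string without a scheme.
        return False
    port = parts.port or 443  # assume HTTPS when no port is given
    try:
        # Same lookup that fails at the bottom of the traceback.
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return False
    return True
```

Calling `host_resolves(...)` with the driver URL from the log would be expected to return `False` on a machine where that hostname fails to resolve, which separates a DNS problem from an expired SAS token or a network block.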

Redacted Configuration

pool_specification:
  id: test
  vm_configuration:
    platform_image:
      offer: UbuntuServer
      publisher: Canonical
      sku: 16.04-LTS
  vm_count:
    dedicated: 1
    low_priority: 0
  vm_size: Standard_nv6
  ssh:
    username: shipyard
  virtual_network:
    arm_subnet_id: /subnet/id

Additional Logs

Additional Comments

jeffw-wherethebitsroam commented 6 years ago

This all works if I use Standard_nc6, so the problem seems to be limited to the NV virtual machine types.

alfpark commented 6 years ago

The GPU driver URL for NV-series VMs has changed. The updated URL is already in the develop branch. If you don't want to use that branch, you can temporarily modify your pool configuration to include:

pool_specification:
  # ... other settings
  gpu:
    nvidia_driver:
      source: 'https://go.microsoft.com/fwlink/?linkid=874272'

The fix will be included in the next release.
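
For anyone generating pool configurations programmatically, the override above can be sketched as a small dict merge. This assumes the pool configuration has already been loaded into a plain Python dict (for example with a YAML parser); the function name `with_nvidia_driver_source` is hypothetical and not part of Batch Shipyard. The key names and URL are the ones from the snippet above:

```python
import copy


def with_nvidia_driver_source(pool_config: dict, url: str) -> dict:
    """Return a copy of `pool_config` with gpu.nvidia_driver.source set.

    Hypothetical helper: it only mirrors the YAML override shown above.
    """
    updated = copy.deepcopy(pool_config)  # leave the caller's dict untouched
    spec = updated.setdefault("pool_specification", {})
    driver = spec.setdefault("gpu", {}).setdefault("nvidia_driver", {})
    driver["source"] = url
    return updated


config = {"pool_specification": {"id": "test", "vm_size": "Standard_nv6"}}
patched = with_nvidia_driver_source(
    config, "https://go.microsoft.com/fwlink/?linkid=874272"
)
```

Working on a deep copy makes the override easy to drop again once a release containing the fix is in use.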

jeffw-wherethebitsroam commented 6 years ago

As it turns out, the NC series is what I actually need, so this is not a big problem for me.

alfpark commented 6 years ago

Fixed in the 3.6.0 release.