coiled / feedback

A place to provide Coiled feedback

RAPIDS release 23.04 (problem with docker based clusters) #239

Closed jacobtomlinson closed 1 year ago

jacobtomlinson commented 1 year ago

RAPIDS 23.04 is out.

I've just tried to spin up a Coiled cluster via both the Docker and conda/mamba methods.

Conda

import coiled

coiled.create_software_environment(
    name="rapids-stable-23-04-conda",
    account="dask",
    gpu_enabled=True,
    conda={
        "channels": ["rapidsai", "conda-forge", "nvidia"],
        "dependencies": ["rapids=23.04", "python=3.10", "cudatoolkit=11.8"],
    },
)

Docker

import coiled

coiled.create_software_environment(
    name="rapids-stable-23-04-docker",
    account="dask",
    container="nvcr.io/nvidia/rapidsai/rapidsai-core:23.04-cuda11.8-runtime-ubuntu22.04-py3.10",
)

Both environments were created, but when used in a cluster only the conda method works; the Docker-based approach fails and I see the following in the logs.


distributed.preloading - INFO - Downloading preload at https://cloud.coiled.io/api/v2/cluster_facing/preload/worker
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/rapids/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/distributed/cli/dask_spec.py", line 46, in <module>
    main()
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/distributed/cli/dask_spec.py", line 42, in main
    asyncio.run(run())
  File "/opt/conda/envs/rapids/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/envs/rapids/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/distributed/cli/dask_spec.py", line 36, in run
    servers = await run_spec(_spec, *args)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/distributed/deploy/spec.py", line 680, in run_spec
    workers[k] = cls(*args, **d.get("opts", {}))
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask_cuda/cuda_worker.py", line 192, in __init__
    self.nannies = [
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask_cuda/cuda_worker.py", line 193, in <listcomp>
    Nanny(
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/distributed/nanny.py", line 202, in __init__
    self.preloads = preloading.process_preloads(
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/distributed/preloading.py", line 244, in process_preloads
    return [
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/distributed/preloading.py", line 245, in <listcomp>
    Preload(dask_server, p, argv, file_dir)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/distributed/preloading.py", line 191, in __init__
    self.module = _download_module(name)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/distributed/preloading.py", line 150, in _download_module
    exec(compiled, module.__dict__)
  File "https://cloud.coiled.io/api/v2/cluster_facing/preload/worker", line 105, in <module>
  File "/opt/conda/envs/rapids/lib/python3.10/multiprocessing/pool.py", line 930, in __init__
    Pool.__init__(self, processes, initializer, initargs)
  File "/opt/conda/envs/rapids/lib/python3.10/multiprocessing/pool.py", line 215, in __init__
    self._repopulate_pool()
  File "/opt/conda/envs/rapids/lib/python3.10/multiprocessing/pool.py", line 306, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
  File "/opt/conda/envs/rapids/lib/python3.10/multiprocessing/pool.py", line 329, in _repopulate_pool_static
    w.start()
  File "/opt/conda/envs/rapids/lib/python3.10/multiprocessing/dummy/__init__.py", line 51, in start
    threading.Thread.start(self)
  File "/opt/conda/envs/rapids/lib/python3.10/threading.py", line 935, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
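The failing frame is the preload constructing a `multiprocessing.dummy` pool, which is a thread pool. As a minimal diagnostic you can run inside a container (a hedged sketch of the same call path, not the actual Coiled preload code):

```python
# Minimal repro of the failing call path: multiprocessing.dummy's Pool is
# a thread pool, so constructing it starts worker threads immediately.
# In the broken Ubuntu 22.04 container this raises
# "RuntimeError: can't start new thread"; elsewhere it works normally.
from multiprocessing.dummy import Pool

with Pool(4) as pool:
    results = pool.map(lambda x: x * x, range(4))

print(results)  # [0, 1, 4, 9] when threads can start
```

If this snippet fails inside the image but works on the host, thread creation is being blocked by the container runtime rather than by the Python environment.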

Could someone take a look at this?

Sidenote: The conda method is soooooo much faster, awesome stuff. Maybe it's worth switching the default example at the top of the GPU page to use that instead.

ntabris commented 1 year ago

For reasons yet to be determined, Ubuntu 22 images are causing that error. We figured this out recently, and a "known issue" explanation is going into the docs very soon.

ntabris commented 1 year ago

> reasons yet to be determined

To be explicit: most likely seccomp, with a good chance it's the clone3 syscall. I need to try the change and verify everything works.
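For context (my summary, not stated in the thread): glibc 2.34 and later start threads via the clone3 syscall, and Ubuntu 22.04 ships glibc 2.35; a seccomp profile that predates clone3 can reject the syscall, which Python surfaces as "can't start new thread". A quick hedged check of which libc an image uses:

```python
# Hedged check: print the C library and version for this environment.
# glibc >= 2.34 (Ubuntu 22.04 ships 2.35) uses the clone3 syscall to
# start threads; a seccomp profile that doesn't know clone3 can reject
# it, which Python reports as "RuntimeError: can't start new thread".
import platform

libc, version = platform.libc_ver()
print(libc or "unknown", version or "unknown")
```

Running this inside the Ubuntu 20.04 and 22.04 images should show the glibc version difference that makes one image hit the seccomp rejection and the other not.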

jacobtomlinson commented 1 year ago

Thanks @ntabris, I can confirm that using our 20.04 based image works as expected.

import coiled

coiled.create_software_environment(
    name="rapids-stable-23-04-docker",
    account="dask",
    container="nvcr.io/nvidia/rapidsai/rapidsai-core:23.04-cuda11.8-runtime-ubuntu20.04-py3.10",
)

Should be good to update your docs with the new versions, though. I'll make a note in our docs that RAPIDS makes Ubuntu 22.04 the default base, but that 20.04 is still available and is required for Coiled for the time being.

dchudz commented 1 year ago

@scharlottej13 docs update please when you get a minute

ntabris commented 1 year ago

FYI @jacobtomlinson, we'll deploy the fix so Ubuntu 22 works in the next day or two. I can open an issue or PR on rapidsai/deployment when this is out.