dask / dask-gateway

A multi-tenant server for securely deploying and managing Dask clusters.
https://gateway.dask.org/
BSD 3-Clause "New" or "Revised" License

Slurm Job Fails Due to Missing SSL Certificates When Creating Cluster using dask-gateway-server #705

Open woestler opened 1 year ago

woestler commented 1 year ago

While creating a cluster on an HPC system with Slurm and dask-gateway-server, I ran into a problem. My understanding of the process is as follows: when dask-gateway-server receives the new_cluster command from the client, it converts it into an sbatch command. I edited dask_gateway_server/backends/jobqueue/slurm.py to print the variables cmd, env, and script in get_submit_cmd_env_stdin; the output is as follows:

cmd


['/usr/bin/sbatch', '--parsable', '--job-name=dask-gateway', '--chdir=/home/dask/.dask-gateway/014af831909a4d8ab6b900b03fc9598d', '--output=/home/dask/.dask-gateway/014af831909a4d8ab6b900b03fc9598d/dask-scheduler-014af831909a4d8ab6b900b03fc9598d.log', '--cpus-per-task=2', '--mem=4096M', '--export=DASK_DISTRIBUTED__COMM__REQUIRE_ENCRYPTION,DASK_DISTRIBUTED__COMM__TLS__CA_FILE,DASK_DISTRIBUTED__COMM__TLS__SCHEDULER__CERT,DASK_DISTRIBUTED__COMM__TLS__SCHEDULER__KEY,DASK_GATEWAY_API_TOKEN,DASK_GATEWAY_API_URL,DASK_GATEWAY_CLUSTER_NAME']

env


{'DASK_DISTRIBUTED__COMM__REQUIRE_ENCRYPTION': 'True', 'DASK_GATEWAY_API_URL': 'http://local3:8000/api', 'DASK_GATEWAY_API_TOKEN': '3497e6f64a16424eae3b5545f151fb79', 'DASK_GATEWAY_CLUSTER_NAME': '014af831909a4d8ab6b900b03fc9598d', 'DASK_DISTRIBUTED__COMM__TLS__CA_FILE': '/home/dask/.dask-gateway/014af831909a4d8ab6b900b03fc9598d/dask.crt', 'DASK_DISTRIBUTED__COMM__TLS__SCHEDULER__KEY': '/home/dask/.dask-gateway/014af831909a4d8ab6b900b03fc9598d/dask.pem', 'DASK_DISTRIBUTED__COMM__TLS__SCHEDULER__CERT': '/home/dask/.dask-gateway/014af831909a4d8ab6b900b03fc9598d/dask.crt'}

script

#!/bin/sh
source /opt/dask-gateway/anaconda/bin/activate /opt/dask
dask-scheduler --protocol tls --port 0 --host 0.0.0.0 --dashboard-address 0.0.0.0:0 --preload dask_gateway.scheduler_preload --dg-api-address 0.0.0.0:0 --dg-heartbeat-period 15 --dg-adaptive-period 3.0 --dg-idle-timeout 0.0

When a Slurm node receives this job and starts executing it, the scheduler looks for the dask.crt and dask.pem files named in the environment variables above. If the job lands on a node other than the edge node, these files do not exist on that node, so the Slurm job fails with the following error:

2023-05-29 17:09:58,047 - distributed.preloading - INFO - Import preload module: dask_gateway.scheduler_preload
/opt/dask/lib/python3.10/site-packages/distributed/cli/dask_scheduler.py:140: FutureWarning: dask-scheduler is deprecated and will be removed in a future release; use `dask scheduler` instead
  warnings.warn(
2023-05-29 17:09:58,049 - distributed.scheduler - INFO - -----------------------------------------------
2023-05-29 17:09:58,050 - distributed.preloading - INFO - Creating preload: dask_gateway.scheduler_preload
2023-05-29 17:09:58,050 - distributed.preloading - INFO - Import preload module: dask_gateway.scheduler_preload
2023-05-29 17:09:58,050 - distributed.scheduler - INFO - End scheduler
Traceback (most recent call last):
  File "/opt/dask/bin/dask-scheduler", line 8, in <module>
    sys.exit(main())
  File "/opt/dask/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/dask/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/opt/dask/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/dask/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/dask/lib/python3.10/site-packages/distributed/cli/dask_scheduler.py", line 249, in main
    asyncio.run(run())
  File "/opt/dask/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/dask/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/dask/lib/python3.10/site-packages/distributed/cli/dask_scheduler.py", line 209, in run
    scheduler = Scheduler(
  File "/opt/dask/lib/python3.10/site-packages/distributed/scheduler.py", line 3464, in __init__
    self.connection_args = self.security.get_connection_args("scheduler")
  File "/opt/dask/lib/python3.10/site-packages/distributed/security.py", line 342, in get_connection_args
    "ssl_context": self._get_tls_context(tls, ssl.Purpose.SERVER_AUTH),
  File "/opt/dask/lib/python3.10/site-packages/distributed/security.py", line 299, in _get_tls_context
    ctx = ssl.create_default_context(purpose=purpose, cafile=ca)
  File "/opt/dask/lib/python3.10/ssl.py", line 766, in create_default_context
    context.load_verify_locations(cafile, capath, cadata)
FileNotFoundError: [Errno 2] No such file or directory
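
To confirm that the failure comes from the certificate files being absent on the compute node, a small check along these lines could be run inside a Slurm job before dask-scheduler starts. This is only a diagnostic sketch: the variable names are copied from the env output above, while the function names `missing_tls_files` and `report` are made up for illustration and are not part of dask-gateway.

```python
import os
import socket
import sys

# Environment variables taken from the `env` dict printed above.
TLS_VARS = (
    "DASK_DISTRIBUTED__COMM__TLS__CA_FILE",
    "DASK_DISTRIBUTED__COMM__TLS__SCHEDULER__CERT",
    "DASK_DISTRIBUTED__COMM__TLS__SCHEDULER__KEY",
)


def missing_tls_files(env=None):
    """Return {variable: path} for each TLS variable whose file is absent on this node.

    Hypothetical helper for debugging; a variable that is unset maps to None.
    """
    env = os.environ if env is None else env
    missing = {}
    for name in TLS_VARS:
        path = env.get(name)
        if not path or not os.path.isfile(path):
            missing[name] = path
    return missing


def report(env=None):
    """Print one line per missing file to stderr and return how many are missing."""
    problems = missing_tls_files(env)
    host = socket.gethostname()
    for name, path in problems.items():
        print(f"{name}: {path!r} not found on {host}", file=sys.stderr)
    return len(problems)
```

Calling `report()` from a job submitted with the same `--export` list should print nothing on a node that can see the staging directory, and list all three variables on a node that cannot, which would point at the staging directory not being on a filesystem shared with the compute nodes.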

@jcrist @consideRatio @TomAugspurger @jacobtomlinson @martindurant

selvavm commented 10 months ago

Hi, I am facing the same issue. Can someone please help with this? My understanding is that dask-gateway sets the environment variable to the staging location of dask.crt, but never copies dask.crt to that location.

selvavm commented 10 months ago

@woestler - Did you resolve this? @TomAugspurger, @jacobtomlinson, @martindurant - Any support would be much appreciated.

jlynchMicron commented 1 month ago

Could be related: https://github.com/dask/distributed/issues/4617