dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Debugpy on AWS Fargate #8385

Open Vesyrak opened 10 months ago

Vesyrak commented 10 months ago

Describe the issue: When we launch the scheduler on our AWS Fargate instance, everything works as intended. However, when we launch the scheduler under debugpy to enable remote debugging, the dashboard fails to start about 90% of the time. This causes our cluster to fail, because we depend on the dashboard's health check to monitor cluster health. Once in a while it does boot correctly, but this is rare and appears to be random. We have configured the Fargate instance correctly for remote debugging, and in the cases where the dashboard does boot, we can successfully debug the Dask scheduler. The scheduler logs show no errors and claim that the dashboard has started.

Is there any way to check the dashboard's logs, or otherwise figure out the cause of this? We cannot reproduce the issue locally.
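One way to narrow this down might be to probe the dashboard from inside the container, independently of the Fargate health check, and log how long it takes to answer. The following is only a minimal sketch: it assumes the dashboard listens on the default port 8787 and exposes a `/health` route, and the function names `dashboard_is_healthy` and `wait_for_dashboard` are hypothetical helpers, not part of Dask.

```python
import http.client
import time
import urllib.error
import urllib.request

# Ignore any proxy settings from the environment: the probe targets a local port.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def dashboard_is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        with _opener.open(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Connection refused, timeout, non-2xx status, or a malformed reply.
        return False


def wait_for_dashboard(base_url: str, attempts: int = 30, delay: float = 1.0) -> bool:
    """Poll the health route, printing which attempt (if any) succeeded."""
    for attempt in range(1, attempts + 1):
        if dashboard_is_healthy(base_url):
            print(f"dashboard answered on attempt {attempt}")
            return True
        time.sleep(delay)
    print(f"dashboard did not answer after {attempts} attempts")
    return False
```

Calling `wait_for_dashboard("http://127.0.0.1:8787")` (address assumed) from a sidecar or a startup hook would show whether the dashboard never binds at all or merely comes up later than the health check allows.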

Environment:

mrocklin commented 10 months ago

Hi @Vesyrak

I don't know why things are sad when running with debugpy. I don't have experience with that project.

Getting logs when things don't work is typically handled by the system hosting Dask, in this case Fargate. Dask just writes logs to stdout/stderr, so you'll want to figure out what Fargate does with those. At Coiled (a managed Dask service) we tend to route logs to CloudWatch and then use the CloudWatch APIs to serve up those logs. Maybe you could do something similar?
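As a rough illustration of the CloudWatch route (a sketch under assumptions, not how Coiled actually does it): if the Fargate task's stdout/stderr already end up in a CloudWatch log group, recent scheduler output could be pulled with the CloudWatch Logs `filter_log_events` paginator. The helper name `fetch_recent_logs` and the log group name used below are hypothetical; the client is passed in so the function itself does not depend on how it was created.

```python
from datetime import datetime, timedelta, timezone


def fetch_recent_logs(logs_client, log_group, minutes=15, pattern=""):
    """Yield (timestamp, message) pairs from the last `minutes` of a log group.

    `logs_client` is expected to behave like boto3.client("logs").
    """
    start = datetime.now(timezone.utc) - timedelta(minutes=minutes)
    kwargs = {
        "logGroupName": log_group,
        # CloudWatch Logs expects milliseconds since the Unix epoch.
        "startTime": int(start.timestamp() * 1000),
    }
    if pattern:
        kwargs["filterPattern"] = pattern

    paginator = logs_client.get_paginator("filter_log_events")
    for page in paginator.paginate(**kwargs):
        for event in page.get("events", []):
            ts = datetime.fromtimestamp(event["timestamp"] / 1000, tz=timezone.utc)
            yield ts, event["message"].rstrip()
```

With boto3 installed and credentials configured, something like `fetch_recent_logs(boto3.client("logs"), "/ecs/dask-scheduler", pattern="dashboard")` would iterate over matching lines; the log group name here is made up and depends on the task definition's log configuration.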

hendrikmakait commented 9 months ago

@Vesyrak: Is there anything actionable for us to do here?