dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License
1.57k stars 718 forks source link

Include server address in all logs #6725

Open gjoseph92 opened 2 years ago

gjoseph92 commented 2 years ago

When viewing logs in which multiple workers (and/or nannies) are mixed into the same stream (for example, all tests in CI), it's often not possible to tell which worker produced a message.

Including the address of the current server in all log messages would make debugging easier in these cases.

This is probably rather tricky because:

  1. our loggers are all per-file global variables, not associated with a particular instance
  2. getting the current instance in an async context is broken: https://github.com/dask/distributed/issues/4959

One of those two things would probably need to be refactored to make this happen.

mrocklin commented 2 years ago

This would also make logs more verbose. Not necessarily enough reason not to do this, but it's worth considering. Mostly I want us to make sure that we want to do this before we decide how we could accomplish it.

On Thu, Jul 14, 2022, 1:30 PM Gabe Joseph @.***> wrote:

When viewing logs in which multiple workers (and/or nannies) are mixed into the same stream (for example, all tests in CI), it's often not possible to tell which worker produced a message.

Including the address of the current server in all log messages would make debugging easier in these cases.

This is probably rather tricky because:

  1. our loggers are all per-file global variables, not associated with a particular instance
  2. getting the current instance in an async context is broken: #4959 https://github.com/dask/distributed/issues/4959

One of those two things would probably need to be refactored to make this happen.

— Reply to this email directly, view it on GitHub https://github.com/dask/distributed/issues/6725, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACKZTFSMNIX2RNNBIYN2V3VUBMCPANCNFSM53TE57ZQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>

gjoseph92 commented 2 years ago

Instead of

2022-07-14 19:24:27,055 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:54646
2022-07-14 19:24:27,065 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-d0c85403-35f8-42b3-82d6-d309a44eff3b Address tcp://127.0.0.1:54646 Status: Status.closing
2022-07-14 19:24:27,077 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:54638
2022-07-14 19:24:27,075 - distributed.nanny - INFO - Nanny asking worker to close
2022-07-14 19:24:27,065 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-d0c85403-35f8-42b3-82d6-d309a44eff3b Address tcp://127.0.0.1:54646 Status: Status.closing
2022-07-14 19:24:27,076 - distributed.nanny - INFO - Nanny asking worker to close
2022-07-14 19:24:27,076 - distributed.nanny - INFO - Nanny asking worker to close

you'd have

2022-07-14 19:24:27,055 - distributed.worker tcp://127.0.0.1:54646 - INFO - Stopping worker at tcp://127.0.0.1:54646
2022-07-14 19:24:27,065 - distributed.worker tcp://127.0.0.1:54646 - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-d0c85403-35f8-42b3-82d6-d309a44eff3b Address tcp://127.0.0.1:54646 Status: Status.closing
2022-07-14 19:24:27,077 - distributed.worker tcp://127.0.0.1:54638 - INFO - Stopping worker at tcp://127.0.0.1:54638
2022-07-14 19:24:27,075 - distributed.nanny tcp://127.0.0.1:54646- INFO - Nanny asking worker to close
2022-07-14 19:24:27,065 - distributed.worker tcp://127.0.0.1:54638 - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-d0c85403-35f8-42b3-82d6-d309a44eff3b Address tcp://127.0.0.1:54646 Status: Status.closing
2022-07-14 19:24:27,076 - distributed.nanny tcp://127.0.0.1:54646 - INFO - Nanny asking worker to close
2022-07-14 19:24:27,076 - distributed.nanny tcp://127.0.0.1:54638 - INFO - Nanny asking worker to close

which is a pretty minimal increase in verbosity, but much more meaningful when debugging.

If log line length is a concern, I'd suggest removing the prefix distributed from every worker/nanny/scheduler message.

mrocklin commented 2 years ago

I don't think that it's minimal. I think that it's meaningful enough to warrant consideration (but not necessarily stop things).

If log line length is a concern, I'd suggest removing the prefix distributed from every worker/nanny/scheduler message.

Yeah, that seems sensible to consider as well I think.

On Thu, Jul 14, 2022 at 3:06 PM Gabe Joseph @.***> wrote:

Instead of

2022-07-14 19:24:27,055 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:54646 2022-07-14 19:24:27,065 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-d0c85403-35f8-42b3-82d6-d309a44eff3b Address tcp://127.0.0.1:54646 Status: Status.closing 2022-07-14 19:24:27,077 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:54638 2022-07-14 19:24:27,075 - distributed.nanny - INFO - Nanny asking worker to close 2022-07-14 19:24:27,065 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-d0c85403-35f8-42b3-82d6-d309a44eff3b Address tcp://127.0.0.1:54646 Status: Status.closing 2022-07-14 19:24:27,076 - distributed.nanny - INFO - Nanny asking worker to close 2022-07-14 19:24:27,076 - distributed.nanny - INFO - Nanny asking worker to close

you'd have

2022-07-14 19:24:27,055 - distributed.worker tcp://127.0.0.1:54646 - INFO - Stopping worker at tcp://127.0.0.1:54646 2022-07-14 19:24:27,065 - distributed.worker tcp://127.0.0.1:54646 - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-d0c85403-35f8-42b3-82d6-d309a44eff3b Address tcp://127.0.0.1:54646 Status: Status.closing 2022-07-14 19:24:27,077 - distributed.worker tcp://127.0.0.1:54638 - INFO - Stopping worker at tcp://127.0.0.1:54638 2022-07-14 19:24:27,075 - distributed.nanny tcp://127.0.0.1:54646- INFO - Nanny asking worker to close 2022-07-14 19:24:27,065 - distributed.worker tcp://127.0.0.1:54638 - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-d0c85403-35f8-42b3-82d6-d309a44eff3b Address tcp://127.0.0.1:54646 Status: Status.closing 2022-07-14 19:24:27,076 - distributed.nanny tcp://127.0.0.1:54646 - INFO - Nanny asking worker to close 2022-07-14 19:24:27,076 - distributed.nanny tcp://127.0.0.1:54638 - INFO - Nanny asking worker to close

which is a pretty minimal increase in verbosity, but much more meaningful when debugging.

If log line length is a concern, I'd suggest removing the prefix distributed from every worker/nanny/scheduler message.

— Reply to this email directly, view it on GitHub https://github.com/dask/distributed/issues/6725#issuecomment-1184849205, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACKZTCHY5L6T7KN6XAPRUTVUBXN7ANCNFSM53TE57ZQ . You are receiving this because you commented.Message ID: @.***>

gjoseph92 commented 2 years ago

I also don't need this to be the default when deployed (since typically different machines will aggregate logs separately) (though it's still nice to be able to interleave them sometimes). I just want it in CI.