dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License
1.57k stars 719 forks source link

Different administrative profiling results under `uvloop` #4636

Open gjoseph92 opened 3 years ago

gjoseph92 commented 3 years ago

The "Scheduler (Administrative)" or "Worker (Administrative)" profiles from a performance_report appear very different depending on whether you're using uvloop or not. With uvloop, it shows a lot more total time, and a lot less of it seems to be spent doing "actual work":

Using Tornado: report-tornado.html Scheduler profile with tornado

Using uvloop: report-uvloop.html

Screen Shot 2021-03-25 at 8 04 25 PM

These are for the same no-op workload:

import time
import dask
import distributed

if __name__ == "__main__":
    event_loop = dask.config.get("distributed.admin.event-loop", None)

    client = distributed.Client(n_workers=0)

    with distributed.performance_report(filename=f"report-{event_loop}.html"):
        # don't even need to do work or have workers to see the difference
        time.sleep(5)
$ python profiling-basic.py
$ DASK_DISTRIBUTED__ADMIN__EVENT_LOOP=uvloop python profiling-basic.py

It seems the difference is caused by: https://github.com/dask/distributed/blob/bf9ddab3fd67c33b0d87da4f18bcdc7e47262393/distributed/core.py#L186 Basically, having "selectors.py" in there causes us to ignore frames where an asyncio.SelectorEventLoop is awaiting some IO event. I assume the intent here is to not show idle time spent waiting for IO, since that's not interesting from a profiling perspective.

However, uvloop bypasses the selectors module entirely, since libuv handles the polling internally. So we don't drop those frames where the loop is waiting on IO, which makes it look like we're spending nearly all of our time doing nothing... because, well, we are.

(FWIW I suspect you'll see the same issue profiling on Windows even with the default asyncio event loop, since it'll use a ProactorEventLoop which doesn't use selectors at all. Really, with any event loop besides SelectorEventLoop.)

To figure out a solution for this, I removed "selectors.py" from the omit list and then looked at the most common frames/callstacks we saw while profiling:

Code: ```python # NOTE: requires `"selectors.py"` to be removed from the `omit` list here: # https://github.com/dask/distributed/blob/bf9ddab3fd67c33b0d87da4f18bcdc7e47262393/distributed/core.py#L186 import time import collections import dask import distributed def callstack_id(state): id = state["identifier"] children = state["children"] assert len(children) <= 1, "Run with `distributed.worker.profile.cycle` = 0" if not children: return (id,) return callstack_id(next(iter(children.values()))) + (id,) # ^ NOTE: we put `id` last in order to print the callstack from newset frame -> oldest frame. if __name__ == "__main__": # Force every profile state logged to represent exactly 1 frame (don't merge frames every 10ms) dask.config.set({"distributed.worker.profile.cycle": 0}) for config in [ "distributed.admin.event-loop", "distributed.worker.profile.interval", "distributed.worker.profile.cycle", ]: print(f"{config}: {dask.config.get(config, None)}") event_loop = dask.config.get("distributed.admin.event-loop", None) client = distributed.Client(n_workers=0) time.sleep(5) states = client.cluster.scheduler.io_loop.profile.copy() client.shutdown() client.close() callstacks = [callstack_id(state[1]) for state in states] counts = collections.Counter(callstacks) print("Most common callstacks:") for (callstack, count) in counts.most_common(): print(count) print(callstack) print() ```

Most commonly-sampled callstack under default asyncio event loop:

(distributed-dev) Gabes-MacBook-Pro dev/distributed ‹uvloop-profile*› » python profiling-basic.py
distributed.admin.event-loop: tornado
distributed.worker.profile.interval: 10ms
distributed.worker.profile.cycle: 0
Most common callstacks:
419
   select;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/selectors.py;554
   _run_once;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/asyncio/base_events.py;1815
   run_forever;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/asyncio/base_events.py;583
   start;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/site-packages/tornado/platform/asyncio.py;191
   run_loop;/Users/gabe/dev/distributed/distributed/utils.py;427
   run;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/threading.py;881
   _bootstrap_inner;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/threading.py;937
   _bootstrap;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/threading.py;898
   root

1
   root

...

Most commonly-sampled callstack under uvloop:

(distributed-dev) Gabes-MacBook-Pro dev/distributed ‹uvloop-profile*› » DASK_DISTRIBUTED__ADMIN__EVENT_LOOP=uvloop python profiling-basic.py
distributed.admin.event-loop: uvloop
distributed.worker.profile.interval: 10ms
distributed.worker.profile.cycle: 0
Most common callstacks:
418
   start;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/site-packages/tornado/platform/asyncio.py;191
   run_loop;/Users/gabe/dev/distributed/distributed/utils.py;427
   run;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/threading.py;881
   _bootstrap_inner;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/threading.py;937
   _bootstrap;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/threading.py;898
   root

1
   root

...

Looking at that, there isn't one obvious file we should just add to omit (tornado/platform/asyncio.py doesn't seem like a good idea).

We could try targeting our own LoopRunner as the thing to omit (since that should apply to any event loop). We'd just move LoopRunner out of distributed/utils.py and into its own file. However, that probably won't work for asynchronous=True schedulers, since the LoopRunner becomes a pass-through.

Environment:

mrocklin commented 3 years ago

I was about to reply saying "oh, you need to look at the omit= list" and was pleasantly surprised when you already arrive there on your own :)

The simple fix to this, from my perspective, is to add more files to that list, possibly conditional on what kind of event loop we're running. Is there an obvious file to add if we're running with uvloop or on Windows?

On Thu, Mar 25, 2021 at 11:09 PM Gabe Joseph @.***> wrote:

The "Scheduler (Administrative)" or "Worker (Administrative)" profiles from a performance_report appear very different depending on whether you're using uvloop or not. With uvloop, it shows a lot more total time, and a lot less of it seems to be spent doing "actual work":

Using Tornado: report-tornado.html https://gistcdn.githack.com/gjoseph92/27a44c2f88d54ebed81fc00664a95976/raw/a3ef3b46fc3d60a97c8816d06d74910677f90f13/report-tornado.html [image: Scheduler profile with tornado] https://user-images.githubusercontent.com/3309802/112567322-45332780-8da6-11eb-99c8-182c3f72b981.png

Using uvloop: report-uvloop.html https://gistcdn.githack.com/gjoseph92/27a44c2f88d54ebed81fc00664a95976/raw/a3ef3b46fc3d60a97c8816d06d74910677f90f13/report-uvloop.html

[image: Screen Shot 2021-03-25 at 8 04 25 PM] https://user-images.githubusercontent.com/3309802/112567443-7875b680-8da6-11eb-9d52-ce8e7e0c8b30.png

These are for the same no-op workload:

import time import dask import distributed

if name == "main":

event_loop = dask.config.get("distributed.admin.event-loop", None)

client = distributed.Client(n_workers=0)

with distributed.performance_report(filename=f"report-{event_loop}.html"):

    # don't even need to do work or have workers to see the difference

    time.sleep(5)

$ python profiling-basic.py

$ DASK_DISTRIBUTEDADMINEVENT_LOOP=uvloop python profiling-basic.py

It seems the difference is caused by:

https://github.com/dask/distributed/blob/bf9ddab3fd67c33b0d87da4f18bcdc7e47262393/distributed/core.py#L186 Basically, having "selectors.py" in there causes us to ignore frames where an asyncio.SelectorEventLoop https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.SelectorEventLoop is awaiting some IO event. I assume the intent here is to not show idle time spent waiting for IO, since that's not interesting from a profiling perspective.

However, uvloop bypasses the selectors https://docs.python.org/3/library/selectors.html module entirely, since libuv handles the polling internally. So we don't drop those frames where the loop is waiting on IO, which makes it look like we're spending nearly all of our time doing nothing... because, well, we are.

(FWIW I suspect you'll see the same issue profiling on Windows even with the default asyncio event loop, since it'll use a ProactorEventLoop https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.ProactorEventLoop which doesn't use selectors at all. Really, with any event loop besides SelectorEventLoop.)

To figure out a solution for this, I removed "selectors.py" from the omit list and then looked at the most common frames/callstacks we saw while profiling: Code:

NOTE: requires "selectors.py" to be removed from the omit list here:

https://github.com/dask/distributed/blob/bf9ddab3fd67c33b0d87da4f18bcdc7e47262393/distributed/core.py#L186

import time import collections

import dask import distributed

def callstack_id(state):

id = state["identifier"]

children = state["children"]

assert len(children) <= 1, "Run with `distributed.worker.profile.cycle` = 0"

if not children:

    return (id,)

return callstack_id(next(iter(children.values()))) + (id,)

# ^ NOTE: we put `id` last in order to print the callstack from newset frame -> oldest frame.

if name == "main":

# Force every profile state logged to represent exactly 1 frame (don't merge frames every 10ms)

dask.config.set({"distributed.worker.profile.cycle": 0})

for config in [

    "distributed.admin.event-loop",

    "distributed.worker.profile.interval",

    "distributed.worker.profile.cycle",

]:

    print(f"{config}: {dask.config.get(config, None)}")

event_loop = dask.config.get("distributed.admin.event-loop", None)

client = distributed.Client(n_workers=0)

time.sleep(5)

states = client.cluster.scheduler.io_loop.profile.copy()

client.shutdown()

client.close()

callstacks = [callstack_id(state[1]) for state in states]

counts = collections.Counter(callstacks)

print("Most common callstacks:")

for (callstack, count) in counts.most_common():

    print(count)

    print(callstack)

    print()

Most commonly-sampled callstack under default asyncio event loop:

(distributed-dev) Gabes-MacBook-Pro dev/distributed ‹uvloop-profile*› » python profiling-basic.py

distributed.admin.event-loop: tornado

distributed.worker.profile.interval: 10ms

distributed.worker.profile.cycle: 0

Most common callstacks:

419

select;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/selectors.py;554

_run_once;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/asyncio/base_events.py;1815

run_forever;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/asyncio/base_events.py;583

start;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/site-packages/tornado/platform/asyncio.py;191

run_loop;/Users/gabe/dev/distributed/distributed/utils.py;427

run;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/threading.py;881

_bootstrap_inner;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/threading.py;937

_bootstrap;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/threading.py;898

root

1

root

...

Most commonly-sampled callstack under uvloop:

(distributed-dev) Gabes-MacBook-Pro dev/distributed ‹uvloop-profile*› » DASK_DISTRIBUTEDADMINEVENT_LOOP=uvloop python profiling-basic.py

distributed.admin.event-loop: uvloop

distributed.worker.profile.interval: 10ms

distributed.worker.profile.cycle: 0

Most common callstacks:

418

start;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/site-packages/tornado/platform/asyncio.py;191

run_loop;/Users/gabe/dev/distributed/distributed/utils.py;427

run;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/threading.py;881

_bootstrap_inner;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/threading.py;937

_bootstrap;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/threading.py;898

root

1

root

...

Looking at that, there isn't one obvious file we should just add to omit ( tornado/platform/asyncio.py doesn't seem like a good idea).

We could try targeting our own LoopRunner as the thing to omit (since that should apply to any event loop). We'd just move LoopRunner out of distributed/utils.py and into its own file. However, that probably won't work for asynchronous=True schedulers, since the LoopRunner becomes a pass-through.

Environment:

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/dask/distributed/issues/4636, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACKZTB7JIJNTBPXF2TONHTTFQCJBANCNFSM4Z2UNNTA .