gjoseph92 opened this issue 3 years ago (status: Open)
I was about to reply saying "oh, you need to look at the `omit=` list" and was pleasantly surprised to see you had already arrived there on your own :)
The simple fix, from my perspective, is to add more files to that list, possibly conditional on what kind of event loop we're running. Is there an obvious file to add if we're running with uvloop or on Windows?
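As a rough sketch of what "conditional on the event loop" could mean (the Windows file names below are my guesses at where a ProactorEventLoop idles, not verified entries; and on uvloop the idle frames live in compiled libuv code, which is part of the problem):

```python
import sys


def build_omit_list() -> tuple:
    # Hypothetical sketch only: choose extra files to omit based on platform.
    omit = ["selectors.py"]  # covers an idle asyncio.SelectorEventLoop
    if sys.platform == "win32":
        # Assumption: ProactorEventLoop blocks in these stdlib modules while idle
        omit += ["windows_events.py", "proactor_events.py"]
    # uvloop idles inside libuv (C code), so there may be no Python file to add
    return tuple(omit)
```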
On Thu, Mar 25, 2021 at 11:09 PM Gabe Joseph wrote:
The "Scheduler (Administrative)" or "Worker (Administrative)" profiles from a performance_report appear very different depending on whether you're using uvloop or not. With uvloop, it shows a lot more total time, and a lot less of it seems to be spent doing "actual work":
Using Tornado: report-tornado.html https://gistcdn.githack.com/gjoseph92/27a44c2f88d54ebed81fc00664a95976/raw/a3ef3b46fc3d60a97c8816d06d74910677f90f13/report-tornado.html [image: Scheduler profile with tornado] https://user-images.githubusercontent.com/3309802/112567322-45332780-8da6-11eb-99c8-182c3f72b981.png
Using uvloop: report-uvloop.html https://gistcdn.githack.com/gjoseph92/27a44c2f88d54ebed81fc00664a95976/raw/a3ef3b46fc3d60a97c8816d06d74910677f90f13/report-uvloop.html
[image: Screen Shot 2021-03-25 at 8 04 25 PM] https://user-images.githubusercontent.com/3309802/112567443-7875b680-8da6-11eb-9d52-ce8e7e0c8b30.png
These are for the same no-op workload:
```python
import time

import dask
import distributed

if __name__ == "__main__":
    event_loop = dask.config.get("distributed.admin.event-loop", None)
    client = distributed.Client(n_workers=0)
    with distributed.performance_report(filename=f"report-{event_loop}.html"):
        # don't even need to do work or have workers to see the difference
        time.sleep(5)
```
```console
$ python profiling-basic.py
$ DASK_DISTRIBUTED__ADMIN__EVENT_LOOP=uvloop python profiling-basic.py
```
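(Equivalently, assuming the config is set before the client and its event loop start, the loop can be selected in code instead of via the environment:)

```python
import dask

# Must run before distributed.Client() creates the event loop
dask.config.set({"distributed.admin.event-loop": "uvloop"})
```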
It seems the difference is caused by:

https://github.com/dask/distributed/blob/bf9ddab3fd67c33b0d87da4f18bcdc7e47262393/distributed/core.py#L186

Basically, having `"selectors.py"` in there causes us to ignore frames where an [asyncio.SelectorEventLoop](https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.SelectorEventLoop) is awaiting some IO event. I assume the intent here is to not show idle time spent waiting for IO, since that's not interesting from a profiling perspective.

However, uvloop bypasses the [selectors](https://docs.python.org/3/library/selectors.html) module entirely, since libuv handles the polling internally. So we don't drop those frames where the loop is waiting on IO, which makes it look like we're spending nearly all of our time doing nothing... because, well, we are.

(FWIW I suspect you'll see the same issue profiling on Windows even with the default asyncio event loop, since it'll use a [ProactorEventLoop](https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.ProactorEventLoop), which doesn't use `selectors` at all. Really, this applies to any event loop besides SelectorEventLoop.)
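(To make that mechanism concrete, here's a minimal sketch of how filename-based omission works in a sampling profiler. This illustrates the idea, not distributed's actual implementation; `record` is a hypothetical stand-in for folding a stack into the profile:)

```python
import sys

OMIT = ("selectors.py",)  # innermost frames in these files mean "the loop is idle"


def should_record(frame) -> bool:
    # If the innermost frame lives in an omitted file, the loop is just
    # blocked waiting for IO there, so drop the whole sample
    return not frame.f_code.co_filename.endswith(OMIT)


def sample(thread_id: int) -> None:
    frame = sys._current_frames().get(thread_id)
    if frame is not None and should_record(frame):
        record(frame)  # hypothetical: merge this stack into the profile tree
```

Under uvloop there's no `selectors.py` frame to match, so every idle sample gets recorded.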
To figure out a solution for this, I removed `"selectors.py"` from the `omit` list and then looked at the most common frames/callstacks we saw while profiling.

Code:

```python
# NOTE: requires `"selectors.py"` to be removed from the `omit` list here:
# https://github.com/dask/distributed/blob/bf9ddab3fd67c33b0d87da4f18bcdc7e47262393/distributed/core.py#L186
import collections
import time

import dask
import distributed


def callstack_id(state):
    id = state["identifier"]
    children = state["children"]
    assert len(children) <= 1, "Run with `distributed.worker.profile.cycle` = 0"
    if not children:
        return (id,)
    return callstack_id(next(iter(children.values()))) + (id,)
    # ^ NOTE: we put `id` last in order to print the callstack from newest frame -> oldest frame.


if __name__ == "__main__":
    # Force every profile state logged to represent exactly 1 frame (don't merge frames every 10ms)
    dask.config.set({"distributed.worker.profile.cycle": 0})
    for config in [
        "distributed.admin.event-loop",
        "distributed.worker.profile.interval",
        "distributed.worker.profile.cycle",
    ]:
        print(f"{config}: {dask.config.get(config, None)}")

    event_loop = dask.config.get("distributed.admin.event-loop", None)
    client = distributed.Client(n_workers=0)
    time.sleep(5)
    states = client.cluster.scheduler.io_loop.profile.copy()
    client.shutdown()
    client.close()

    callstacks = [callstack_id(state[1]) for state in states]
    counts = collections.Counter(callstacks)
    print("Most common callstacks:")
    for callstack, count in counts.most_common():
        print(count)
        print(callstack)
        print()
```
Most commonly-sampled callstack under default asyncio event loop:
```
(distributed-dev) Gabes-MacBook-Pro dev/distributed ‹uvloop-profile*› » python profiling-basic.py
distributed.admin.event-loop: tornado
distributed.worker.profile.interval: 10ms
distributed.worker.profile.cycle: 0
Most common callstacks:
419
select;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/selectors.py;554
_run_once;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/asyncio/base_events.py;1815
run_forever;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/asyncio/base_events.py;583
start;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/site-packages/tornado/platform/asyncio.py;191
run_loop;/Users/gabe/dev/distributed/distributed/utils.py;427
run;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/threading.py;881
_bootstrap_inner;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/threading.py;937
_bootstrap;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/threading.py;898
root
1
root
...
```
Most commonly-sampled callstack under uvloop:
```
(distributed-dev) Gabes-MacBook-Pro dev/distributed ‹uvloop-profile*› » DASK_DISTRIBUTED__ADMIN__EVENT_LOOP=uvloop python profiling-basic.py
distributed.admin.event-loop: uvloop
distributed.worker.profile.interval: 10ms
distributed.worker.profile.cycle: 0
Most common callstacks:
418
start;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/site-packages/tornado/platform/asyncio.py;191
run_loop;/Users/gabe/dev/distributed/distributed/utils.py;427
run;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/threading.py;881
_bootstrap_inner;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/threading.py;937
_bootstrap;/Users/gabe/miniconda3/envs/distributed-dev/lib/python3.9/threading.py;898
root
1
root
...
```
Looking at that, there isn't one obvious file we should just add to `omit` (`tornado/platform/asyncio.py` doesn't seem like a good idea).

We could try targeting our own `LoopRunner` as the thing to omit (since that should apply to any event loop); we'd just move `LoopRunner` out of `distributed/utils.py` and into its own file. However, that probably won't work for `asynchronous=True` schedulers, since the `LoopRunner` becomes a pass-through. A rough sketch of the idea is below.
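(Only a sketch, with two assumptions: `LoopRunner` has moved to a hypothetical `distributed/loop_runner.py`, and the profiler walks the whole stack instead of matching just the innermost frame. The plumbing file list is illustrative, not exhaustive:)

```python
# Illustrative only: files that are pure event-loop plumbing.
# "loop_runner.py" is the hypothetical new home for LoopRunner.
LOOP_PLUMBING = (
    "loop_runner.py",
    "tornado/platform/asyncio.py",
    "asyncio/base_events.py",
    "selectors.py",
)


def is_idle_sample(frame) -> bool:
    """True when every frame from the innermost one down to run_loop is plumbing."""
    while frame is not None:
        if not frame.f_code.co_filename.endswith(LOOP_PLUMBING):
            return False  # real work somewhere above the loop: keep the sample
        if frame.f_code.co_name == "run_loop":
            return True  # reached our runner with only plumbing above it: idle
        frame = frame.f_back
    return False  # never saw run_loop (e.g. asynchronous=True): keep the sample
```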
Environment:
- Dask version: [bf9ddab](https://github.com/dask/distributed/commit/bf9ddab3fd67c33b0d87da4f18bcdc7e47262393)
- Python version: 3.9.2
- Operating System: macOS
- Install method (conda, pip, source): source
The "Scheduler (Administrative)" or "Worker (Administrative)" profiles from a
performance_report
appear very different depending on whether you're usinguvloop
or not. Withuvloop
, it shows a lot more total time, and a lot less of it seems to be spent doing "actual work":Using Tornado:
report-tornado.html
Using uvloop:
report-uvloop.html
These are for the same no-op workload:
It seems the difference is caused by: https://github.com/dask/distributed/blob/bf9ddab3fd67c33b0d87da4f18bcdc7e47262393/distributed/core.py#L186 Basically, having
"selectors.py"
in there causes us to ignore frames where an asyncio.SelectorEventLoop isawait
ing some IO event. I assume the intent here is to not show idle time spent waiting for IO, since that's not interesting from a profiling perspective.However, uvloop bypasses the selectors module entirely, since libuv handles the polling internally. So we don't drop those frames where the loop is waiting on IO, which makes it look like we're spending nearly all of our time doing nothing... because, well, we are.
(FWIW I suspect you'll see the same issue profiling on Windows even with the default asyncio event loop, since it'll use a
ProactorEventLoop
which doesn't useselectors
at all. Really, with any event loop besidesSelectorEventLoop
.)To figure out a solution for this, I removed
"selectors.py"
from theomit
list and then looked at the most common frames/callstacks we saw while profiling:Code:
```python # NOTE: requires `"selectors.py"` to be removed from the `omit` list here: # https://github.com/dask/distributed/blob/bf9ddab3fd67c33b0d87da4f18bcdc7e47262393/distributed/core.py#L186 import time import collections import dask import distributed def callstack_id(state): id = state["identifier"] children = state["children"] assert len(children) <= 1, "Run with `distributed.worker.profile.cycle` = 0" if not children: return (id,) return callstack_id(next(iter(children.values()))) + (id,) # ^ NOTE: we put `id` last in order to print the callstack from newset frame -> oldest frame. if __name__ == "__main__": # Force every profile state logged to represent exactly 1 frame (don't merge frames every 10ms) dask.config.set({"distributed.worker.profile.cycle": 0}) for config in [ "distributed.admin.event-loop", "distributed.worker.profile.interval", "distributed.worker.profile.cycle", ]: print(f"{config}: {dask.config.get(config, None)}") event_loop = dask.config.get("distributed.admin.event-loop", None) client = distributed.Client(n_workers=0) time.sleep(5) states = client.cluster.scheduler.io_loop.profile.copy() client.shutdown() client.close() callstacks = [callstack_id(state[1]) for state in states] counts = collections.Counter(callstacks) print("Most common callstacks:") for (callstack, count) in counts.most_common(): print(count) print(callstack) print() ```Most commonly-sampled callstack under default asyncio event loop:
Most commonly-sampled callstack under uvloop:
Looking at that, there isn't one obvious file we should just add to
omit
(tornado/platform/asyncio.py
doesn't seem like a good idea).We could try targeting our own
LoopRunner
as the thing to omit (since that should apply to any event loop). We'd just moveLoopRunner
out ofdistributed/utils.py
and into its own file. However, that probably won't work forasynchronous=True
schedulers, since the LoopRunner becomes a pass-through.Environment: