cooperative-computing-lab / cctools

The Cooperative Computing Tools (cctools) enable large scale distributed computations to harness hundreds to thousands of machines from clusters, clouds, and grids.
http://ccl.cse.nd.edu

Parsl-Vine: Serial Task Latency, Round 2 #3894

Open dthain opened 1 month ago

dthain commented 1 month ago

@colinthomas-z80 fixed some corner cases in the Parsl-TaskVine executor in #3432 that improved the sequentially chained task latency from about 1 task/s to about 6 tasks/s. However, we should be able to do better than this: @gpauloski observed similar performance in #3892

While this isn't the ideal use case for TaskVine (it would be better to submit many tasks at once) it is a useful benchmark that stresses the critical path through the executor module. Let's figure out what the bottleneck is, and how we can do better.
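For reference, a sequentially chained workload of the kind described here might look roughly like the sketch below. This is a hypothetical example using Parsl's python_app API; the config import path is an assumption based on the taskvine_ex.py test config referenced in the comments that follow.

```python
# Hypothetical sketch of a sequentially chained workload in Parsl.
# The config import path is assumed to match the taskvine_ex.py test
# config mentioned later in this thread.
import parsl
from parsl import python_app
from parsl.tests.configs.taskvine_ex import fresh_config  # assumed path

@python_app
def increment(x):
    return x + 1

parsl.load(fresh_config())

# Each call receives the previous AppFuture, so the tasks form a chain
# and only one is ever in flight: this measures per-task round-trip
# latency through the executor, not bulk throughput.
fut = increment(0)
for _ in range(99):
    fut = increment(fut)
print(fut.result())
```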

benclifford commented 1 month ago

Using parsl-perf, the parsl "generate a load of tasks" performance tool, gives me these results. This is all in the parsl github repo so you should be able to run these commands yourself and generate your own equivalent or contradictory results.

It's submitting a bunch of tasks, not sequential tasks - so this issue may be mistitled.

Baseline (parsl with in-process execution - no fancy execution tech at all):

$ parsl-perf --config parsl/tests/configs/local_threads.py --time 30
...
Tasks per second: 2001.290

Using parsl's HighThroughputExecutor:

$ parsl-perf --config parsl/tests/configs/htex_local.py --time 30
Tasks per second: 1038.789

Using Work Queue + coprocesses:

This one slows down quite dramatically as you increase the target time (aka increase the number of tasks at once) so I think there's some linear-access-cost data structure somewhere...

$ parsl-perf --config parsl/tests/configs/workqueue_ex.py --time 30
Tasks per second: 340.689

So all of the above are in a pretty decent 300-2000 tasks per second range. Probably we don't need to quibble too much about the exact numbers there because even things like changing Python logging configuration can change things drastically at this task rate.

Using Task Vine (in regular mode, but performance is only slightly better with serverless mode enabled):

$ parsl-perf --config parsl/tests/configs/taskvine_ex.py --time 30
...
==== Iteration 2 ====
Will run 50 tasks to target 30.0 seconds runtime
Submitting tasks / invoking apps
All 50 tasks submitted ... waiting for completion
Submission took 0.010 seconds = 4813.073 tasks/second
Runtime: actual 25.673s vs target 30.0s
Tasks per second: 1.948

With different target durations (and so different numbers of tasks submitted) this is pretty consistent around 2 tasks/second - I tried with a 10 minute target runtime and the resulting 1043 tasks execute/complete at 1.7 tasks per second.

That ~2 tasks/second, much lower than the 300-2000 tasks/second of the other executors, is what concerns me.

I have not dug into this any deeper than that - it just came prominently to my attention when talking to @gpauloski about what became #3892. It's probably worth me spending an hour or two on this from a parsl perspective though.

dthain commented 1 month ago

Yikes, something is way off there. Does that last command generate TV manager/worker logs that we can see?

benclifford commented 1 month ago

attached is the parsl rundir generated by $ parsl-perf --config parsl/tests/configs/taskvine_ex.py --time 30

using master branch cctools 4b43490dc8bbbc93472ef3e228828b657c92e1cb

$ vine_worker --version
vine_worker version 8.0.0 DEVELOPMENT (released 2024-07-31 17:16:38 +0000)
    Built by benc@parsl-dev-3-12-3510 on 2024-07-31 17:16:38 +0000
    System: Linux parsl-dev-3-12-3510 6.5.0-44-generic #44~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Jun 18 14:36:16 UTC 2 x86_64 GNU/Linux
    Configuration: --prefix=/home/benc/parsl/src/cctools/I

TaskVine logs and other parsl stuff are in there.

cc-issue-3894.zip

dthain commented 1 month ago

Ok, so here is what I can see from a quick look at the manager log:

2024/07/31 17:24:34.67 vine_manager[240872] vine: parsl-dev-3-12-3510 (127.0.0.1:59416) busy on 'python3 exec_parsl_function.py map function argument result'
2024/07/31 17:24:34.67 vine_manager[240872] vine: Task 9 state change: READY (1) to RUNNING (2)
2024/07/31 17:24:34.67 vine_manager[240872] vine: rx from parsl-dev-3-12-3510 (127.0.0.1:59416): info tasks_running 1
2024/07/31 17:24:35.19 vine_manager[240872] vine: rx from parsl-dev-3-12-3510 (127.0.0.1:59416): cache-update file-rnd-pdfqcnirzsqkkoq 1 0 8 33188 510444 1722446674673169 X

@colinthomas-z80 I'll let you take it from here - can you set up the Python benchmark and see what we get?

benclifford commented 1 month ago

telling tasks to use a single core does speed things up by 2.5x-ish

$ parsl-perf --config parsl/tests/configs/taskvine_ex.py --time 30 --resources '{"cores":1, "memory":1, "disk":1}'
...
Tasks per second: 4.989

To avoid Python startup costs, I guess I should be pushing on serverless mode, in the same way that we prefer coprocess mode with WQ... that worked earlier today (with slightly improved performance but still around the 2 tasks/second rate) when I was using a build from April but now seems to be failing with a cctools build from today. I'll debug that a little bit and write more here.

Serverless mode isn't tested in our CI - although I would like it to be.

benclifford commented 1 month ago

ok, looked like I needed to make clean harder...

Here's a config for serverless mode: (not in our git repo, that's why I'm pasting it here)

$ cat parsl/tests/configs/taskvine_serverless.py 
from parsl.config import Config
from parsl.data_provider.file_noop import NoOpFileStaging
from parsl.data_provider.ftp import FTPInTaskStaging
from parsl.data_provider.http import HTTPInTaskStaging
from parsl.executors.taskvine import TaskVineExecutor, TaskVineManagerConfig

def fresh_config():
    return Config(executors=[TaskVineExecutor(manager_config=TaskVineManagerConfig(port=9000, shared_fs=True),
                                              worker_launch_method='factory', function_exec_mode='serverless')])

and parsl-perf runs like this:

$ parsl-perf --config parsl/tests/configs/taskvine_serverless.py --time 30 --resources '{"cores":1, "memory":1, "disk":1}'
...
==== Iteration 2 ====
Will run 66 tasks to target 30.0 seconds runtime
...
Tasks per second: 2.430

dthain commented 1 month ago

Yep something stinks there, thank you for pointing it out. Will let Colin dig into what's going on...

colinthomas-z80 commented 1 month ago

I get proportionally the same result on a 16 core machine.

parsl-perf --config parsl/tests/configs/taskvine_ex.py --time 30 --resources '{"cores":1, "memory":1, "disk":1}'

Runtime: actual 31.904s vs target 30.0s
Tasks per second: 15.860
Cleaning up DFK
The end

Continuing some tests. I don't think we are seeing the same blocking wait cycle as before.

colinthomas-z80 commented 1 month ago

So far there seem to be two components to the issue:

The first is a bug currently being worked on in #3889: when running n-core tasks on a local n-core worker, the requested disk allocation can exceed the remaining disk and cause the task to be forsaken. This happens periodically while running the performance benchmark. It is likely the cause of the variability in throughput, since its occurrence is unpredictable, but even if it never occurred the throughput would still be limited to around 4 tasks per second.

The main throughput issue seems to reside on the worker, in the overhead of the exec_parsl_function.py script. The running time of each task recorded at the worker is around 0.24 seconds, which accounts for just about all of the time behind the ~4 tasks per second limit. I will study the script and perhaps wrestle with a Python gprof equivalent. @tphung3 might have some ideas as well.

tphung3 commented 1 month ago

That's interesting. Work Queue has a similar exec_parsl_function.py. I wonder what the number is for WQ. @colinthomas-z80 the fastest way to measure the overhead of exec_parsl_function.py alone is probably to temporarily stick some logging at the start and end of that script. This should help locate whether the true overhead is in the script or in the taskvine worker.
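That instrumentation might look roughly like the sketch below: wall-clock timestamps at entry and exit, written to stderr so they show up in the task output captured on the worker. This is only a sketch; the actual structure of exec_parsl_function.py's entry point is not shown here.

```python
# Hypothetical instrumentation sketch for exec_parsl_function.py:
# record wall-clock timestamps at the start and end of the script
# and write them to stderr so they appear in the worker's task output.
import sys
import time

_t_start = time.time()
print(f"exec_parsl_function start {_t_start:.6f}", file=sys.stderr)

# ... existing body of exec_parsl_function.py runs here ...

_t_end = time.time()
print(f"exec_parsl_function end {_t_end:.6f} "
      f"elapsed {_t_end - _t_start:.3f}s", file=sys.stderr)
```

Comparing the elapsed time printed here with the ~0.24 s per task recorded by the worker should show how much of the overhead is interpreter startup plus the script body versus worker-side handling.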

benclifford commented 1 month ago

@tphung3 I pasted an example run for WQ using coprocesses in an earlier comment, as well as TaskVine serverless mode, which I think is supposed to be roughly equivalent?

https://github.com/cooperative-computing-lab/cctools/issues/3894#issuecomment-2260985918

There I get around 500 tasks per second when parsl-perf submits 4000 tasks, dropping off as it submits bigger batches.

If I turn off coprocesses and use this config, I get around 3 tasks per second:

from parsl.config import Config
from parsl.data_provider.file_noop import NoOpFileStaging
from parsl.data_provider.ftp import FTPInTaskStaging
from parsl.data_provider.http import HTTPInTaskStaging
from parsl.executors import WorkQueueExecutor

def fresh_config():
    return Config(executors=[WorkQueueExecutor(port=9000,
                                               coprocess=False)])

$ parsl-perf --config ./wqc.py --time 30

==== Iteration 8 ====
Will run 98 tasks to target 30.0 seconds runtime
Submitting tasks / invoking apps
All 98 tasks submitted ... waiting for completion
Submission took 0.022 seconds = 4486.719 tasks/second
Runtime: actual 34.504s vs target 30.0s
Tasks per second: 2.840

So I feel like, for this kind of performance, the coprocess-equivalent mode (not starting a new Python process each time) is the way to go - which is what the TaskVine serverless examples I pasted above use, right?

Loading a new Python environment per task is going to be stuck at this rate no matter how efficiently you launch that heavyweight process, so I think it's futile to pursue here.

benclifford commented 1 month ago

So I'll phrase my initial problem a little differently, perhaps: when I use the TaskVine mode that I thought was meant to make things faster by not launching a new Python process per task, it does not go faster.

dthain commented 1 month ago

Right, the problem here is not so much that TaskVine+PythonTask has overhead (we knew that) but that TaskVine+FunctionCall is not significantly faster.

@colinthomas-z80 what numbers do you get with the taskvine serverless config? I am wondering if perhaps the executed function has an internal import that needs to be hoisted...

benclifford commented 1 month ago

Unless you're doing something to change import behaviour, importing a module that is already imported into the current Python process should be low-cost, I think - roughly the cost of checking whether that import already happened - e.g.:

cost to start a python process that imports parsl:

$ time python3 im.py 

real    0m0.496s
user    0m0.435s
sys 0m0.053s

cost to start a python process that imports parsl 1,000,000 times in a loop:

real    0m0.608s
user    0m0.519s
sys 0m0.088s
$ cat im.py 

for n in range(1000000):
  import parsl

colinthomas-z80 commented 1 month ago

About the same for me

parsl-perf --config parsl/tests/configs/taskvine_serverless.py --time 10 --resources '{"cores":1, "memory":1, "disk":1}'

==== Iteration 1 ====
Will run 10 tasks to target 10.0 seconds runtime
Submitting tasks / invoking apps
All 10 tasks submitted ... waiting for completion
Submission took 0.003 seconds = 3104.592 tasks/second
Runtime: actual 2.892s vs target 10.0s
Tasks per second: 3.458
==== Iteration 2 ====
Will run 34 tasks to target 10.0 seconds runtime
Submitting tasks / invoking apps
All 34 tasks submitted ... waiting for completion
Submission took 0.007 seconds = 4759.414 tasks/second
Runtime: actual 9.946s vs target 10.0s
Tasks per second: 3.418
Cleaning up DFK
The end

JinZhou5042 commented 1 month ago

@gpauloski Could you send me the logs of some of the runs you are having trouble with?

benclifford commented 1 month ago

@JinZhou5042 you can find logs for the runs causing problems in this issue #3894 by running the commands that @colinthomas-z80 has pasted above and looking in the runinfo/ directory.

@gpauloski 's issue is #3892 and is different to this #3894 issue - his issue does not directly involve Parsl or the Parsl+TaskVine interface.

colinthomas-z80 commented 1 month ago

Using a basic benchmark of TaskVine Python Task without parsl I achieved a throughput of 8-10 tasks per second.

I also approximated the parsl environment by adding some files to the no-op task - the argument, map, and result files that Parsl/TaskVine would normally use - and it did not significantly affect the result.

Although this is slightly better than the TaskVine executor performance, I think we have established the limits of creating a new Python process for each task.
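For context, a benchmark along those lines might look roughly like the sketch below, assuming the ndcctools.taskvine Python bindings; the no-op function, port, and task count are placeholders rather than the exact benchmark used.

```python
# Rough sketch of a no-op PythonTask throughput benchmark, assuming
# the ndcctools.taskvine Python bindings. Requires a vine_worker to
# be started against the manager's port.
import time
import ndcctools.taskvine as vine

def noop():
    return 0

m = vine.Manager(port=9123)
print(f"listening on port {m.port}; start a vine_worker against it")

n = 100
start = time.time()
for _ in range(n):
    t = vine.PythonTask(noop)
    t.set_cores(1)  # mirror the single-core resource setting used above
    m.submit(t)

done = 0
while not m.empty():
    if m.wait(5):
        done += 1

elapsed = time.time() - start
print(f"{done} tasks in {elapsed:.1f}s = {done/elapsed:.1f} tasks/s")
```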

dthain commented 1 month ago

I think that's totally within the expected performance of PythonTask. So now let's figure out why serverless is not doing substantially better than that...

JinZhou5042 commented 1 month ago

I tried the serverless mode and the program doesn't complete at the second iteration

[screenshot: QQ_1722886759403]

dthain commented 1 month ago

It looks like your software environment is coffea2024.4.0, which likely has a bunch of stuff not needed here. Can you guys establish a common Conda working environment?

benclifford commented 1 month ago

@JinZhou5042 once you've got a consistent environment with @colinthomas-z80 like @dthain suggests - if it still fails in the way you are showing above, I can see that it successfully ran a few tasks, so runinfo/ should have some more info about what's going on when the second batch (of 71 tasks) is submitted.

colinthomas-z80 commented 1 month ago

With #3909 merged in, we now have a relatively consistent result using a serverless config with the performance benchmark. However, the throughput achieved is only marginally better than a regular Python Task: around 10-12 tasks/second on my local machine.

In contrast to function invocations made purely through the TaskVine API, the parsl/taskvine implementation requires serializing a number of files for each invocation, and perhaps some redundant communication not present in the base implementation. My understanding is that this is necessary to accommodate the runtime-declared functions handed to the executor by parsl, rather than the static library creation approach we have here.

We will need to find out whether improvements can be made, or whether inspiration can be taken from the WQ coprocess implementation, which we see performs well - perhaps at some expense to the "plug and play" nature of the current implementation.

colinthomas-z80 commented 2 weeks ago

I opened a PR against parsl adding the parsl serialize module to the "hoisted" preamble of the library context. It seems we were not effectively caching the module beforehand. Doing this improved throughput in parsl-perf to 30-40 tasks per second on a single core.

https://github.com/Parsl/parsl/pull/3604
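
In plain TaskVine terms, the idea is to hoist the module into the library context so it is imported once when the library process starts rather than on every function call. The sketch below is an assumption-laden illustration, not the actual Parsl executor code: it assumes create_library_from_functions accepts a hoisting_modules argument, as in recent ndcctools.taskvine releases.

```python
# Sketch of hoisting an import into a serverless library context.
# Assumes create_library_from_functions(..., hoisting_modules=...) as
# in recent ndcctools.taskvine releases; not the actual Parsl code.
import ndcctools.taskvine as vine
import parsl.serialize  # module we want loaded once in the library process

def run_parsl_function(payload):
    # The hoisted module is already imported in the library process,
    # so this import is just a cheap cache lookup per call.
    import parsl.serialize
    return parsl.serialize.deserialize(payload)

m = vine.Manager(port=9123)
lib = m.create_library_from_functions(
    "parsl-lib",
    run_parsl_function,
    hoisting_modules=[parsl.serialize],
)
m.install_library(lib)
# FunctionCall tasks can then invoke run_parsl_function by name
# against "parsl-lib" without paying the import cost each time.
```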