cooperative-computing-lab / cctools

The Cooperative Computing Tools (cctools) enable large scale distributed computations to harness hundreds to thousands of machines from clusters, clouds, and grids.
http://ccl.cse.nd.edu

Parsl/WQ: work_queue_worker spawns two tasks when it has been given three #3593

Open benclifford opened 11 months ago

benclifford commented 11 months ago

This is with WQ 7.6.1, for a user I am supporting at DESC.

Running a user application, the Work Queue manager assigns 3 tasks to a worker (3 being the number of tasks that fit into the worker's resources).

However, after a while the worker only ever has two task processes running at once.

In the worker debug output, I see a very long delay between a task arriving at the worker and the corresponding process being launched: a new task process isn't launched until a previous process ends, with the task sitting idle on the worker for (in the case below) over 3 minutes.

In the attached tarball, for example, submit_scripts/parsl.work_queue.block-0.1702576992.2495306.sh.err contains work_queue_worker -d all output, and one wrong-looking task is:

2023/12/14 12:10:04.64 work_queue_worker[2101795] wq: rx: task 42

2023/12/14 12:13:18.08 work_queue_worker[2101795] wq: task 58 (pid 690353) exited normally with exit code 0

2023/12/14 12:13:18.09 work_queue_worker[2101795] wq: started process 882570: python3 exec_parsl_function.py map function result

2023/12/14 12:21:50.89 work_queue_worker[2101795] wq: task 42 (pid 882570) exited normally with exit code 0

Other logs for WQ are in 009/work_queue

I couldn't see anything immediately obvious about this in the git log between version 7.6.1 and now.

jchiang-exhaustion.tar.gz

dthain commented 10 months ago

Hmm, I think your intuition is correct here. About the only reason a worker would delay running a task is insufficient available resources. However, the manager should not generally be sending workers more than it thinks they can run right away. (Unless "overcommit" is enabled in vine_tune.)

Here is what I see in the transaction_log. The tasks are submitted with an initial guess of 5GB memory and one core each:

1702576991888399 2101488 TASK 1 WAITING 00038065_008_011 FIRST_RESOURCES {"memory":[5120,"MB"],"disk":[0,"MB"],"cores":[1,"cores"]}
1702576992559713 2101488 TASK 2 WAITING 00038065_022_025 FIRST_RESOURCES {"memory":[5120,"MB"],"disk":[0,"MB"],"cores":[1,"cores"]}
1702576992560395 2101488 TASK 3 WAITING 00038065_026_029 FIRST_RESOURCES {"memory":[5120,"MB"],"disk":[0,"MB"],"cores":[1,"cores"]}

The manager assigns tasks to the worker. It looks to be in the round-up mode: since three 5GB tasks fit into the 16GB of RAM available, it divides the available resources into three and gives each task 5333MB of memory and 85 cores. So far so good.

1702576992598877 2101488 TASK 51 RUNNING 128.55.64.46:37894  FIRST_RESOURCES {"memory":[5333,"MB"],"disk":[144686,"MB"],"gpus":[0,"gpus"],"cores":[85,"cores"]}
1702576992599033 2101488 TRANSFER INPUT 48 1 0.000000 0.000005 exec_parsl_function.py
1702576992599444 2101488 TRANSFER INPUT 48 0 0.008276 0.000386 function
1702576992599867 2101488 TRANSFER INPUT 48 0 0.000127 0.000369 map
1702576992600385 2101488 TASK 48 RUNNING 128.55.64.46:37894  FIRST_RESOURCES {"memory":[5333,"MB"],"disk":[144686,"MB"],"gpus":[0,"gpus"],"cores":[85,"cores"]}
1702576992600554 2101488 TRANSFER INPUT 43 1 0.000000 0.000004 exec_parsl_function.py
1702576992600936 2101488 TRANSFER INPUT 43 0 0.008272 0.000358 function
1702576992601311 2101488 TRANSFER INPUT 43 0 0.000127 0.000324 map
1702576992601814 2101488 TASK 43 RUNNING 128.55.64.46:37894  FIRST_RESOURCES {"memory":[5333,"MB"],"disk":[144686,"MB"],"gpus":[0,"gpus"],"cores":[85,"cores"]}

And then looking at the worker log, we can clearly see that there are three tasks running, and then at some point it drops down to two and doesn't come back.
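
As a rough illustration of that proportional division (this is just arithmetic, not the actual manager code; the worker totals below are assumptions back-computed from the transaction_log above):

```python
# Illustrative only: assumed worker totals, back-computed from the log above.
worker_totals = {"memory": 16000, "cores": 256, "disk": 434058}  # MB, cores, MB
tasks_that_fit = 3  # three 5120 MB memory requests fit into ~16 GB

# One equal share of each resource per task:
share = {k: v // tasks_that_fit for k, v in worker_totals.items()}
print(share)  # {'memory': 5333, 'cores': 85, 'disk': 144686}
```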

dthain commented 10 months ago

A hypothesis:

I cannot think of any reason why cores or memory would be accounted for incorrectly. But disk is a little different. In the absence of other information, the manager gives each task 1/3 of the disk space known to be available at the worker. And disk space drops as items get loaded into the worker's cache. So, the worker could be declining to start more tasks because the perceived disk space remaining is less than the 1/3 automatic request.
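
A toy model of that scenario (the numbers are made up for illustration; this is not Work Queue's internal accounting code):

```python
# Hypothetical accounting sketch: the automatic per-task disk request is 1/3 of
# the disk the worker reported, but the cache eats into what is actually free.
disk_at_connect = 434058                      # MB reported by the worker (assumed)
per_task_request = disk_at_connect // 3       # ~144686 MB automatic request

cache_used = 20000                            # MB of cached input files (made up)
running_tasks = 2                             # two tasks already hold their share
disk_remaining = disk_at_connect - cache_used - running_tasks * per_task_request

# The third task never starts, because its automatic request no longer fits:
print(disk_remaining < per_task_request)      # True
```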

@btovar is the right person to work on this; however, we are just now ramping things down for the winter break. He probably won't be able to work on this until January.

However, I think an easy workaround that you could try from the Parsl side would be to manually specify the quantity of disk (and cores too) needed by the task: https://github.com/Parsl/parsl/blob/4838c73f7c352174bc0b62b17f6f57c59a3add2a/parsl/executors/workqueue/executor.py#L908

If that solves the specific problem, then I think we can drill into the general problem in January.

benclifford commented 10 months ago

Right now the resource specification looks like this from the parsl side:

1702576991.888290 2023-12-14 10:03:11 MainProcess-2098694 MainThread-140489703003968 parsl.executors.workqueue.executor:402 submit DEBUG: Got resource_specification: {'memory': 5120, 'cores': 1, 'disk': 0}

1702576991888399 2101488 TASK 1 WAITING 00038065_008_011 FIRST_RESOURCES {"memory":[5120,"MB"],"disk":[0,"MB"],"cores":[1,"cores"]}

and Work Queue fluffs those parameters up into some bigger number in the RUNNING line:

1702576992598877 2101488 TASK 51 RUNNING 128.55.64.46:37894 FIRST_RESOURCES {"memory":[5333,"MB"],"disk":[144686,"MB"],"gpus":[0,"gpus"],"cores":[85,"cores"]}

Are you suggesting setting a non-zero disk usage here and expecting WQ to not inflate the usage like it is doing at the moment?

dthain commented 9 months ago

Yes, I think the best workaround is for Parsl to give each task a small non-zero expected disk usage.
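
For example (untested, and assuming the usual Parsl parsl_resource_specification mechanism for the WorkQueueExecutor; the app name and values here are just placeholders):

```python
from parsl import python_app

@python_app
def my_task(x,
            parsl_resource_specification={'cores': 1,
                                          'memory': 5120,  # MB
                                          'disk': 1024}):  # MB: small but non-zero
    return x * 2
```

With an explicit non-zero disk value, the expectation is that the manager would honor it rather than fall back to an automatic share of the worker's disk.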

benclifford commented 9 months ago

I don't really understand what WQ is doing here for resource allocation. If I specify disk, cores, memory like this:

$ parsl-perf --time 1 --config parsl/tests/configs/workqueue_ex.py --resources '{"cores":1, "memory":2,"disk":3}'

then I see those three values requested in the Work Queue transaction_log (via the code path @dthain noted in his last comment):

1707304294110815 19397 TASK 1 WAITING app FIRST_RESOURCES {"memory":[2,"MB"],"disk":[3,"MB"],"cores":[1,"cores"]}

but when that task begins to run, despite the explicit disk specification, WQ looks like it has still allocated more disk than I requested:

1707304297048554 19397 TASK 1 RUNNING 172.17.0.2:37908  FIRST_RESOURCES {"memory":[1963,"MB"],"disk":[67,"MB"],"gpus":[0,"gpus"],"cores":[1,"cores"]}

It's increasing the disk allocation to 1/8th of what's available on the worker's current-directory filesystem (67MB is roughly 1/8th of the 539M reported by df), and likewise doing something similar with memory.

So I think (although I don't have a reproducible test case) the original problem is still happening here.