benclifford opened this issue 11 months ago
Hmm, I think your intuition is correct here. About the only reason a worker would delay running a task is insufficient available resources. However, the manager should generally not be sending workers more than it thinks they can run right away (unless "overcommit" is enabled via vine_tune).
Here is what I see in the transaction_log.
The tasks are submitted with an initial guess of 5GB memory and one core each:
1702576991888399 2101488 TASK 1 WAITING 00038065_008_011 FIRST_RESOURCES {"memory":[5120,"MB"],"disk":[0,"MB"],"cores":[1,"cores"]}
1702576992559713 2101488 TASK 2 WAITING 00038065_022_025 FIRST_RESOURCES {"memory":[5120,"MB"],"disk":[0,"MB"],"cores":[1,"cores"]}
1702576992560395 2101488 TASK 3 WAITING 00038065_026_029 FIRST_RESOURCES {"memory":[5120,"MB"],"disk":[0,"MB"],"cores":[1,"cores"]}
The manager assigns tasks to the worker. It looks to be in the round-up mode: since three 5 GB tasks fit into the 16 GB of RAM available, it divides the worker's available resources in three and gives each task 5333 MB of memory and 85 cores. So far so good.
1702576992598877 2101488 TASK 51 RUNNING 128.55.64.46:37894 FIRST_RESOURCES {"memory":[5333,"MB"],"disk":[144686,"MB"],"gpus":[0,"gpus"],"cores":[85,"cores"]}
1702576992599033 2101488 TRANSFER INPUT 48 1 0.000000 0.000005 exec_parsl_function.py
1702576992599444 2101488 TRANSFER INPUT 48 0 0.008276 0.000386 function
1702576992599867 2101488 TRANSFER INPUT 48 0 0.000127 0.000369 map
1702576992600385 2101488 TASK 48 RUNNING 128.55.64.46:37894 FIRST_RESOURCES {"memory":[5333,"MB"],"disk":[144686,"MB"],"gpus":[0,"gpus"],"cores":[85,"cores"]}
1702576992600554 2101488 TRANSFER INPUT 43 1 0.000000 0.000004 exec_parsl_function.py
1702576992600936 2101488 TRANSFER INPUT 43 0 0.008272 0.000358 function
1702576992601311 2101488 TRANSFER INPUT 43 0 0.000127 0.000324 map
1702576992601814 2101488 TASK 43 RUNNING 128.55.64.46:37894 FIRST_RESOURCES {"memory":[5333,"MB"],"disk":[144686,"MB"],"gpus":[0,"gpus"],"cores":[85,"cores"]}
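For concreteness, the arithmetic above works out like this. This is a back-of-the-envelope sketch, not the actual manager code; the worker totals are inferred from the RUNNING lines above (3 x 85 cores, 3 x 5333 MB, 3 x 144686 MB):

```python
# Sketch of the proportional split (illustrative only, not the manager's code).
# Worker totals inferred from the RUNNING lines: ~256 cores, ~16000 MB RAM,
# ~434 GB of disk in the current-directory filesystem.
worker = {"cores": 256, "memory": 16000, "disk": 434058}   # cores, MB, MB
tasks_that_fit = 3   # three 5120 MB memory requests fit into 16 GB

allocation = {k: v // tasks_that_fit for k, v in worker.items()}
print(allocation)    # {'cores': 85, 'memory': 5333, 'disk': 144686}
```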
Then, looking at the worker log, we can clearly see three tasks running; at some point it drops down to two and never comes back up to three.
A hypothesis:
I cannot think of any reason why cores or memory would be accounted for incorrectly, but disk is a little different. In the absence of other information, the manager gives each task 1/3 of the available disk space known to the worker, and that disk space drops as items get loaded into the worker's cache. So it could be declining to start more tasks because the perceived disk space remaining is less than the 1/3 automatic request.
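To make that concrete, here is a minimal sketch of the accounting I am imagining (the cache growth number is made up; this is not the worker's actual bookkeeping):

```python
# Illustration of the hypothesis only, not the worker's real accounting:
# each task is automatically assigned 1/3 of the disk the worker advertised,
# but the disk actually free shrinks as the cache fills, so at some point a
# third task's automatic request can no longer be satisfied.
worker_disk_mb = 434058
auto_request_mb = worker_disk_mb // 3      # 144686 MB per task
cache_mb = 10_000                          # made-up cache growth
running_tasks = 2

free_mb = worker_disk_mb - running_tasks * auto_request_mb - cache_mb
if free_mb < auto_request_mb:
    print(f"third task held back: {free_mb} MB free < {auto_request_mb} MB requested")
```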
@btovar is the right person to work on this; however, we are just now ramping things down for the winter break, so he probably won't be able to work on it until January.
However, I think an easy workaround that you could try from the Parsl side would be to manually specify the quantity of disk (and cores too) needed by the task: https://github.com/Parsl/parsl/blob/4838c73f7c352174bc0b62b17f6f57c59a3add2a/parsl/executors/workqueue/executor.py#L908
If that solves the specific problem, then I think we can drill into the general problem in January.
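Concretely, something along these lines from the Parsl side should do it. This is a sketch, assuming the usual parsl_resource_specification keyword that the WorkQueueExecutor honors; the app body and the values are placeholders:

```python
from parsl import python_app

# Assumes a Parsl config with a WorkQueueExecutor has already been loaded.
# The app must accept parsl_resource_specification as a keyword argument;
# memory and disk are in MB.
@python_app
def my_task(x, parsl_resource_specification={}):
    return x * 2

fut = my_task(10, parsl_resource_specification={"cores": 1,
                                                "memory": 5120,  # MB
                                                "disk": 1024})   # non-zero disk
```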
Right now the resource specification looks like this from the parsl side:
1702576991.888290 2023-12-14 10:03:11 MainProcess-2098694 MainThread-140489703003968 parsl.executors.workqueue.executor:402 submit DEBUG: Got resource_specification: {'memory': 5120, 'cores': 1, 'disk': 0}
1702576991888399 2101488 TASK 1 WAITING 00038065_008_011 FIRST_RESOURCES {"memory":[5120,"MB"],"disk":[0,"MB"],"cores":[1,"cores"]}
and Work Queue fluffs those parameters up into some bigger number in the RUNNING line:
1702576992598877 2101488 TASK 51 RUNNING 128.55.64.46:37894 FIRST_RESOURCES {"memory":[5333,"MB"],"disk":[144686,"MB"],"gpus":[0,"gpus"],"cores":[85,"cores"]}
Are you suggesting setting a non-zero disk usage here and expecting WQ to not inflate the usage like it is doing at the moment?
Yes, I think the best workaround is for Parsl to give each task a small non-zero expected disk usage.
I don't really understand what WQ is doing here for resource allocation. If I specify disk, cores, memory like this:
$ parsl-perf --time 1 --config parsl/tests/configs/workqueue_ex.py --resources '{"cores":1, "memory":2,"disk":3}'
then I see those three values reflected as requested in the Work Queue transaction_log (via the codepath @dthain noted in his last comment):
1707304294110815 19397 TASK 1 WAITING app FIRST_RESOURCES {"memory":[2,"MB"],"disk":[3,"MB"],"cores":[1,"cores"]}
but then when that task begins to run, despite my explicit disk specification, it looks like WQ has still allocated it more disk than I requested:
1707304297048554 19397 TASK 1 RUNNING 172.17.0.2:37908 FIRST_RESOURCES {"memory":[1963,"MB"],"disk":[67,"MB"],"gpus":[0,"gpus"],"cores":[1,"cores"]}
It's increasing the disk allocation to 1/8th of what's available on the worker's current-directory filesystem (67 MB ≈ 1/8th of the df-reported 539 M), and doing something similar with memory.
So I think (although I don't have a reproducible test case) the original problem is still happening here.
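In case it helps to take Parsl out of the picture, here is roughly the kind of explicit request I'd expect that codepath to be making; a minimal sketch using the Work Queue Python bindings (the port and command are arbitrary), so the WAITING vs RUNNING resources can be compared directly in the transaction_log:

```python
import work_queue as wq

# Minimal standalone check: submit one task with explicit cores/memory/disk
# and compare the WAITING vs RUNNING lines in the resulting transaction_log.
q = wq.WorkQueue(port=9123)

t = wq.Task("echo hello")
t.specify_cores(1)
t.specify_memory(2)   # MB
t.specify_disk(3)     # MB
q.submit(t)

while not q.empty():
    done = q.wait(5)
    if done:
        print("task finished with result", done.result)
```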
This is with WQ 7.6.1 with a user I am supporting at DESC.
Running some user application, the Work Queue manager assigns 3 tasks to a worker (3 being the number of tasks that fit into the worker's resources).
However, after a while the worker only ever has two tasks spawned.
In the worker debug output, I see a very long delay between a task arriving at the worker and the relevant process being launched: a new task process isn't launched until a previous process ends, with the task sitting on the worker for (in the case below) 3 minutes:
In the attached tarball, for example, submit_scripts/parsl.work_queue.block-0.1702576992.2495306.sh.err contains work_queue_worker -d all output, and one wrong-looking task is:

Other logs for WQ are in 009/work_queue.

I couldn't see anything immediately obvious about this in the git logs between version 7.6.1 and now.

jchiang-exhaustion.tar.gz