Closed benclifford closed 11 months ago
@benclifford thanks for raising this. Based on your explanation I used the same approach you used to debug in WQ here in RP.
I quote from the WQ ticket:
submit two tasks, one which depends on the other. observe a 1 second delay between task 1 completing on the worker and task 2 starting on the worker.
These are the numbers that I got when I submitted 2 tasks (Python functions) where task2
depends on task1
and i measure from the time task 1 was completed - the time task2 was launched from 2 endpoints:
This is my reproducer:
# Map function that returns double the input integer
@python_app
def app_double(x, parsl_resource_specification={}):
return x * 2
@python_app
def app_sum(inputs):
x = sum(inputs)
return x
t1 = app_double(5, parsl_resource_specification={'ranks':1})
t2 = app_sum([t1.result()])
print(t2.result())
1701275858.578 : master.000000.worker.0000 : 14556 : 139647799764736 : DEBUG : got result: [{'type': 'task', 'tmgr': 'tmgr.0000', 'uid': 'task.000000'
Task2 is was launched on the worker:
1701275858.699 : master.000000.worker.0000 : 14570 : 140609763399424 : DEBUG : applied: task.000001: 14665: dict_keys([14665])
Time difference:
1701275858.699 - 1701275858.578 = ~0.1209s
1701275858.608324 2023-11-29 11:37:38 MainProcess-11909 Thread-22-140435271960320 parsl.dataflow.dflow:552 _complete_task INFO: Task 0 completed (launched -> exec_done)
Task2 was launched on the worker:
1701275858.617444 2023-11-29 11:37:38 MainProcess-11909 MainThread-140437814228800 parsl.dataflow.dflow:728 launch_task INFO: Parsl task 1 try 0 launched on executor RPEX
Time difference:
1701275858.617444 - 1701275858.608324 = ~0.00911s
If my analysis is correct I do not think that this is the issue and we need to dig deeper.
@benclifford this PR #2988 improves the performance in general and should address the issue of this ticket.
Describe the bug In the pytest test suite, the test suite takes around 10 minutes to execute.
This is in comparison to around 11 seconds with the high throughput executor and 3 seconds with the local thread executor.
I haven't dug into the codebase but this is likely due to delays in task-end-to-next-task-starting that comes from some polling loop - that was the problem in htex and is a problem in the Work Queue executor (see #2801)
This is annoying in CI, but also could be a performance problem for end users running workflows which have a lot of "task-end-causes-next-task-to-launch" behaviour: for example, workflows with deep dependencies and short tasks; and workflows where the user has implemented their own steering mechanism (eg @WardLT 's work)
To Reproduce see CI
Expected behavior This should be "fast"
Environment CI
cc @AymenFJA @andre-merzky