Is your feature request related to a problem? Please describe.
Currently, Parsl places tasks onto workers without considering the expected duration of the task or the walltime remaining on the worker's block. As a result, tasks can be placed onto workers that are close to their walltime limit and then get killed by the batch scheduler. The current workaround is to set max_retries > 0, which allows Parsl to reschedule tasks that failed because they hit the walltime. However, this has a few issues:
Allocation wastage. For long-running tasks, repeated failures from hitting the walltime can be far more wasteful than simply terminating the block when no pending task would fit in the time left.
Tasks might need high retry limits because they are repeatedly terminated, and a suitable limit is usually known only after a few runs.
Describe the solution you'd like
Apps specify their expected walltime at invocation time via parsl_resource_specification.
HTEX managers report their pending (remaining) walltime.
HTEX schedules a task only onto a manager that has sufficient remaining walltime for that task.
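The three steps above could fit together roughly as in the following sketch. It is not Parsl's API: `ManagerInfo`, `pick_manager`, the `safety_margin` parameter, and the `"walltime"` key in the resource specification are all hypothetical names chosen for illustration, and the best-fit selection rule is just one possible policy.

```python
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ManagerInfo:
    """Hypothetical record of what a manager might report to the scheduler."""
    manager_id: str
    block_start: float     # epoch seconds at which the block started
    block_walltime: float  # total walltime granted to the block, in seconds
    idle_workers: int

    def remaining_walltime(self, now: float) -> float:
        """Seconds left before the batch scheduler kills the block."""
        return max(0.0, self.block_start + self.block_walltime - now)


def pick_manager(
    managers: List[ManagerInfo],
    task_walltime: float,
    now: Optional[float] = None,
    safety_margin: float = 60.0,
) -> Optional[ManagerInfo]:
    """Return a manager that can finish the task in time, or None if none can."""
    now = time.time() if now is None else now
    needed = task_walltime + safety_margin
    candidates = [
        m for m in managers
        if m.idle_workers > 0 and m.remaining_walltime(now) >= needed
    ]
    if not candidates:
        return None
    # Best fit: use the tightest block that still fits, keeping blocks
    # with more time left available for longer tasks.
    return min(candidates, key=lambda m: m.remaining_walltime(now))


managers = [
    ManagerInfo("m1", block_start=0.0, block_walltime=3600.0, idle_workers=2),
    ManagerInfo("m2", block_start=1800.0, block_walltime=3600.0, idle_workers=1),
]

# Hypothetical resource specification passed at app invocation time.
resource_spec = {"walltime": 1200.0}
chosen = pick_manager(managers, resource_spec["walltime"], now=3000.0)
```

When `pick_manager` returns `None`, the task would stay queued rather than being dispatched to a block that is about to expire; a block whose remaining walltime fits no pending task becomes a candidate for early termination.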
Additional context
This issue is related to a broader set of issues that all require better task scheduling in the HighThroughputExecutor.