Parsl / parsl

Parsl - a Python parallel scripting library
http://parsl-project.org
Apache License 2.0

Support for matching tasks to block based on remaining walltime #3249

Open yadudoc opened 8 months ago

yadudoc commented 8 months ago

Is your feature request related to a problem? Please describe.

Currently, Parsl places tasks onto workers without considering the expected duration of the task or the remaining walltime of the worker's block. As a result, tasks can be placed onto workers that are close to their walltime limit and then get killed by the batch scheduler. The current workaround is to set max_retries > 0, which allows Parsl to reschedule tasks that failed after hitting the walltime. However, this has a few issues:

  1. Allocation wastage. For long-running tasks, repeated failures from hitting the walltime can be far more wasteful than simply terminating the block when no remaining task would fit.
  2. Tasks might need high retry_limits because they are repeatedly terminated, and the right value is usually known only after a few runs.
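
To make the idea concrete, here is a minimal sketch of the kind of check being asked for: a task is only placed on a block whose remaining walltime covers the task's expected runtime. Everything in it is an assumption for illustration: `Block`, `pick_block`, the `margin` parameter, and the best-fit policy are hypothetical and are not part of Parsl or the HighThroughputExecutor.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Block:
    """Hypothetical view of a block; this is not a Parsl class."""
    block_id: str
    started_at: float  # start time in seconds
    walltime: float    # requested walltime in seconds

    def remaining(self, now: float) -> float:
        """Seconds left before the batch scheduler ends this block."""
        return max(0.0, self.started_at + self.walltime - now)


def pick_block(
    blocks: Sequence[Block],
    expected_runtime: float,
    now: float,
    margin: float = 0.0,
) -> Optional[Block]:
    """Return the block with the least remaining walltime that still fits
    the task (a best-fit policy, chosen here only as an assumption), or
    None if the task fits on no block."""
    needed = expected_runtime + margin
    candidates = [b for b in blocks if b.remaining(now) >= needed]
    return min(candidates, key=lambda b: b.remaining(now), default=None)


# Two blocks with a one-hour walltime, started 30 minutes apart.
blocks = [Block("a", started_at=0, walltime=3600),
          Block("b", started_at=1800, walltime=3600)]

short_task = pick_block(blocks, expected_runtime=300, now=3000)
long_task = pick_block(blocks, expected_runtime=1200, now=3000)
too_long = pick_block(blocks, expected_runtime=3000, now=3000)
```

With these inputs, the 300-second task goes to block "a" (600 seconds left), the 1200-second task goes to block "b" (2400 seconds left), and the 3000-second task fits nowhere, which is the case where terminating idle blocks may be cheaper than retrying.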

Describe the solution you'd like

Additional context

This issue is related to a broader set of issues that all require better task scheduling in the HighThroughputExecutor.

benclifford commented 1 month ago

crossref #3323