Parsl / parsl

Parsl - a Python parallel scripting library
http://parsl-project.org
Apache License 2.0

failing `process_worker_pool.py` results in silent hang (with local config) #578

Open benclifford opened 6 years ago

benclifford commented 6 years ago

Using parsl/tests/configs/htex_local.py, a failing process_worker_pool.py (for example, one that is not on the PATH, or that fails to start up) results in a silent hang instead of any diagnostic information or a workflow exit.

The same happens with parsl/tests/configs/exex_local.py and a failing mpi_worker_pool.py.
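For the "not on path" case with a local config, a pre-flight check along these lines could surface the problem before anything is submitted. This is only a sketch: `find_missing_worker_scripts` is a hypothetical helper, not part of Parsl, and it only inspects the PATH of the submitting process, so it says nothing about remote nodes or about a script that is found but fails on startup.

```python
import shutil


def find_missing_worker_scripts(script_names):
    """Return the script names that cannot be found on the local PATH."""
    return [name for name in script_names if shutil.which(name) is None]


# Script names are the ones mentioned in this issue; the check itself is
# an illustrative assumption, not something Parsl does.
missing = find_missing_worker_scripts(
    ["process_worker_pool.py", "mpi_worker_pool.py"]
)
if missing:
    print(f"worker pool scripts not found on PATH: {missing}")
else:
    print("all worker pool scripts found on PATH")
```

`shutil.which` returns `None` when no executable with that name is found, which is what the helper relies on.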

```
2018-10-17 21:32:57 parsl.dataflow.dflow:616 [INFO]  Task 0 submitted for App foo, waiting on tasks []
2018-10-17 21:32:57 parsl.executors.high_throughput.executor:336 [DEBUG]  Pushing function <function foo at 0x7f46e221dd90> to queue with args ()
2018-10-17 21:32:57 parsl.dataflow.dflow:419 [INFO]  Task 0 launched on executor htex_Local
2018-10-17 21:32:57 parsl.dataflow.dflow:637 [DEBUG]  Task 0 launched with AppFuture: <AppFuture at 0x7f46e221e160 state=pending>
```
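The last log line shows the future stuck in `state=pending`. As a caller-side workaround, the wait can be bounded so that a hang turns into an error. The sketch below assumes the AppFuture behaves like a standard `concurrent.futures.Future` (its repr suggests so); `result_or_diagnose` is a hypothetical helper, and a plain `Future` stands in for the stuck task so the example runs without Parsl.

```python
from concurrent.futures import Future, TimeoutError as FutureTimeoutError


def result_or_diagnose(future, timeout_s):
    """Wait up to timeout_s seconds, then fail loudly instead of hanging."""
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeoutError:
        raise RuntimeError(
            f"no result after {timeout_s}s; future is still {future!r}. "
            "Check that the worker pool actually started."
        ) from None


# Stand-in for the pending AppFuture: a future that nobody ever completes.
stuck = Future()
try:
    result_or_diagnose(stuck, timeout_s=0.1)
except RuntimeError as err:
    print(err)
```

This does not fix the underlying problem; it only makes the hang visible to the script that is waiting on the result.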

yadudoc commented 6 years ago

Here's what is going on: the state of a block before it has connected is known only to the provider. In the initial cut of the HighThroughputExecutor and the ExtremeScaleExecutor, we don't support scaling strategies that monitor the state of launched blocks. Beyond this specific situation, we don't have a good way of dealing with these failures in general. For instance, if we see a failure we just launch more blocks, because the provider is not smart enough to determine why the block failed.
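One possible direction, sketched under heavy assumptions: even without knowing why a block failed, a strategy could count failed blocks and stop relaunching once a limit is hit. Everything below is hypothetical (the state names, `BlockFailureGuard`, and the shape of the status dictionary); it is not the Parsl provider or strategy API.

```python
# Hypothetical block states; a real provider may report different ones.
PENDING, RUNNING, FAILED = "PENDING", "RUNNING", "FAILED"


class BlockFailureGuard:
    """Give up on relaunching once too many distinct blocks have failed."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failed_blocks = set()

    def observe(self, block_states):
        """Record provider-reported states (block id -> state).

        Raises RuntimeError once the number of distinct failed blocks
        reaches max_failures, instead of silently launching more.
        """
        for block_id, state in block_states.items():
            if state == FAILED:
                self.failed_blocks.add(block_id)
        if len(self.failed_blocks) >= self.max_failures:
            raise RuntimeError(
                f"{len(self.failed_blocks)} blocks failed "
                f"({sorted(self.failed_blocks)}); not launching more"
            )


guard = BlockFailureGuard(max_failures=2)
guard.observe({"block-0": FAILED, "block-1": PENDING})  # one failure so far
try:
    guard.observe({"block-1": FAILED})  # second failure reaches the limit
except RuntimeError as err:
    print(err)
```

The guard does not diagnose the cause of a failure; it only bounds the relaunch loop so the workflow can exit with an error rather than hang.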

benclifford commented 5 years ago

crossref #1035