ESA Landcover PBS-Jobqueue Prefect Deployment Limitation

Summary: Running ESA Landcover as a Prefect deployment with "hpc" backend (using PBSCluster with dask-jobqueue) with more than one process per node consistently fails. Potentially memory related, but actual error is tied to Prefect states.

Replicable: Yes

Error:

Created task run 'download-3a4df3bd-2' for task 'download'
15:59:27.829 | INFO    | Flow run 'tunneling-puma' - Submitted task run 'download-3a4df3bd-2' for execution.
15:59:46.917 | INFO    | Flow run 'tunneling-puma' - Task run download completed with 4 successes and no errors
15:59:46.921 | INFO    | Flow run 'tunneling-puma' - Running processing
15:59:47.052 | INFO    | Flow run 'tunneling-puma' - Created task run 'process-27395ae7-0' for task 'process'
15:59:47.059 | INFO    | Flow run 'tunneling-puma' - Submitted task run 'process-27395ae7-0' for execution.
15:59:47.086 | INFO    | Flow run 'tunneling-puma' - Created task run 'process-27395ae7-2' for task 'process'
15:59:47.092 | INFO    | Flow run 'tunneling-puma' - Submitted task run 'process-27395ae7-2' for execution.
15:59:47.098 | INFO    | Flow run 'tunneling-puma' - Created task run 'process-27395ae7-1' for task 'process'
15:59:47.104 | INFO    | Flow run 'tunneling-puma' - Submitted task run 'process-27395ae7-1' for execution.
15:59:47.110 | INFO    | Flow run 'tunneling-puma' - Created task run 'process-27395ae7-3' for task 'process'
15:59:47.117 | INFO    | Flow run 'tunneling-puma' - Submitted task run 'process-27395ae7-3' for execution.
16:20:04.150 | INFO    | Flow run 'tunneling-puma' - Task run process completed with 3 successes and no errors
16:20:04.506 | ERROR   | Flow run 'tunneling-puma' - Finished in state Failed('1/8 states are not final.')
16:20:04.508 | ERROR   | Flow run 'devious-gibbon' - Encountered exception during execution:
Traceback (most recent call last):
  File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/site-packages/prefect/engine.py", line 637, in orchestrate_flow_run
    result = await run_sync(flow_call)
  File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/site-packages/prefect/utilities/asyncutils.py", line 69, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(call, cancellable=True)
  File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/site-packages/anyio/to_thread.py", line 28, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(func, *args, cancellable=cancellable,
  File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 818, in run_sync_in_worker_thread
    return await future
  File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 754, in run
    result = context.run(func, *args)
  File "esa_landcover/flow.py", line 80, in esa_landcover
    class_instance.run(backend=backend, task_runner=task_runner, run_parallel=run_parallel, max_workers=max_workers, log_dir=timestamp_log_dir, cluster=cluster, cluster_kwargs=cluster_kwargs)
  File "/sciclone/scr20/smgoodman/tmpd_yqjnwcprefect/global_scripts/dataset.py", line 406, in run
    prefect_main_wrapper()
  File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/site-packages/prefect/flows.py", line 448, in __call__
    return enter_flow_run_engine_from_flow_call(
  File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/site-packages/prefect/engine.py", line 168, in enter_flow_run_engine_from_flow_call
    return run_async_from_worker_thread(begin_run)
  File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/site-packages/prefect/utilities/asyncutils.py", line 152, in run_async_from_worker_thread
    return anyio.from_thread.run(call)
  File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/site-packages/anyio/from_thread.py", line 35, in run
    return asynclib.run_async_from_thread(func, *args)
  File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 847, in run_async_from_thread
    return f.result()
  File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/site-packages/prefect/client/utilities.py", line 47, in with_injected_client
    return await fn(*args, **kwargs)
  File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/site-packages/prefect/engine.py", line 546, in create_and_begin_subflow_run
    return await terminal_state.result(fetch=True)
  File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/site-packages/prefect/states.py", line 88, in _get_state_result
    raise await get_state_exception(state)
  File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/site-packages/prefect/states.py", line 350, in get_state_exception
    raise ValueError(
ValueError: Failed state result was an iterable of states but none were failed.
16:20:04.651 | ERROR   | Flow run 'devious-gibbon' - Finished in state Failed('Flow run encountered an exception. ValueError: Failed state result was an iterable of states but none were failed.\n')

Failing condition:

Prefect deployment using "hpc" backend with >1 processes per node in a single job (only tested with 1 node per job)

Successful conditions:

in any local mode (non-deployment and non-hpc)
deployment using basic "dask" backend
deployment using "hpc" backend and only 1 process per job created by dask-jobqueue (1 process per node)

Under all working conditions an individual task (processing 1 year of data) requires <2gb of RAM, yet when run in the failing condition each task approaches around 10gb. Even under failing conditions, many tasks seem to succeed. In a test of 4 years of data being run on 2 nodes (2 processes per node), 3 out of 4 succeed before the 4th appears to succeed but fails to return a proper state object from the task to the Prefect scheduler.

Not tested:

specifying dask-jobqueue jobs for more than 1 node and using multiple processes (e.g., 2 nodes, 2 processes)

If somehow this is isolated to this dataset, I am okay leaving this as a low-priority issue since it does work under some conditions and we will not need to run this full dataset processing often (if ever again). We will keep an eye on this as additional datasets are implemented using Prefect deployments on the HPC.

aiddata / geo-datasets

ESA Landcover PBS-Jobqueue Prefect Deployment Limitation #142