Summary:
Running ESA Landcover as a Prefect deployment with the "hpc" backend (using PBSCluster from dask-jobqueue) with more than one process per node consistently fails. Potentially memory-related, but the actual error is tied to Prefect states.
Replicable:
Yes
Error:
Created task run 'download-3a4df3bd-2' for task 'download'
15:59:27.829 | INFO | Flow run 'tunneling-puma' - Submitted task run 'download-3a4df3bd-2' for execution.
15:59:46.917 | INFO | Flow run 'tunneling-puma' - Task run download completed with 4 successes and no errors
15:59:46.921 | INFO | Flow run 'tunneling-puma' - Running processing
15:59:47.052 | INFO | Flow run 'tunneling-puma' - Created task run 'process-27395ae7-0' for task 'process'
15:59:47.059 | INFO | Flow run 'tunneling-puma' - Submitted task run 'process-27395ae7-0' for execution.
15:59:47.086 | INFO | Flow run 'tunneling-puma' - Created task run 'process-27395ae7-2' for task 'process'
15:59:47.092 | INFO | Flow run 'tunneling-puma' - Submitted task run 'process-27395ae7-2' for execution.
15:59:47.098 | INFO | Flow run 'tunneling-puma' - Created task run 'process-27395ae7-1' for task 'process'
15:59:47.104 | INFO | Flow run 'tunneling-puma' - Submitted task run 'process-27395ae7-1' for execution.
15:59:47.110 | INFO | Flow run 'tunneling-puma' - Created task run 'process-27395ae7-3' for task 'process'
15:59:47.117 | INFO | Flow run 'tunneling-puma' - Submitted task run 'process-27395ae7-3' for execution.
16:20:04.150 | INFO | Flow run 'tunneling-puma' - Task run process completed with 3 successes and no errors
16:20:04.506 | ERROR | Flow run 'tunneling-puma' - Finished in state Failed('1/8 states are not final.')
16:20:04.508 | ERROR | Flow run 'devious-gibbon' - Encountered exception during execution:
Traceback (most recent call last):
File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/site-packages/prefect/engine.py", line 637, in orchestrate_flow_run
result = await run_sync(flow_call)
File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/site-packages/prefect/utilities/asyncutils.py", line 69, in run_sync_in_worker_thread
return await anyio.to_thread.run_sync(call, cancellable=True)
File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/site-packages/anyio/to_thread.py", line 28, in run_sync
return await get_asynclib().run_sync_in_worker_thread(func, *args, cancellable=cancellable,
File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 818, in run_sync_in_worker_thread
return await future
File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 754, in run
result = context.run(func, *args)
File "esa_landcover/flow.py", line 80, in esa_landcover
class_instance.run(backend=backend, task_runner=task_runner, run_parallel=run_parallel, max_workers=max_workers, log_dir=timestamp_log_dir, cluster=cluster, cluster_kwargs=cluster_kwargs)
File "/sciclone/scr20/smgoodman/tmpd_yqjnwcprefect/global_scripts/dataset.py", line 406, in run
prefect_main_wrapper()
File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/site-packages/prefect/flows.py", line 448, in __call__
return enter_flow_run_engine_from_flow_call(
File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/site-packages/prefect/engine.py", line 168, in enter_flow_run_engine_from_flow_call
return run_async_from_worker_thread(begin_run)
File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/site-packages/prefect/utilities/asyncutils.py", line 152, in run_async_from_worker_thread
return anyio.from_thread.run(call)
File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/site-packages/anyio/from_thread.py", line 35, in run
return asynclib.run_async_from_thread(func, *args)
File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 847, in run_async_from_thread
return f.result()
File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/concurrent/futures/_base.py", line 444, in result
return self.__get_result()
File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/site-packages/prefect/client/utilities.py", line 47, in with_injected_client
return await fn(*args, **kwargs)
File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/site-packages/prefect/engine.py", line 546, in create_and_begin_subflow_run
return await terminal_state.result(fetch=True)
File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/site-packages/prefect/states.py", line 88, in _get_state_result
raise await get_state_exception(state)
File "/sciclone/home20/smgoodman/.conda/envs/geodata38/lib/python3.8/site-packages/prefect/states.py", line 350, in get_state_exception
raise ValueError(
ValueError: Failed state result was an iterable of states but none were failed.
16:20:04.651 | ERROR | Flow run 'devious-gibbon' - Finished in state Failed('Flow run encountered an exception. ValueError: Failed state result was an iterable of states but none were failed.\n')
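For context on that ValueError: the flow collects the states of its submitted tasks and returns them, and Prefect derives the flow's terminal state from that iterable. A minimal sketch of the pattern (task/flow names here are illustrative, not the actual dataset code):

```python
from prefect import flow, task

@task
def process(year):
    ...  # process one year of data

@flow
def run_years(years):
    futures = [process.submit(y) for y in years]
    # Prefect derives the flow's terminal state from this iterable of states.
    # If a worker dies before reporting a task's final state (e.g. OOM-killed),
    # the flow fails with "N/M states are not final" even though no state is
    # actually Failed, which is exactly the ValueError raised above.
    return [f.wait() for f in futures]
```

This would be consistent with the memory observation below: a worker process killed mid-task never reports a final state, so the task ends up neither Completed nor Failed.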
Failing condition:
Prefect deployment using the "hpc" backend with more than one process per node in a single dask-jobqueue job (only tested with 1 node per job)
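For reference, the failing condition corresponds roughly to a task runner configuration like the sketch below; the queue, walltime, and resource values are placeholders rather than the exact values from our runs:

```python
from prefect_dask import DaskTaskRunner

# Hypothetical values for illustration only.
# Failing: processes=2 places two dask worker processes on each node.
task_runner = DaskTaskRunner(
    cluster_class="dask_jobqueue.PBSCluster",
    cluster_kwargs={
        "cores": 2,          # cores per PBS job (one node per job)
        "processes": 2,      # >1 worker process per node -> consistently fails
        "memory": "30GB",    # memory per PBS job, split across processes
        "queue": "example",  # placeholder queue name
        "walltime": "04:00:00",
    },
)
# Working variant: identical except "processes": 1 (one worker process per node).
```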
Successful conditions:
in any local mode (non-deployment and non-hpc)
deployment using basic "dask" backend
deployment using "hpc" backend and only 1 process per job created by dask-jobqueue (1 process per node)
Under all working conditions an individual task (processing 1 year of data) requires under 2 GB of RAM, yet in the failing condition each task approaches roughly 10 GB. Even under the failing condition, most tasks succeed: in a test processing 4 years of data across 2 nodes (2 processes per node), 3 of the 4 tasks succeeded, and the 4th appeared to complete its work but never returned a proper state object to the Prefect scheduler.
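The per-process memory numbers above can be spot-checked from the dask client with something like this (a sketch; assumes psutil is installed in the worker environment and `cluster` is the PBSCluster created by the backend):

```python
import psutil
from dask.distributed import Client

def worker_rss_gb():
    # Resident set size of the current worker process, in GB.
    return psutil.Process().memory_info().rss / 1e9

client = Client(cluster)          # cluster: the PBSCluster from dask-jobqueue
print(client.run(worker_rss_gb))  # maps each worker's address to its RSS in GB
```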
Not tested:
specifying dask-jobqueue jobs that span more than 1 node while also using multiple processes (e.g., 2 nodes and 2 processes per job)
If this is somehow isolated to this dataset, I am okay leaving it as a low-priority issue, since the processing does work under some conditions and we will not need to run the full dataset often (if ever again). We will keep an eye on this as additional datasets are implemented using Prefect deployments on the HPC.