fractal-analytics-platform / fractal-server

Fractal backend
https://fractal-analytics-platform.github.io/fractal-server/
BSD 3-Clause "New" or "Revised" License

Failing gracefully when user fileshare runs out of space #492

Closed jluethi closed 1 year ago

jluethi commented 1 year ago

Currently, when the user fileshare runs out of space, the tasks fail (with errors like the one below), but the server keeps the job status at "running".

It's possible that the user's jobs aren't writing the SLURM output files correctly anymore in this state. Is there a way we can detect this and have the server also mark the pipeline as failed, with a relevant error? We see the following printed to the console by the running server:

Exception in thread Thread-234:
Traceback (most recent call last):
  File "/data/homes/fractal/.conda/envs/fractal-server-1.0.2/lib/python3.8/site-packages/fractal_server/app/runner/_slurm/executor.py", line 449, in _completion
    fut.set_result(output)
  File "/data/homes/fractal/.conda/envs/fractal-server-1.0.2/lib/python3.8/concurrent/futures/_base.py", line 532, in set_result
    raise InvalidStateError('{}: {!r}'.format(self._state, self))
concurrent.futures._base.InvalidStateError: CANCELLED: <Future at 0x7f7c91fa0490 state=cancelled>

Example error from the task when it runs out of storage space:

File "/data/homes/fractal/joel/fractal_v1/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.7.0/venv/lib/python3.8/site-packages/zarr/core.py", line 1772, in _set_basic_selection_nd
self._set_selection(indexer, value, fields=fields)
File "/data/homes/fractal/joel/fractal_v1/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.7.0/venv/lib/python3.8/site-packages/zarr/core.py", line 1824, in _set_selection
self._chunk_setitem(chunk_coords, chunk_selection, chunk_value, fields=fields)
File "/data/homes/fractal/joel/fractal_v1/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.7.0/venv/lib/python3.8/site-packages/zarr/core.py", line 2089, in _chunk_setitem
self._chunk_setitem_nosync(chunk_coords, chunk_selection, value,
File "/data/homes/fractal/joel/fractal_v1/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.7.0/venv/lib/python3.8/site-packages/zarr/core.py", line 2100, in _chunk_setitem_nosync
self.chunk_store[ckey] = self._encode_chunk(cdata)
File "/data/homes/fractal/joel/fractal_v1/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.7.0/venv/lib/python3.8/site-packages/zarr/storage.py", line 1113, in setitem
self._tofile(value, temp_path)
File "/data/homes/fractal/joel/fractal_v1/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.7.0/venv/lib/python3.8/site-packages/zarr/storage.py", line 1074, in _tofile
f.write(a)
OSError: [Errno 28] No space left on device
tcompa commented 1 year ago

Thanks, we'll look into this. In principle, when the clusterfutures remote worker (the one that the user runs on the computing node, via SLURM) fails to write its output pickle file and (after a certain time interval) the job is no longer listed in the SLURM squeue, we raise a JobExecutionError. But this is clearly not happening here.
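
For reference, here is a minimal sketch of that check; the names below (check_remote_job, the JobExecutionError constructor signature, grace_time) are illustrative placeholders, not the actual fractal-server implementation:

import subprocess
import time
from pathlib import Path


class JobExecutionError(RuntimeError):
    pass


def check_remote_job(slurm_job_id: str, output_pickle: Path, grace_time: float = 4.0) -> None:
    # Placeholder sketch: raise JobExecutionError if the remote worker produced no
    # output pickle and the job has already left the SLURM queue.
    res = subprocess.run(
        ["squeue", "--noheader", "--jobs", slurm_job_id],
        capture_output=True,
        text=True,
    )
    if res.stdout.strip():
        return  # job is still pending/running; nothing to decide yet
    # Give a slow or overloaded NFS filesystem a short grace time to expose the file
    deadline = time.time() + grace_time
    while time.time() < deadline:
        if output_pickle.exists():
            return  # worker wrote its result; normal completion path
        time.sleep(0.5)
    raise JobExecutionError(
        f"Output pickle {output_pickle} not found after job {slurm_job_id} left the queue"
    )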

The most likely reason is that extra care is needed when using executor.map(function, ...) with a function that raises exceptions. Refs:


Most likely related: https://github.com/fractal-analytics-platform/fractal-server/issues/482.
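
A minimal reproduction of this failure mode with plain concurrent.futures (rather than the FractalSlurmExecutor itself): calling set_result on a future that was already cancelled, e.g. because a sibling map() task failed and the remaining futures were torn down, raises exactly the InvalidStateError shown in the server console above.

from concurrent.futures import Future, InvalidStateError  # InvalidStateError: Python 3.8+

fut = Future()
fut.cancel()  # e.g. the map() was torn down after a sibling task failed
try:
    fut.set_result("output")  # what a completion callback may do unconditionally
except InvalidStateError as err:
    print(err)  # CANCELLED: <Future at 0x... state=cancelled>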

@jluethi: in the current issue, can you confirm that the failing task was a parallel one?

jluethi commented 1 year ago

Hey Tommaso! Yes, it was a parallel task. Specifically, the Yokogawa to OME-Zarr one.

tcompa commented 1 year ago

This should be fixed by #497, available in fractal-server 1.0.8. If we can reproduce the error, let's verify that it is now handled correctly.

mfranzon commented 1 year ago

@jluethi have you encountered this error again? It should be handled correctly by #497. I think we can close this for the moment and re-open if necessary.

jluethi commented 1 year ago

I'll try to reproduce it by storing a huge OME-Zarr file on my small home share. Will report back when I know how that went :)

jluethi commented 1 year ago

Ok, we fail better now: the workflow is actually tracked as "failed", with the following error message:

JOB ERROR:
TRACEBACK:
JobExecutionError

COMMAND:
Content of /net/nfs4/pelkmanslab-fileserver-fractal/data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-logs/proj_0000026_wf_0000025_job_0000024/1_batch_000000_slurm_submit.sbatch:
#!/bin/sh
#SBATCH --partition=main
#SBATCH --ntasks=7
#SBATCH --cpus-per-task=1
#SBATCH --mem=28000M
#SBATCH --job-name=Convert_Yokogawa_to_OME-Zarr
#SBATCH --err=/net/nfs4/pelkmanslab-fileserver-jluethi/data/homes/jluethi/.fractal-cache/20230516_114559_proj_0000026_wf_0000025_job_0000024/1_batch_000000_slurm_%j.err
#SBATCH --out=/net/nfs4/pelkmanslab-fileserver-jluethi/data/homes/jluethi/.fractal-cache/20230516_114559_proj_0000026_wf_0000025_job_0000024/1_batch_000000_slurm_%j.out
export CELLPOSE_LOCAL_MODELS_PATH=/data/homes/jluethi/.fractal-cache/CELLPOSE_LOCAL_MODELS_PATH
export NUMBA_CACHE_DIR=/data/homes/jluethi/.fractal-cache/NUMBA_CACHE_DIR
export NAPARI_CONFIG=/data/homes/jluethi/.fractal-cache/napari_config.json
export XDG_CONFIG_HOME=/data/homes/jluethi/.fractal-cache/XDG_CONFIG
export XDG_CACHE_HOME=/data/homes/jluethi/.fractal-cache/XDG

srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=4000MB /data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-server-1.2.4/bin/python -m fractal_server.app.runner._slurm.remote --input-file /net/nfs4/pelkmanslab-fileserver-fractal/data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-logs/proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_B_03_0__in_yBYOTEKOSAWC2p3Lya03SQ9A7Su72Hmd.pickle --output-file /net/nfs4/pelkmanslab-fileserver-jluethi/data/homes/jluethi/.fractal-cache/20230516_114559_proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_B_03_0__out_yBYOTEKOSAWC2p3Lya03SQ9A7Su72Hmd.pickle &
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=4000MB /data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-server-1.2.4/bin/python -m fractal_server.app.runner._slurm.remote --input-file /net/nfs4/pelkmanslab-fileserver-fractal/data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-logs/proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_B_05_0__in_RurEmuW8U9P5dSKEEVjqchvxNNmH6pZi.pickle --output-file /net/nfs4/pelkmanslab-fileserver-jluethi/data/homes/jluethi/.fractal-cache/20230516_114559_proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_B_05_0__out_RurEmuW8U9P5dSKEEVjqchvxNNmH6pZi.pickle &
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=4000MB /data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-server-1.2.4/bin/python -m fractal_server.app.runner._slurm.remote --input-file /net/nfs4/pelkmanslab-fileserver-fractal/data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-logs/proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_B_09_0__in_bIXdPM9VQ131LvqvLyy5xRnMmuLuwQ1G.pickle --output-file /net/nfs4/pelkmanslab-fileserver-jluethi/data/homes/jluethi/.fractal-cache/20230516_114559_proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_B_09_0__out_bIXdPM9VQ131LvqvLyy5xRnMmuLuwQ1G.pickle &
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=4000MB /data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-server-1.2.4/bin/python -m fractal_server.app.runner._slurm.remote --input-file /net/nfs4/pelkmanslab-fileserver-fractal/data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-logs/proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_B_11_0__in_DAUqlLdjqOQ8mCj3USCxsCvwYGgcJXMA.pickle --output-file /net/nfs4/pelkmanslab-fileserver-jluethi/data/homes/jluethi/.fractal-cache/20230516_114559_proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_B_11_0__out_DAUqlLdjqOQ8mCj3USCxsCvwYGgcJXMA.pickle &
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=4000MB /data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-server-1.2.4/bin/python -m fractal_server.app.runner._slurm.remote --input-file /net/nfs4/pelkmanslab-fileserver-fractal/data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-logs/proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_C_04_0__in_P5EqiowLi057IIJkWno8KOrVdUZKe3LX.pickle --output-file /net/nfs4/pelkmanslab-fileserver-jluethi/data/homes/jluethi/.fractal-cache/20230516_114559_proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_C_04_0__out_P5EqiowLi057IIJkWno8KOrVdUZKe3LX.pickle &
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=4000MB /data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-server-1.2.4/bin/python -m fractal_server.app.runner._slurm.remote --input-file /net/nfs4/pelkmanslab-fileserver-fractal/data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-logs/proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_C_06_0__in_GL02pdhq6vD6utH1Qi1RlpyjEyAZ4MKS.pickle --output-file /net/nfs4/pelkmanslab-fileserver-jluethi/data/homes/jluethi/.fractal-cache/20230516_114559_proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_C_06_0__out_GL02pdhq6vD6utH1Qi1RlpyjEyAZ4MKS.pickle &
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=4000MB /data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-server-1.2.4/bin/python -m fractal_server.app.runner._slurm.remote --input-file /net/nfs4/pelkmanslab-fileserver-fractal/data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-logs/proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_C_08_0__in_EyPYbLufBacsRQhC5GzpU2THNu8CZuge.pickle --output-file /net/nfs4/pelkmanslab-fileserver-jluethi/data/homes/jluethi/.fractal-cache/20230516_114559_proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_C_08_0__out_EyPYbLufBacsRQhC5GzpU2THNu8CZuge.pickle &
wait

STDOUT:
File /net/nfs4/pelkmanslab-fileserver-fractal/data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-logs/proj_0000026_wf_0000025_job_0000024/1_batch_000000_slurm_9647792.out is empty

STDERR:
File /net/nfs4/pelkmanslab-fileserver-fractal/data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-logs/proj_0000026_wf_0000025_job_0000024/1_batch_000000_slurm_9647792.err is empty

ADDITIONAL INFO:
Output pickle file of the FractalSlurmExecutor job not found.
Expected file path: /net/nfs4/pelkmanslab-fileserver-fractal/data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-logs/proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_B_03_0__out_yBYOTEKOSAWC2p3Lya03SQ9A7Su72Hmd.pickle.
Here are some possible reasons:
1. The SLURM job was scancel-ed, either by the user or due to an error (e.g. an out-of-memory or timeout error). Note that if the scancel took place before the job started running, the SLURM out/err files will be empty.
2. Some error occurred upon writing the file to disk (e.g. due to an overloaded NFS filesystem). Note that the server configuration has FRACTAL_SLURM_OUTPUT_FILE_GRACE_TIME=4 seconds.

It doesn't mention the out-of-space case here, though; it only suggests two other scenarios. We could add it to the additional-info reasons as a third one (or fold it into reason 2, since a full filesystem is in a sense just an overloaded one). If we add it there, then I'd say we can close this issue :)
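
For concreteness, one possible shape for that third reason and the corresponding check; the wording and the helper below are purely illustrative, not actual fractal-server code:

import errno

# Hypothetical extra bullet for the ADDITIONAL INFO section of the error message
REASON_3 = (
    "3. The filesystem holding the output files is full "
    "(OSError: [Errno 28] No space left on device), so the remote worker "
    "could not write its output pickle file."
)


def is_out_of_space(exc: OSError) -> bool:
    # Illustrative check: did this OSError come from a full filesystem?
    return exc.errno == errno.ENOSPC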