Closed: jluethi closed this issue 1 year ago
Thanks, we'll look into this. In principle, when the clusterfuture remote worker (the one that the user runs on the computing node, via SLURM) fails to write its output pickle file and, after a certain time interval, the job is no longer in the SLURM squeue, we raise a JobExecutionError. But that is clearly not happening here.
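The decision described above can be sketched as follows (a simplified helper with invented names, not fractal-server's actual code):

```python
def classify_job(output_pickle_exists: bool, job_in_squeue: bool) -> str:
    """Hypothetical sketch of the check described above.

    A missing output pickle only counts as a failure once the job has
    also disappeared from the SLURM queue; while it is still queued or
    running, the output may simply not have been written yet.
    """
    if output_pickle_exists:
        return "done"
    if job_in_squeue:
        return "running"  # keep waiting: the job may still write its output
    return "JobExecutionError"
```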
The most likely reason is that extra care is needed when using executor.map(function, ...) with a function that raises exceptions. Refs:
Most likely related: https://github.com/fractal-analytics-platform/fractal-server/issues/482.
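For reference, the standard-library behavior in question: Executor.map from concurrent.futures does not raise at submission time; a worker's exception only resurfaces when the corresponding result is consumed from the returned iterator. A minimal illustration with a plain ThreadPoolExecutor (not the FractalSlurmExecutor):

```python
from concurrent.futures import ThreadPoolExecutor

def task(x):
    # A task that fails for one particular input.
    if x == 2:
        raise ValueError(f"task failed for x={x}")
    return x * 2

collected, error = [], None
with ThreadPoolExecutor(max_workers=3) as executor:
    results = executor.map(task, [1, 2, 3])
    # map() returns immediately; the exception is only re-raised here,
    # when the failing element is consumed from the iterator.
    try:
        for r in results:
            collected.append(r)
    except ValueError as exc:
        error = exc

print(collected, error)  # [2] task failed for x=2
```

So if the caller never iterates the results (or stops early), failures can silently go unnoticed.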
@jluethi: in the current issue, can you confirm that the failing task was a parallel one?
Hey Tommaso! Yes, it was a parallel task. Specifically, the Yokogawa to OME-Zarr one
This should be fixed by #497, available in fractal-server 1.0.8. If we can reproduce the error, let's verify that it is now handled correctly.
@jluethi have you encountered this error again since? It should be handled correctly with #497. I think we can close for the moment and re-open if necessary.
I'll try to reproduce it by storing a huge OME-Zarr file on my small home share. Will report back when I know how that went :)
Ok, we fail better now. The workflow is actually tracked as "failed" now, with the following error message:
JOB ERROR:
TRACEBACK:
JobExecutionError
COMMAND:
Content of /net/nfs4/pelkmanslab-fileserver-fractal/data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-logs/proj_0000026_wf_0000025_job_0000024/1_batch_000000_slurm_submit.sbatch:
#!/bin/sh
#SBATCH --partition=main
#SBATCH --ntasks=7
#SBATCH --cpus-per-task=1
#SBATCH --mem=28000M
#SBATCH --job-name=Convert_Yokogawa_to_OME-Zarr
#SBATCH --err=/net/nfs4/pelkmanslab-fileserver-jluethi/data/homes/jluethi/.fractal-cache/20230516_114559_proj_0000026_wf_0000025_job_0000024/1_batch_000000_slurm_%j.err
#SBATCH --out=/net/nfs4/pelkmanslab-fileserver-jluethi/data/homes/jluethi/.fractal-cache/20230516_114559_proj_0000026_wf_0000025_job_0000024/1_batch_000000_slurm_%j.out
export CELLPOSE_LOCAL_MODELS_PATH=/data/homes/jluethi/.fractal-cache/CELLPOSE_LOCAL_MODELS_PATH
export NUMBA_CACHE_DIR=/data/homes/jluethi/.fractal-cache/NUMBA_CACHE_DIR
export NAPARI_CONFIG=/data/homes/jluethi/.fractal-cache/napari_config.json
export XDG_CONFIG_HOME=/data/homes/jluethi/.fractal-cache/XDG_CONFIG
export XDG_CACHE_HOME=/data/homes/jluethi/.fractal-cache/XDG
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=4000MB /data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-server-1.2.4/bin/python -m fractal_server.app.runner._slurm.remote --input-file /net/nfs4/pelkmanslab-fileserver-fractal/data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-logs/proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_B_03_0__in_yBYOTEKOSAWC2p3Lya03SQ9A7Su72Hmd.pickle --output-file /net/nfs4/pelkmanslab-fileserver-jluethi/data/homes/jluethi/.fractal-cache/20230516_114559_proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_B_03_0__out_yBYOTEKOSAWC2p3Lya03SQ9A7Su72Hmd.pickle &
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=4000MB /data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-server-1.2.4/bin/python -m fractal_server.app.runner._slurm.remote --input-file /net/nfs4/pelkmanslab-fileserver-fractal/data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-logs/proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_B_05_0__in_RurEmuW8U9P5dSKEEVjqchvxNNmH6pZi.pickle --output-file /net/nfs4/pelkmanslab-fileserver-jluethi/data/homes/jluethi/.fractal-cache/20230516_114559_proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_B_05_0__out_RurEmuW8U9P5dSKEEVjqchvxNNmH6pZi.pickle &
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=4000MB /data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-server-1.2.4/bin/python -m fractal_server.app.runner._slurm.remote --input-file /net/nfs4/pelkmanslab-fileserver-fractal/data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-logs/proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_B_09_0__in_bIXdPM9VQ131LvqvLyy5xRnMmuLuwQ1G.pickle --output-file /net/nfs4/pelkmanslab-fileserver-jluethi/data/homes/jluethi/.fractal-cache/20230516_114559_proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_B_09_0__out_bIXdPM9VQ131LvqvLyy5xRnMmuLuwQ1G.pickle &
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=4000MB /data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-server-1.2.4/bin/python -m fractal_server.app.runner._slurm.remote --input-file /net/nfs4/pelkmanslab-fileserver-fractal/data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-logs/proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_B_11_0__in_DAUqlLdjqOQ8mCj3USCxsCvwYGgcJXMA.pickle --output-file /net/nfs4/pelkmanslab-fileserver-jluethi/data/homes/jluethi/.fractal-cache/20230516_114559_proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_B_11_0__out_DAUqlLdjqOQ8mCj3USCxsCvwYGgcJXMA.pickle &
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=4000MB /data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-server-1.2.4/bin/python -m fractal_server.app.runner._slurm.remote --input-file /net/nfs4/pelkmanslab-fileserver-fractal/data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-logs/proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_C_04_0__in_P5EqiowLi057IIJkWno8KOrVdUZKe3LX.pickle --output-file /net/nfs4/pelkmanslab-fileserver-jluethi/data/homes/jluethi/.fractal-cache/20230516_114559_proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_C_04_0__out_P5EqiowLi057IIJkWno8KOrVdUZKe3LX.pickle &
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=4000MB /data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-server-1.2.4/bin/python -m fractal_server.app.runner._slurm.remote --input-file /net/nfs4/pelkmanslab-fileserver-fractal/data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-logs/proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_C_06_0__in_GL02pdhq6vD6utH1Qi1RlpyjEyAZ4MKS.pickle --output-file /net/nfs4/pelkmanslab-fileserver-jluethi/data/homes/jluethi/.fractal-cache/20230516_114559_proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_C_06_0__out_GL02pdhq6vD6utH1Qi1RlpyjEyAZ4MKS.pickle &
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=4000MB /data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-server-1.2.4/bin/python -m fractal_server.app.runner._slurm.remote --input-file /net/nfs4/pelkmanslab-fileserver-fractal/data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-logs/proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_C_08_0__in_EyPYbLufBacsRQhC5GzpU2THNu8CZuge.pickle --output-file /net/nfs4/pelkmanslab-fileserver-jluethi/data/homes/jluethi/.fractal-cache/20230516_114559_proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_C_08_0__out_EyPYbLufBacsRQhC5GzpU2THNu8CZuge.pickle &
wait
STDOUT:
File /net/nfs4/pelkmanslab-fileserver-fractal/data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-logs/proj_0000026_wf_0000025_job_0000024/1_batch_000000_slurm_9647792.out is empty
STDERR:
File /net/nfs4/pelkmanslab-fileserver-fractal/data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-logs/proj_0000026_wf_0000025_job_0000024/1_batch_000000_slurm_9647792.err is empty
ADDITIONAL INFO:
Output pickle file of the FractalSlurmExecutor job not found.
Expected file path: /net/nfs4/pelkmanslab-fileserver-fractal/data/homes/fractal/deployment_joel/fractal-demos/examples/server/fractal-logs/proj_0000026_wf_0000025_job_0000024/1_par_20200812-CardiomyocyteDifferentiation14-Cycle1_zarr_B_03_0__out_yBYOTEKOSAWC2p3Lya03SQ9A7Su72Hmd.pickle.
Here are some possible reasons:
1. The SLURM job was scancel-ed, either by the user or due to an error (e.g. an out-of-memory or timeout error). Note that if the scancel took place before the job started running, the SLURM out/err files will be empty.
2. Some error occurred upon writing the file to disk (e.g. due to an overloaded NFS filesystem). Note that the server configuration has FRACTAL_SLURM_OUTPUT_FILE_GRACE_TIME=4 seconds.
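The missing-pickle check together with the grace time mentioned above could be approximated like this. This is a hypothetical sketch: FRACTAL_SLURM_OUTPUT_FILE_GRACE_TIME and the error wording come from the message above, but the function name and polling logic are invented here:

```python
import time
from pathlib import Path

class JobExecutionError(RuntimeError):
    """Raised when a SLURM job ends without producing its output pickle."""

def wait_for_output_pickle(path: Path, grace_time: float = 4.0,
                           poll: float = 0.5) -> Path:
    # Poll for the output pickle for up to `grace_time` seconds, to
    # tolerate latency between the worker node and the server (e.g. on
    # an NFS filesystem). Always check at least once.
    deadline = time.monotonic() + grace_time
    while True:
        if path.exists():
            return path
        if time.monotonic() >= deadline:
            break
        time.sleep(poll)
    raise JobExecutionError(
        "Output pickle file of the FractalSlurmExecutor job not found.\n"
        f"Expected file path: {path}"
    )
```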
It doesn't mention the out-of-space scenario here, though; it only suggests the two other scenarios. We could add it to the additional-info reasons as a third option (or fold it into option 2; a full filesystem is "overloaded" in the sense that it's simply full). If we add it there, then I'd say we can close this issue :)
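If we wanted to detect the disk-full case directly rather than just documenting it, a server-side check along these lines might work. This is a hypothetical helper built on the standard library's shutil.disk_usage, not something fractal-server currently does:

```python
import shutil

def assert_free_space(path: str, min_free_bytes: int = 100 * 1024**2) -> int:
    """Raise if the filesystem holding `path` has too little free space.

    Returns the number of free bytes otherwise, so callers can also log it.
    """
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        raise RuntimeError(
            f"Filesystem of {path} has only {free} bytes free; "
            "job output files may fail to be written."
        )
    return free
```

Such a check could run before submitting the sbatch script, turning a silent missing-output failure into an explicit, user-facing error.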
Currently, when the user fileshare runs out of space, the tasks fail (with errors like the one below), but the server keeps the job status at "running".
Possibly the tasks can no longer write the SLURM output files correctly in this state. Is there a way to detect this and have the server also show the pipeline as failed, with a relevant error? We see the following on the running server when looking at what it prints to the console:
Example error for out of storage space for the task: