tcompa opened 1 year ago
I'm reproducibly getting this error when I run a workflow with 1GB of RAM for the Create OME-Zarr task on a large dataset (but not when running on a small dataset):
JOB ERROR:
TRACEBACK:
JobExecutionError
Task failed with returncode=-9
When I increase to 4GB, it runs through.
It triggers after a few seconds of running the task.
What additional info would be useful here?
Ideas to reproduce this: can we have a stress-test task that uses more RAM than (a) the node has available, or (b) what is being requested, and see how each case fails?
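For such a stress-test, a minimal memory-hog could look like the sketch below (hypothetical code, not part of fractal-tasks-core; the function name and parameters are made up). Running it inside a task with `target_mb` above the SLURM `--mem` request should reproduce the `returncode=-9` kill:

```python
import time


def stress_memory(target_mb: int, hold_seconds: float = 0.0) -> int:
    """Allocate roughly `target_mb` MB and keep it resident for a while.

    Touching one byte per 4-KiB page forces the kernel to actually back
    the allocation with RAM, so it counts against the job's memory limit.
    """
    buf = bytearray(target_mb * 1024 * 1024)
    for i in range(0, len(buf), 4096):
        buf[i] = 1
    time.sleep(hold_seconds)
    return len(buf)
```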
For the record, I think I observed something similar at UZH.
I ran the create-ome-zarr on the 23-wells dataset, and the task failed with a TaskExecutionError and this not-very-informative traceback:
/data/homes/fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.9.1/venv/lib/python3.9/site-packages/anndata/experimental/pytorch/_annloader.py:18: UserWarning: Сould not load pytorch.
warnings.warn("Сould not load pytorch.")
2023-04-11 12:32:30,597; INFO; START create_ome_zarr task
2023-04-11 12:32:30,597; INFO; [glob_with_multiple_patterns] patterns=['*.png', '*F001*']
2023-04-11 12:32:31,396; INFO; [glob_with_multiple_patterns] Found 2271 items
2023-04-11 12:32:31,437; INFO; Creating 20200812-CardiomyocyteDifferentiation14-Cycle1.zarr
Traceback (most recent call last):
File "/net/nfs4/pelkmanslab-fileserver-common/data/homes/fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.9.1/venv/lib/python3.9/site-packages/fractal_tasks_core/create_ome_zarr.py", line 448, in <module>
run_fractal_task(
File "/data/homes/fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.9.1/venv/lib/python3.9/site-packages/fractal_tasks_core/_utils.py", line 91, in run_fractal_task
metadata_update = task_function(**task_args.dict(exclude_unset=True))
File "/net/nfs4/pelkmanslab-fileserver-common/data/homes/fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.9.1/venv/lib/python3.9/site-packages/fractal_tasks_core/create_ome_zarr.py", line 224, in create_ome_zarr
site_metadata, number_images_mlf = parse_yokogawa_metadata(
File "/data/homes/fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.9.1/venv/lib/python3.9/site-packages/fractal_tasks_core/lib_metadata_parsing.py", line 52, in parse_yokogawa_metadata
mrf_frame, mlf_frame, error_count = read_metadata_files(
File "/data/homes/fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.9.1/venv/lib/python3.9/site-packages/fractal_tasks_core/lib_metadata_parsing.py", line 148, in read_metadata_files
mlf_frame, error_count = read_mlf_file(mlf_path, filename_patterns)
File "/data/homes/fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.9.1/venv/lib/python3.9/site-packages/fractal_tasks_core/lib_metadata_parsing.py", line 212, in read_mlf_file
mlf_frame_raw = pd.read_xml(mlf_path)
File "/data/homes/fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.9.1/venv/lib/python3.9/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "/data/homes/fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.9.1/venv/lib/python3.9/site-packages/pandas/io/xml.py", line 1088, in read_xml
return _parse(
File "/data/homes/fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.9.1/venv/lib/python3.9/site-packages/pandas/io/xml.py", line 827, in _parse
data_dicts = p.parse_data()
File "/data/homes/fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.9.1/venv/lib/python3.9/site-packages/pandas/io/xml.py", line 551, in parse_data
self.xml_doc = self._parse_doc(self.path_or_buffer)
File "/data/homes/fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.9.1/venv/lib/python3.9/site-packages/pandas/io/xml.py", line 636, in _parse_doc
doc = fromstring(
File "src/lxml/etree.pyx", line 3257, in lxml.etree.fromstring
File "src/lxml/parser.pxi", line 1916, in lxml.etree._parseMemoryDocument
File "src/lxml/parser.pxi", line 1803, in lxml.etree._parseDoc
File "src/lxml/parser.pxi", line 1144, in lxml.etree._BaseParser._parseDoc
File "src/lxml/parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 728, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 657, in lxml.etree._raiseParseError
File "<string>", line 37159
lxml.etree.XMLSyntaxError: unknown error, line 37159, column 269
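As an aside on the crash site: `pd.read_xml` builds the whole document tree in memory before flattening it. If the mlf file really is the memory hog, a streaming parse that keeps only one record in memory at a time would be much cheaper. A sketch with the stdlib parser (the `rec` tag and its attributes below are a made-up mini-schema for illustration, not the real mlf layout):

```python
import xml.etree.ElementTree as ET
from io import BytesIO


def iter_records(xml_source, record_tag):
    """Collect attribute dicts of `record_tag` elements while streaming."""
    rows = []
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == record_tag:
            rows.append(dict(elem.attrib))
            elem.clear()  # drop the parsed subtree so memory stays bounded
    return rows


# Tiny demo document standing in for a huge MeasurementData.mlf:
doc = BytesIO(b"<root><rec well='B03' field='1'/><rec well='B03' field='2'/></root>")
rows = iter_records(doc, "rec")
```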
I then asked for 2G of memory, and got a JobExecutionError instead. This is the job log:
JOB ERROR:
TRACEBACK:
JobExecutionError
COMMAND:
Content of /net/nfs4/pelkmanslab-fileserver-common/data/homes/fractal/fractal-demos/examples/server/artifacts/proj_0000008_wf_0000008_job_0000008/0_slurm_submit.sbatch:
#!/bin/sh
#SBATCH --partition=main
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2000M
#SBATCH --job-name=Create_OME-Zarr_structure
#SBATCH --err=/net/nfs4/pelkmanslab-fileserver-test01/data/homes/test01/fractal-demos/examples/cache/20230411_123927_proj_0000008_wf_0000008_job_0000008/0_slurm_%j.err
#SBATCH --out=/net/nfs4/pelkmanslab-fileserver-test01/data/homes/test01/fractal-demos/examples/cache/20230411_123927_proj_0000008_wf_0000008_job_0000008/0_slurm_%j.out
export CELLPOSE_LOCAL_MODELS_PATH=/data/homes/test01/.cache/CELLPOSE_LOCAL_MODELS_PATH
export NUMBA_CACHE_DIR=/data/homes/test01/.cache/NUMBA_CACHE_DIR
export NAPARI_CONFIG=/data/homes/test01/.cache/napari_config.json
export XDG_CONFIG_HOME=/data/homes/test01/.cache/XDG_CONFIG
export XDG_CACHE_HOME=/data/homes/test01/.cache/XDG
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=2000MB /data/homes/fractal/anaconda3/envs/fractal-server-1.2.0a3/bin/python -m fractal_server.app.runner._slurm.remote --input-file /net/nfs4/pelkmanslab-fileserver-common/data/homes/fractal/fractal-demos/examples/server/artifacts/proj_0000008_wf_0000008_job_0000008/0_in_1lchwgPleRWTNJUYkdHKiRth8enpesw0.pickle --output-file /net/nfs4/pelkmanslab-fileserver-test01/data/homes/test01/fractal-demos/examples/cache/20230411_123927_proj_0000008_wf_0000008_job_0000008/0_out_1lchwgPleRWTNJUYkdHKiRth8enpesw0.pickle &
wait
STDOUT:
File /net/nfs4/pelkmanslab-fileserver-common/data/homes/fractal/fractal-demos/examples/server/artifacts/proj_0000008_wf_0000008_job_0000008/0_slurm_9609579.out is empty
STDERR:
File /net/nfs4/pelkmanslab-fileserver-common/data/homes/fractal/fractal-demos/examples/server/artifacts/proj_0000008_wf_0000008_job_0000008/0_slurm_9609579.err is empty
ADDITIONAL INFO:
Task failed with returncode=-11
I then asked for 4G of memory, and the task ran through.
The XML file for the full 23-well example is far bigger than the tiny examples, so it's not unreasonable that parsing it is a bit more memory-hungry. If that's the case, the 23-well example is probably close to an upper bound on the XML sizes we'd normally hit: it's not many wells, but with ~14h of imaging it contains something on the order of a million images (=> a million lines in the XML file). Thus, we may want to raise the default memory for the Create OME-Zarr task to something like 4G after all.
It would be great if we failed in a way that makes it more obvious that this is a memory error, though!
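One place where the "this was OOM" information may already exist is SLURM accounting: when memory limits are enforced via cgroups, `sacct` reports the job state as `OUT_OF_MEMORY`. Assuming accounting is enabled on the cluster, something like this (using the job id from the log above) would tell us whether SLURM itself knew:

```shell
# Query SLURM accounting for the failed job; with cgroup-based memory
# enforcement, State shows OUT_OF_MEMORY for jobs killed over their limit.
sacct -j 9609579 --format=JobID,State,ExitCode,MaxRSS,ReqMem
```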
With the new stress-test tasks, we can now require a certain amount of both memory and running time for a given task. Let's use this to better understand how this kind of OOM error appears, and whether the fact that these errors are not caught is Fractal's or SLURM's responsibility.
Questions:
- Can we detect these failures from the return codes? The two failures above came with returncode=-9 and returncode=-11, but I don't know whether this is robust.

Based on the info above, we may or may not improve our catching of these OOM errors. Independently of this, let's at least improve the JobExecutionError logs by mentioning what is always a possible solution hint (try increasing memory).
@jluethi observed an error different from the one in https://github.com/fractal-analytics-platform/fractal-server/issues/343