fractal-analytics-platform / fractal-server

Fractal backend
https://fractal-analytics-platform.github.io/fractal-server/
BSD 3-Clause "New" or "Revised" License

Out of memory errors #599

Open tcompa opened 1 year ago

tcompa commented 1 year ago

@jluethi observed an error different from the one reported in https://github.com/fractal-analytics-platform/fractal-server/issues/343

jluethi commented 1 year ago

I'm reproducibly getting this error when I run a workflow with 1GB of RAM for the Create OME-Zarr task on a large dataset (but not when running on a small dataset):

JOB ERROR:
TRACEBACK:
JobExecutionError

Task failed with returncode=-9

When I increase to 4GB, it runs through.

It triggers after a few seconds of running the task.

What additional info would be useful here?

Idea to reproduce this: can we have a stress-test task that uses more RAM than a) the node has available, or b) what is being requested, and see how each case fails?
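
As an illustration (a hypothetical sketch, not an existing fractal task), such a stress-test task could simply allocate a configurable amount of memory and hold it for a configurable time, so that we can deliberately exceed either the memory requested from SLURM or the node's physical memory and watch how the failure surfaces:

import time

def memory_stress_task(target_mb: int = 2000, hold_seconds: int = 30) -> None:
    """Allocate roughly target_mb MB of memory and keep it alive for hold_seconds."""
    chunk_mb = 100
    chunks = []  # keep references so the allocations are not garbage-collected
    allocated = 0
    while allocated < target_mb:
        chunks.append(bytearray(chunk_mb * 1024 * 1024))
        allocated += chunk_mb
        print(f"allocated ~{allocated} MB", flush=True)
    time.sleep(hold_seconds)

if __name__ == "__main__":
    memory_stress_task()

Running this with target_mb above the memory requested via --mem (case b), or above the node's total RAM (case a), should let us compare the two failure modes directly.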

tcompa commented 1 year ago

For the record, I think I observed something similar at UZH.

I ran the create-ome-zarr task on the 23-wells dataset, and it failed with a TaskExecutionError and this not-very-informative traceback:

/data/homes/fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.9.1/venv/lib/python3.9/site-packages/anndata/experimental/pytorch/_annloader.py:18: UserWarning: Сould not load pytorch.
  warnings.warn("Сould not load pytorch.")
2023-04-11 12:32:30,597; INFO; START create_ome_zarr task
2023-04-11 12:32:30,597; INFO; [glob_with_multiple_patterns] patterns=['*.png', '*F001*']
2023-04-11 12:32:31,396; INFO; [glob_with_multiple_patterns] Found 2271 items
2023-04-11 12:32:31,437; INFO; Creating 20200812-CardiomyocyteDifferentiation14-Cycle1.zarr
Traceback (most recent call last):
  File "/net/nfs4/pelkmanslab-fileserver-common/data/homes/fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.9.1/venv/lib/python3.9/site-packages/fractal_tasks_core/create_ome_zarr.py", line 448, in <module>
    run_fractal_task(
  File "/data/homes/fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.9.1/venv/lib/python3.9/site-packages/fractal_tasks_core/_utils.py", line 91, in run_fractal_task
    metadata_update = task_function(**task_args.dict(exclude_unset=True))
  File "/net/nfs4/pelkmanslab-fileserver-common/data/homes/fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.9.1/venv/lib/python3.9/site-packages/fractal_tasks_core/create_ome_zarr.py", line 224, in create_ome_zarr
    site_metadata, number_images_mlf = parse_yokogawa_metadata(
  File "/data/homes/fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.9.1/venv/lib/python3.9/site-packages/fractal_tasks_core/lib_metadata_parsing.py", line 52, in parse_yokogawa_metadata
    mrf_frame, mlf_frame, error_count = read_metadata_files(
  File "/data/homes/fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.9.1/venv/lib/python3.9/site-packages/fractal_tasks_core/lib_metadata_parsing.py", line 148, in read_metadata_files
    mlf_frame, error_count = read_mlf_file(mlf_path, filename_patterns)
  File "/data/homes/fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.9.1/venv/lib/python3.9/site-packages/fractal_tasks_core/lib_metadata_parsing.py", line 212, in read_mlf_file
    mlf_frame_raw = pd.read_xml(mlf_path)
  File "/data/homes/fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.9.1/venv/lib/python3.9/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/data/homes/fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.9.1/venv/lib/python3.9/site-packages/pandas/io/xml.py", line 1088, in read_xml
    return _parse(
  File "/data/homes/fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.9.1/venv/lib/python3.9/site-packages/pandas/io/xml.py", line 827, in _parse
    data_dicts = p.parse_data()
  File "/data/homes/fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.9.1/venv/lib/python3.9/site-packages/pandas/io/xml.py", line 551, in parse_data
    self.xml_doc = self._parse_doc(self.path_or_buffer)
  File "/data/homes/fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.9.1/venv/lib/python3.9/site-packages/pandas/io/xml.py", line 636, in _parse_doc
    doc = fromstring(
  File "src/lxml/etree.pyx", line 3257, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1916, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1803, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1144, in lxml.etree._BaseParser._parseDoc
  File "src/lxml/parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 728, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 657, in lxml.etree._raiseParseError
  File "<string>", line 37159
lxml.etree.XMLSyntaxError: unknown error, line 37159, column 269

I then asked for 2G of memory, and got a JobExecutionError. This is the job log:

JOB ERROR:
TRACEBACK:
JobExecutionError

COMMAND:
Content of /net/nfs4/pelkmanslab-fileserver-common/data/homes/fractal/fractal-demos/examples/server/artifacts/proj_0000008_wf_0000008_job_0000008/0_slurm_submit.sbatch:
#!/bin/sh
#SBATCH --partition=main
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2000M
#SBATCH --job-name=Create_OME-Zarr_structure
#SBATCH --err=/net/nfs4/pelkmanslab-fileserver-test01/data/homes/test01/fractal-demos/examples/cache/20230411_123927_proj_0000008_wf_0000008_job_0000008/0_slurm_%j.err
#SBATCH --out=/net/nfs4/pelkmanslab-fileserver-test01/data/homes/test01/fractal-demos/examples/cache/20230411_123927_proj_0000008_wf_0000008_job_0000008/0_slurm_%j.out
export CELLPOSE_LOCAL_MODELS_PATH=/data/homes/test01/.cache/CELLPOSE_LOCAL_MODELS_PATH
export NUMBA_CACHE_DIR=/data/homes/test01/.cache/NUMBA_CACHE_DIR
export NAPARI_CONFIG=/data/homes/test01/.cache/napari_config.json
export XDG_CONFIG_HOME=/data/homes/test01/.cache/XDG_CONFIG
export XDG_CACHE_HOME=/data/homes/test01/.cache/XDG

srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=2000MB /data/homes/fractal/anaconda3/envs/fractal-server-1.2.0a3/bin/python -m fractal_server.app.runner._slurm.remote --input-file /net/nfs4/pelkmanslab-fileserver-common/data/homes/fractal/fractal-demos/examples/server/artifacts/proj_0000008_wf_0000008_job_0000008/0_in_1lchwgPleRWTNJUYkdHKiRth8enpesw0.pickle --output-file /net/nfs4/pelkmanslab-fileserver-test01/data/homes/test01/fractal-demos/examples/cache/20230411_123927_proj_0000008_wf_0000008_job_0000008/0_out_1lchwgPleRWTNJUYkdHKiRth8enpesw0.pickle &
wait

STDOUT:
File /net/nfs4/pelkmanslab-fileserver-common/data/homes/fractal/fractal-demos/examples/server/artifacts/proj_0000008_wf_0000008_job_0000008/0_slurm_9609579.out is empty

STDERR:
File /net/nfs4/pelkmanslab-fileserver-common/data/homes/fractal/fractal-demos/examples/server/artifacts/proj_0000008_wf_0000008_job_0000008/0_slurm_9609579.err is empty

ADDITIONAL INFO:
Task failed with returncode=-11

I then asked for 4G of memory, and the task ran through.

jluethi commented 1 year ago

The XML file for the full 23-well example is waaaay bigger than the ones in the tiny examples, so it's not unreasonable that parsing it is a bit more memory hungry. If that's the case, the 23-well example is probably close to an upper bound of the XML sizes we'd normally hit: it's not many wells, but with imaging for ~14h it's something on the order of a million images (=> a million lines in the XML file). Thus, we may want to adjust the default memory for the Create OME-Zarr task to something like 4G after all.
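
For reference, a hypothetical sketch of what such a default could look like, assuming the per-task SLURM resources are declared in the task's meta block with mem in MB; the actual schema and key names should be checked against fractal-server and fractal-tasks-core:

# Hypothetical per-task resource defaults for the Create OME-Zarr task
# (key names and schema to be verified against the actual task manifest).
create_ome_zarr_meta = {
    "cpus_per_task": 1,
    "mem": 4000,  # MB; raised so that parsing very large MLF/MRF metadata files fits
}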

It would be great if we failed in a way that made it more obvious that this is a memory error, though!

tcompa commented 1 year ago

With the new stress-test tasks, we can now require a given amount of both memory and running time for a task. Let's use them to better understand how these OOM errors appear, and whether the fact that they are not caught is Fractal's or SLURM's responsibility.

Questions:

  1. What SLURM/task/cgroup variable encodes the frequency of memory-usage polling? What is its value (at UZH and/or FMI)? I'm not sure whether this will be useful info.
  2. Do we consistently observe the same return codes? For the moment we observed -9 and -11 (a negative return code means the task process was killed by that signal: -9 is SIGKILL, which is what the OOM killer sends, and -11 is SIGSEGV), but I don't know whether this is robust.
  3. Do we consistently observe the same error type (JobExecutionError vs TaskExecutionError)? I think I observed both at UZH, but I'll need to try again.
  4. Do we ever observe an actual message in the SLURM stderr files (as we did in #343)? If so, when?

Based on the info above, we may or may not improve our catching of these OOM errors. Independently of this, let's at least improve the JobExecutionError logs by mentioning a hint that is always a possible solution (try increasing the requested memory).
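
As a possible direction (a sketch only, not actual fractal-server code), the place that builds the JobExecutionError could translate negative return codes into signal names and append the memory hint whenever the signal is one typically seen under memory pressure:

import signal

def describe_returncode(returncode: int) -> str:
    """Turn a task's return code into a human-readable message with a hint."""
    if returncode >= 0:
        return f"Task exited with returncode={returncode}"
    sig = signal.Signals(-returncode)  # e.g. -9 -> SIGKILL, -11 -> SIGSEGV
    msg = f"Task was killed by {sig.name} (returncode={returncode})."
    if sig in (signal.SIGKILL, signal.SIGSEGV):
        # SIGKILL is what the kernel OOM killer (and SLURM's memory enforcement)
        # sends when the memory limit is exceeded; SIGSEGV has also shown up here.
        msg += " This may be an out-of-memory error: try increasing the requested memory."
    return msg

print(describe_returncode(-9))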