fractal-analytics-platform / fractal-tasks-core

Main tasks for the Fractal analytics platform
https://fractal-analytics-platform.github.io/fractal-tasks-core/
BSD 3-Clause "New" or "Revised" License

Race condition in Apply Registration to image task #516

Closed jluethi closed 8 months ago

jluethi commented 12 months ago

The Apply Registration to image task can hit a race condition. The task copies the standard ROI tables from the reference cycle, but it also modifies the reference cycle while running in parallel on it.

This only happens when overwrite_input = True. In that case, the task creates a new Zarr group (e.g. 0_registered), removes the old Zarr group (0), and then renames the registered Zarr group to 0.

While the task for cycle 0 is still removing the old Zarr group, the task running for a different cycle may try to read the tables from the reference cycle. During that window the table does not exist, which results in the following error:

Traceback (most recent call last):
  File "/Users/joel/Library/CloudStorage/Dropbox/Joel/BioVisionCenter/Code/fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.11.0a2/venv/lib/python3.9/site-packages/fractal_tasks_core/tasks/apply_registration_to_image.py", line 376, in <module>
    run_fractal_task(
  File "/Users/joel/Library/CloudStorage/Dropbox/Joel/BioVisionCenter/Code/fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.11.0a2/venv/lib/python3.9/site-packages/fractal_tasks_core/tasks/_utils.py", line 79, in run_fractal_task
    metadata_update = task_function(**pars)
  File "pydantic/decorator.py", line 40, in pydantic.decorator.validate_arguments.validate.wrapper_function
    from contextlib import _GeneratorContextManager
  File "pydantic/decorator.py", line 134, in pydantic.decorator.ValidatedFunction.call

  File "pydantic/decorator.py", line 206, in pydantic.decorator.ValidatedFunction.execute

  File "/Users/joel/Library/CloudStorage/Dropbox/Joel/BioVisionCenter/Code/fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.11.0a2/venv/lib/python3.9/site-packages/fractal_tasks_core/tasks/apply_registration_to_image.py", line 221, in apply_registration_to_image
    old_table_group = zarr.open_group(table_dict[table], mode="r")
  File "/Users/joel/Library/CloudStorage/Dropbox/Joel/BioVisionCenter/Code/fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.11.0a2/venv/lib/python3.9/site-packages/zarr/hierarchy.py", line 1532, in open_group
    raise GroupNotFoundError(path)
zarr.errors.GroupNotFoundError: group not found at path ''

I'm creating a PR to make this race condition much less likely:

  1. Instead of removing the old group first (a slow call) and then renaming the new group to the old name, I'll now do the following: rename the old group to group_name_tmp (e.g. 0_tmp), then rename the new group to the old group name (e.g. from 0_registered to 0). Both renames are very fast. Only after that, remove the tmp folder.
  2. Opening the old Zarr group is wrapped in a try/except that catches zarr.errors.GroupNotFoundError and tries again 5 seconds later.
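
The rename-based swap from step 1 can be sketched as follows. This is only an illustration, not the code from the PR: it assumes the Zarr groups are plain directories on a local or network filesystem, and the function name `swap_in_registered_group` and its parameters are hypothetical.

```python
import os
import shutil
from pathlib import Path


def swap_in_registered_group(zarr_dir: str, name: str = "0") -> None:
    """Replace `<zarr_dir>/<name>` with `<zarr_dir>/<name>_registered`.

    Hypothetical helper: assumes each Zarr group is a directory on disk.
    """
    base = Path(zarr_dir)
    old = base / name
    new = base / f"{name}_registered"
    tmp = base / f"{name}_tmp"

    # Two fast renames: the old group steps aside, the new one takes its name.
    os.rename(old, tmp)
    os.rename(new, old)

    # The slow recursive delete runs only after the swap is complete.
    shutil.rmtree(tmp)
```

Note that the group is still missing for a brief moment between the two renames, so this shrinks the window rather than closing it. That is why the retry in step 2 is still needed.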

This should make it very unlikely to hit the issue. But the real solution would be to remove the need to load something from the reference cycle in a task that also runs in parallel on said reference cycle (basically, find a way to generate the new, correct ROI tables from the cycle itself).

adrtsc commented 8 months ago

Hi @jluethi,

I already talked to you about this, but this issue is not fixed yet for me. For reference, here again is the behaviour I observed:

When the task failed, I got this error message:

Traceback (most recent call last):
  File "/net/nfs4/pelkmanslab-fileserver-fractal/data/homes/fractal/20230627_joel_fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.14.0/venv/lib/python3.10/site-packages/fractal_tasks_core/tasks/apply_registration_to_image.py", line 213, in apply_registration_to_image
    old_table_group = zarr.open_group(table_dict[table], mode="r")
  File "/data/homes/fractal/20230627_joel_fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.14.0/venv/lib/python3.10/site-packages/zarr/hierarchy.py", line 1532, in open_group
    raise GroupNotFoundError(path)
zarr.errors.GroupNotFoundError: group not found at path ''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/net/nfs4/pelkmanslab-fileserver-fractal/data/homes/fractal/20230627_joel_fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.14.0/venv/lib/python3.10/site-packages/fractal_tasks_core/tasks/apply_registration_to_image.py", line 377, in <module>
    run_fractal_task(
  File "/data/homes/fractal/20230627_joel_fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.14.0/venv/lib/python3.10/site-packages/fractal_tasks_core/tasks/_utils.py", line 79, in run_fractal_task
    metadata_update = task_function(**pars)
  File "pydantic/decorator.py", line 40, in pydantic.decorator.validate_arguments.validate.wrapper_function
  File "pydantic/decorator.py", line 134, in pydantic.decorator.ValidatedFunction.call
  File "pydantic/decorator.py", line 206, in pydantic.decorator.ValidatedFunction.execute
  File "/net/nfs4/pelkmanslab-fileserver-fractal/data/homes/fractal/20230627_joel_fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.14.0/venv/lib/python3.10/site-packages/fractal_tasks_core/tasks/apply_registration_to_image.py", line 216, in apply_registration_to_image
    old_table_group = zarr.open_group(table_dict[table], mode="r")
  File "/data/homes/fractal/20230627_joel_fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal/fractal-tasks-core0.14.0/venv/lib/python3.10/site-packages/zarr/hierarchy.py", line 1532, in open_group
    raise GroupNotFoundError(path)
zarr.errors.GroupNotFoundError: group not found at path ''

My workaround at the moment is to put the group loading/sleep code section into a while loop. With this change the task runs through, and I haven't had any issues with it so far.

for table in table_dict.keys():
    logger.info(f"Copying table: {table}")
    # Get the relevant metadata of the Zarr table & add it
    # See issue #516 for the need for this workaround
    while True:
        try:
            old_table_group = zarr.open_group(table_dict[table], mode="r")
            break
        except zarr.errors.GroupNotFoundError:
            logger.warning(
                f"Table {table} not found yet. Waiting 5 seconds "
                "before trying again"
            )
            time.sleep(5)

jluethi commented 8 months ago

Thanks for reporting this @adrtsc !

I'm proposing a slightly different fix here. Let me know if you agree with this approach. I'm also open to changing the max_retries parameter. https://github.com/fractal-analytics-platform/fractal-tasks-core/pull/638
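
For readers following along, a bounded retry of the kind a max_retries parameter suggests might look roughly like the sketch below. This is not the code from the linked PR. The helper name `open_with_retries`, its parameters, and the default values are assumptions made for illustration. It takes the opener and the exception type as arguments so that it does not depend on any particular library.

```python
import logging
import time
from typing import Callable, Type, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def open_with_retries(
    open_fn: Callable[[], T],
    not_found_exc: Type[Exception],
    max_retries: int = 20,      # assumed default, not taken from the PR
    wait_seconds: float = 5.0,  # matches the 5-second wait discussed above
) -> T:
    """Call `open_fn` until it succeeds or `max_retries` attempts are used."""
    for attempt in range(1, max_retries + 1):
        try:
            return open_fn()
        except not_found_exc:
            if attempt == max_retries:
                # Give up and surface the original error to the caller.
                raise
            logger.warning(
                "Attempt %d/%d failed. Waiting %s seconds before trying again",
                attempt, max_retries, wait_seconds,
            )
            time.sleep(wait_seconds)
    raise ValueError("max_retries must be at least 1")
```

In the task this could be called as `open_with_retries(lambda: zarr.open_group(table_dict[table], mode="r"), zarr.errors.GroupNotFoundError)`. Unlike an unbounded while loop, it fails with the original error if the table never appears instead of waiting forever.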