fractal-analytics-platform / fractal-tasks-core

Main tasks for the Fractal analytics platform
https://fractal-analytics-platform.github.io/fractal-tasks-core/
BSD 3-Clause "New" or "Revised" License
11 stars 5 forks source link

703 fix illum corr no overwrite #709

Closed jluethi closed 2 months ago

jluethi commented 2 months ago

@tcompa I created some new helper functions in tasks/_zarr_utils.py that I used from within illumination_correction to update the zarr metadata.

The part that is tricky, but also relevant for https://github.com/fractal-analytics-platform/fractal-tasks-core/issues/675 is updating well metadata. I have an approach that seems to work in the tests I currently have. The tricky part are potential race conditions in updating the well .zattrs.

I thought I could solve this with FileLock. But if I use the load_NgffWellMeta from within the file-lock part of the code, I get a json.decoder.JSONDecodeError. What do you think about the overall approach to update well metadata? Any idea how we could do it with the load_NgffWellMeta inside the file lock section?


This also: Closes #708 Closes #703

Checklist before merging

github-actions[bot] commented 2 months ago

Coverage report

Click to see where and how coverage changed

FileStatementsMissingCoverageCoverage
(new stmts)
Lines missing
  fractal_tasks_core/tasks
  _registration_utils.py
  _zarr_utils.py
  apply_registration_to_image.py
  illumination_correction.py
Project Total  

This report was generated by python-coverage-comment-action

tcompa commented 2 months ago

I thought I could solve this with FileLock. But if I use the load_NgffWellMeta from within the file-lock part of the code, I get a json.decoder.JSONDecodeError. What do you think about the overall approach to update well metadata? Any idea how we could do it with the load_NgffWellMeta inside the file lock section?

https://py-filelock.readthedocs.io was also my first guess for how to solve it. Notice that the docs say:

Don’t use a FileLock to lock the file you want to write to, instead create a separate .lock file as shown above.

And they have this example:

from filelock import Timeout, FileLock

lock = FileLock("high_ground.txt.lock")
with lock:
    with open("high_ground.txt", "a") as f:
        f.write("You were the chosen one.")

while you have

    lock = FileLock(f"{well_url}/.zattrs")

Could this be the reason for the JSONDecodeError? TBD


This comes just from a very quick look, I'll come back to this PR a bit later)

jluethi commented 2 months ago

Could this be the reason for the JSONDecodeError? TBD

Ah yes, that would make sense! Will try that later!

jluethi commented 2 months ago

Ok, fixing things to use .lock correctly seems to do the trick :)

jluethi commented 2 months ago
jluethi commented 2 months ago

This PR would be ready for a review when you get a changes @tcompa

tcompa commented 2 months ago

As a small change, I added a test of the lock-based helper function (ref https://github.com/fractal-analytics-platform/fractal-tasks-core/pull/709/commits/40ba0e80daa010fbe871b9c7f56b2cf1a9103c41).

Here is an example of the test output (after adding some additional logging to the function, in my local version):

===================================================== test session starts =====================================================
platform linux -- Python 3.10.12, pytest-7.4.0, pluggy-1.2.0 -- /home/tommaso/.cache/pypoetry/virtualenvs/fractal-tasks-core-UoMDyr20-py3.10/bin/python
cachedir: .pytest_cache
rootdir: /home/tommaso/Fractal/fractal-tasks-core
plugins: pretty-1.2.0, anyio-4.3.0, napari-0.4.19.post1, napari-plugin-engine-0.2.0, npe2-0.7.5
collected 1 item                                                                                                              

tests/tasks/test_unit_zarr_utils.py::test_update_well_metadata 
-------------------------------------------------------- live log call --------------------------------------------------------
INFO     root:_zarr_utils.py:65 [_update_well_metadata, time=16750.78] Lock acquired for new_image_path='0_new_0'
INFO     root:_zarr_utils.py:86 [_update_well_metadata, time=16751.31] Lock released for new_image_path='0_new_0'

-------------------------------------------------------- live log call --------------------------------------------------------
INFO     root:_zarr_utils.py:65 [_update_well_metadata, time=16751.33] Lock acquired for new_image_path='0_new_2'
INFO     root:_zarr_utils.py:86 [_update_well_metadata, time=16751.86] Lock released for new_image_path='0_new_2'

-------------------------------------------------------- live log call --------------------------------------------------------
INFO     root:_zarr_utils.py:65 [_update_well_metadata, time=16751.89] Lock acquired for new_image_path='0_new_3'
INFO     root:_zarr_utils.py:86 [_update_well_metadata, time=16752.42] Lock released for new_image_path='0_new_3'

-------------------------------------------------------- live log call --------------------------------------------------------
INFO     root:_zarr_utils.py:65 [_update_well_metadata, time=16752.45] Lock acquired for new_image_path='0_new_1'
INFO     root:_zarr_utils.py:86 [_update_well_metadata, time=16752.97] Lock released for new_image_path='0_new_1'
PASSED

====================================================== slowest durations ======================================================
2.24s call     tests/tasks/test_unit_zarr_utils.py::test_update_well_metadata
0.03s setup    tests/tasks/test_unit_zarr_utils.py::test_update_well_metadata
0.00s teardown tests/tasks/test_unit_zarr_utils.py::test_update_well_metadata
Results (3.21s):
         1 passed

These logs show the expected behavior, where only a single new image at a time can be processed.

Also note that the order of the results is not guaranteed any more, since it's on a first-come-first-added-to-the-attributes basis. This means that the final state of the zarr attributes is not guaranteed to have a sorted image list. If this is needed, we can add it explicitly within the _update_well_metadata function.


If I set a timeout of 0.001 seconds, then the lock can be acquired fast enough and the error looks like:

E               filelock._error.Timeout: The file lock '/tmp/pytest-of-tommaso/pytest-33/test_update_well_metadata0/plate.zarr/B/03/.zattrs.lock' could not be acquired.

In principle we can catch this error and raise a more informative one, but I don't think this is very relevant. The function itself, in my tests, runs in less than 0.05 seconds. Therefore the current timeout of 120 seconds should cover a scenario where 2400 functions are submitted for the same well and at the very same time (which seems exceedingly unlikely, in a SLURM setup).

tcompa commented 2 months ago

Two minor changes introduced with https://github.com/fractal-analytics-platform/fractal-tasks-core/pull/709/commits/5ab9a6c7caff3ba7f3b83ffe3d2a6c52265ca416:

  1. Sort images by path, when adding a new one. This guarantees that the final order does not depend on the order of execution (which cannot be guaranteed, since tasks are executed asynchronously in fractal-server). This feature is not critical, I can also revert it if there's any reason.
  2. Raise an appropriate error if the old image is not present in the current metadata. I think this is needed, or it would appear as less clear error (I guess it'd be an IndexError for trying to take element [0] of an empty list).
tcompa commented 2 months ago

Apart from the comments above, this looks good to me.

jluethi commented 2 months ago

Thanks for adding the tests for the FileLock behavior!

Sort images by path, when adding a new one. This guarantees that the final order does not depend on the order of execution (which cannot be guaranteed, since tasks are executed asynchronously in fractal-server). This feature is not critical, I can also revert it if there's any reason.

Neat, thanks for adding this!

jluethi commented 2 months ago

Looks all good to me. Shall we merge?