fractal-analytics-platform / fractal-tasks-core

Main tasks for the Fractal analytics platform
https://fractal-analytics-platform.github.io/fractal-tasks-core/
BSD 3-Clause "New" or "Revised" License

Debug/fix filelock test on macos #713

Closed: tcompa closed this 2 months ago

tcompa commented 2 months ago

The following test sometimes fails on macOS:


import shutil
import time
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

import pytest
from devtools import debug
from filelock import Timeout

from fractal_tasks_core.ngff import load_NgffWellMeta

# `_star_update_well_metadata` is a module-level helper of the test file
# (see the sketch after this code block).


def test_update_well_metadata_concurrency(
    tmp_path: Path,
    testdata_path: Path,
    monkeypatch,
):
    """
    Run _update_well_metadata in parallel for adding N>1 new images to a given
    well. We artificially slow down each call by INTERVAL seconds, and verify
    that the test takes at least N x INTERVAL seconds (since each call to
    `_update_well_metadata` is blocking).

    In the last section of the test, we verify that a timeout error is raised
    when the timeout is too short.
    """

    N = 4
    INTERVAL = 0.5

    # Copy a reference zarr into a temporary folder
    raw_zarrurl = (testdata_path / "plate_ones.zarr").as_posix()
    zarr_url = (tmp_path / "plate.zarr").resolve().as_posix()
    shutil.copytree(raw_zarrurl, zarr_url)

    # Artificially slow down `_update_well_metadata`
    import fractal_tasks_core.tasks._zarr_utils

    def _slow_load_NgffWellMeta(*args, **kwargs):
        time.sleep(INTERVAL)
        return load_NgffWellMeta(*args, **kwargs)

    monkeypatch.setattr(
        fractal_tasks_core.tasks._zarr_utils,
        "load_NgffWellMeta",
        _slow_load_NgffWellMeta,
    )

    # Prepare parallel-execution argument list
    well_url = Path(zarr_url, "B/03").as_posix()
    list_args = [(well_url, "0", f"0_new_{suffix}") for suffix in range(N)]

    # Run `_update_well_metadata` N times
    time_start = time.perf_counter()
    executor = ProcessPoolExecutor()
    res_iter = executor.map(_star_update_well_metadata, list_args)
    list(res_iter)  # This is needed to wait for all results.
    time_end = time.perf_counter()

    # Check that time was at least N*INTERVAL seconds
    assert (time_end - time_start) > N * INTERVAL

    # Check that all new images were added
    well_meta = load_NgffWellMeta(well_url)
    well_image_paths = [img.path for img in well_meta.well.images]
    debug(well_image_paths)
    assert well_image_paths == [
        "0",
        "0_new_0",
        "0_new_1",
        "0_new_2",
        "0_new_3",
    ]

    # Prepare parallel-execution argument list with short timeout
    well_url = Path(zarr_url, "B/03").as_posix()
    list_args = [
        (well_url, "0", f"0_new_{suffix}", INTERVAL / 100)
        for suffix in range(N, 2 * N)
    ]
    with pytest.raises(Timeout) as e:
        res_iter = executor.map(_star_update_well_metadata, list_args)
        list(res_iter)  # This is needed to wait for all results.
    debug(e.value)
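
For context, `_star_update_well_metadata` is not shown above. Since `executor.map` passes a single tuple per call, it is presumably a star-unpacking wrapper along these lines (a sketch, not the actual helper from the test module):

from fractal_tasks_core.tasks._zarr_utils import _update_well_metadata

def _star_update_well_metadata(args):
    # Unpack the single argument tuple that `executor.map` passes per call
    # and forward it to the function under test.
    return _update_well_metadata(*args)
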
tcompa commented 2 months ago
  1. macOS defaults to the spawn start method for multiprocessing, rather than fork (see the Python docs).
  2. spawn doesn't play well with monkeypatch (ref: https://github.com/pytest-dev/pytest/issues/12045), since each spawned worker re-imports modules from scratch and never sees the patched attribute; see the sketch after this list.
  3. On macOS the test was therefore not actually patching the load-NGFF function, and each call just ran normally, with no artificial slowdown. I verified this in the logs: each call should have taken at least 0.5 seconds, but was completing in less than 0.1 seconds.
  4. This also explains why the outcome was somewhat arbitrary in terms of whether the timeout error was raised.
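
To make point 2 concrete, here is a minimal, self-contained sketch (hypothetical names, not taken from the test suite) showing that a name rebound in the parent process is visible in a forked child but not in a spawned one:

import multiprocessing as mp

def greet():
    return "original"

def report(queue):
    # The child resolves the module-level name `greet` at call time.
    queue.put(greet())

if __name__ == "__main__":
    # Simulate what monkeypatch does: rebind a module-level name,
    # in the parent process only.
    globals()["greet"] = lambda: "patched"

    for method in ("fork", "spawn"):
        ctx = mp.get_context(method)
        queue = ctx.Queue()
        proc = ctx.Process(target=report, args=(queue,))
        proc.start()
        print(method, "->", queue.get())  # fork -> patched, spawn -> original
        proc.join()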

Fix: always use fork in tests. Note that fork is not available on Windows, but we don't run tests on Windows.
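
A minimal sketch of the fix, assuming it is applied by passing an explicit multiprocessing context to the executor (the actual change may set the start method differently):

import multiprocessing
from concurrent.futures import ProcessPoolExecutor

# Force the fork start method regardless of the platform default, so that
# the monkeypatched module state is inherited by the worker processes.
executor = ProcessPoolExecutor(
    mp_context=multiprocessing.get_context("fork")
)

`mp_context` is a standard `ProcessPoolExecutor` keyword argument since Python 3.7.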

(thanks @mfranzon)

jluethi commented 2 months ago

What a fun rabbithole, thanks for exploring & solving this!