gerlichlab / looptrace

Fork (from EMBL Gitlab) of the looptrace project: https://git.embl.de/grp-ellenberg/looptrace
MIT License

We need more stable, sustainable conversion of ND2 to ZARR #321

Open vreuter opened 1 month ago

vreuter commented 1 month ago

This has been observed with at least two different datasets by @TLSteinacker. The workaround is to run the ZARR conversion with v0.3.1 and then proceed with v0.4.0 for the rest of the processing, but this is hardly ideal.

After some investigation, this boils down to a difference between zarr 2.17.2 (in our v0.3.1 image) and 2.18.0 (in our v0.4.0 image). Specifically, this commit: https://github.com/zarr-developers/zarr-python/commit/9d046ea0d2878af7d15b3de3ec3036fe31661340

The additional Boolean expression added to the conditional in that commit means that the slice we're using now qualifies as a total slice per zarr.util.is_total_slice, so we walk a different branch of a data-preparation function call in zarr: https://github.com/zarr-developers/zarr-python/blob/056657ca5ed70aa3d77a9e2db42253fca39800b0/zarr/core.py#L2269-L2308
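For context, a simplified sketch of what a "total slice" check does. The real implementation is zarr.util.is_total_slice and handles more cases (Ellipsis, integer indices, etc.); the function below and its name are illustrative only. The key point is that after the commit above, an explicit 0:n slice counts as total, not just slice(None):

```python
def is_total_slice_sketch(item, shape):
    """Return True if `item` selects every element of an array of `shape`.

    Illustrative simplification of the check zarr performs; the real
    zarr.util.is_total_slice covers additional index types.
    """
    if isinstance(item, slice):
        item = (item,)
    if isinstance(item, tuple) and len(item) == len(shape):
        return all(
            isinstance(s, slice)
            and (
                s == slice(None)
                # explicit full-extent slice, e.g. 0:n with step 1
                or (s.start in (0, None) and s.stop == n and s.step in (1, None))
            )
            for s, n in zip(item, shape)
        )
    return False

# A bare slice(None) per axis covers everything:
print(is_total_slice_sketch((slice(None), slice(None)), (4, 5)))  # True
# An explicit 0:n slice also covers the full extent (the newly added case):
print(is_total_slice_sketch((slice(0, 4), slice(0, 5)), (4, 5)))  # True
# A partial slice does not:
print(is_total_slice_sketch((slice(0, 2), slice(None)), (4, 5)))  # False
```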

We therefore wind up getting back a Dask Array rather than a NumPy ndarray. Consequently, when ensure_numpy_array from numcodecs is called (https://github.com/zarr-developers/numcodecs/blob/main/numcodecs/compat.py#L13-L45), we enter the positive branch of the conditional, since not is_ndarray_like(buf) evaluates to True when we pass our Dask Array (whereas it's False with zarr 2.17.2, where we pass a NumPy array). This is with numcodecs v0.11.0.
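To make the branching concrete, here's a hedged sketch of the shape of that check. The real logic lives in numcodecs/compat.py; the function below is a stand-in, and FakeLazyArray is a toy substitute for a Dask Array so the sketch runs without dask installed:

```python
import numpy as np

def is_ndarray_like_sketch(obj):
    """Illustrative stand-in for the numcodecs check: does `obj` count as
    an ndarray for codec purposes? (The real check is in numcodecs/compat.py.)
    """
    return isinstance(obj, np.ndarray)

class FakeLazyArray:
    """Toy stand-in for a lazy (Dask-style) array: array-flavored attributes,
    but not a NumPy ndarray."""
    shape = (4, 5)
    dtype = np.dtype("i8")

np_buf = np.zeros((4, 5), dtype="i8")   # what zarr 2.17.2 hands to the codec
lazy_buf = FakeLazyArray()              # what zarr 2.18.0 hands to the codec

# With a NumPy buffer, `not is_ndarray_like(buf)` is False: no conversion branch.
print(not is_ndarray_like_sketch(np_buf))    # False
# With a lazy buffer, it's True: we take the conversion branch and break.
print(not is_ndarray_like_sketch(lazy_buf))  # True
```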

For v0.4.0 of looptrace, the simplest fix is to upper-bound our zarr dependency at 2.17.2; this is acceptable, as 2.17.2 is not very old and 2.18.0 was released only a couple of weeks ago. Furthermore, v0.4.0 will quickly be replaced by v0.5.0, so I don't envision long-term issues with second-order dependency conflicts, as the period of use of v0.4.0 will be relatively limited.
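For reference, the pin would look something like the following requirements-style constraint (the exact file and syntax depend on how looptrace declares its dependencies, which isn't shown in this thread):

```
# keep the pre-2.18 total-slice behavior; see issue #321
zarr<2.18
```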

For v0.5.0, however, we will need to address this more thoroughly.

vreuter commented 1 month ago

For v0.5.0, probably we can just compute() the array: https://docs.dask.org/en/stable/generated/dask.array.Array.compute.html#dask.array.Array.compute

This seems to be what was being done anyway, or at least zarr had previously been manifesting an in-memory version of the underlying data...
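A sketch of that defensive normalization. The helper name is ours (not looptrace's), and FakeLazyArray is a toy stand-in for a Dask Array so the example runs without dask; with dask installed, .compute() on a real dask.array.Array does the same materialization:

```python
import numpy as np

def ensure_in_memory(arr):
    """Materialize a possibly-lazy array before handing it to codecs.

    Dask arrays (and other lazy containers) expose .compute(); a plain
    NumPy ndarray does not, so it passes through untouched.
    """
    if hasattr(arr, "compute"):
        arr = arr.compute()  # manifest the underlying data in memory
    return np.asarray(arr)

class FakeLazyArray:
    """Minimal stand-in for a Dask Array, used only to keep this sketch
    runnable without dask."""
    def __init__(self, data):
        self._data = data
    def compute(self):
        return self._data

lazy = FakeLazyArray(np.arange(6).reshape(2, 3))
out = ensure_in_memory(lazy)
print(type(out).__name__)  # ndarray
```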

(attachment: looptrace_old__zarr_old__conversion_good)