fractal-analytics-platform / fractal-client

Command-line client for Fractal
https://fractal-analytics-platform.github.io/fractal-client
BSD 3-Clause "New" or "Revised" License

Generalize pyramid creation #32

Closed jluethi closed 2 years ago

jluethi commented 2 years ago

Currently, 4 pyramid levels are hard-coded. Depending on the use case, more pyramid levels could make sense, so we probably want to expose this as an option.

Also, if one tries to calculate pyramids for images whose dimensions are no longer divisible by 2, it fails. For example, with a 2560x2160 image it can do 4 levels of pyramids, but it fails on level 5 with this error:

Traceback (most recent call last):
  File "/net/nfs4/pelkmanslab-fileserver-jluethi/data/homes/jluethi/mwe_fractal/fractal/tasks/yokogawa_to_zarr.py", line 197, in <module>
    yokogawa_to_zarr(
  File "/net/nfs4/pelkmanslab-fileserver-jluethi/data/homes/jluethi/mwe_fractal/fractal/tasks/yokogawa_to_zarr.py", line 107, in yokogawa_to_zarr
    f5_matrix = [da.coarsen(np.min, x, {0: 2, 1: 2}) for x in f4_matrix]
  File "/net/nfs4/pelkmanslab-fileserver-jluethi/data/homes/jluethi/mwe_fractal/fractal/tasks/yokogawa_to_zarr.py", line 107, in <listcomp>
    f5_matrix = [da.coarsen(np.min, x, {0: 2, 1: 2}) for x in f4_matrix]
  File "/data/homes/jluethi/.conda/envs/fractal/lib/python3.10/site-packages/dask/array/routines.py", line 2276, in coarsen
    raise ValueError(msg)
ValueError: Coarsening factors {0: 2, 1: 2} do not align with array shape (1215, 1280).

Is there a general way we could handle this that works well? e.g. pad with 0s? Stretch? Or something similar?

jluethi commented 2 years ago

We could probably try the trim_excess=True option of dask coarsening and check whether that works well enough :)
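
For reference, a minimal sketch (not the actual Fractal code; the array here is just a stand-in with the shape from the traceback above) of what trim_excess=True does:

import dask.array as da
import numpy as np

# Stand-in 2D array whose shape (1215, 1280) is not divisible by 2.
# With the default trim_excess=False this raises the ValueError shown above;
# with trim_excess=True dask simply drops the excess rows/columns.
x = da.zeros((1215, 1280), chunks=(1215, 1280), dtype="uint16")
coarse = da.coarsen(np.min, x, {0: 2, 1: 2}, trim_excess=True)
print(coarse.shape)  # (607, 640)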

mfranzon commented 2 years ago

@jluethi good idea, I will give this a try.

tcompa commented 2 years ago

I'll take care of this.

The basic plan is to modify yokogawa_to_zarr by

jluethi commented 2 years ago

@tcompa Very good generalizations!

Will you also include a compression option that can be on by default (see https://github.com/fractal-analytics-platform/mwe_fractal/issues/28#issuecomment-1118735004, where you linked to here)? Not compression in the sense of coarsening, but actual data compression, e.g. using Blosc: https://zarr.readthedocs.io/en/stable/tutorial.html#compressors

Also, for the pyramid building: for our use cases, only xy downsampling seems reasonable at the moment. Maybe there will be a need to also downsample along Z at some point in the future. If it's easy to generalize this already, defaulting to no downsampling in Z when building the pyramid, that would be useful. Otherwise, we can worry about it if it's ever actually needed :)

jluethi commented 2 years ago

Reading the documentation more carefully, it sounds like Zarr actually uses the Blosc compressor by default, so the problem in issue 28 wasn't that files weren't compressed; it was just the pyramid overhead. In that case, we can close 28. We likely won't need to go into compressors and their options much; the default is likely a good starting place.

mfranzon commented 2 years ago

Just to add a bit of information on this: at the moment the compression algorithm is the default one, Blosc. The metadata associated with the zarr arrays is:

{
    [...],

    "compressor": {
        "blocksize": 0,
        "clevel": 5,
        "cname": "lz4",
        "id": "blosc",
        "shuffle": 1
    },
    "dimension_separator": "/",
    "dtype": "<u2",
    "fill_value": 0,
    "filters": null,
    "order": "C",

    [...],
    "zarr_format": 2
}
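
For completeness, a minimal sketch (zarr v2 API; the path, shape, and chunk sizes are made up for illustration, this is not the Fractal code) of how the compressor could be set explicitly when creating an array; leaving compressor out gives the same Blosc default shown in the metadata above:

import zarr
from numcodecs import Blosc

# Explicitly request the same codec that zarr v2 uses by default
# (blosc/lz4, clevel=5, shuffle=1).
compressor = Blosc(cname="lz4", clevel=5, shuffle=Blosc.SHUFFLE)
z = zarr.open(
    "example.zarr",          # hypothetical output path
    mode="w",
    shape=(2160, 2560),
    chunks=(1080, 1280),
    dtype="<u2",
    compressor=compressor,
)
print(z.compressor)  # Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
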
jluethi commented 2 years ago

@tcompa Another thing we could think about when generalizing the pyramid creation: How hard would it be to allow for arbitrary chunking? e.g., at the moment we chunk by original TIFF/PNG image (simplest implementation). I'm starting to wonder whether we'd get significant performance gains when zooming around and loading parts if we had smaller chunks, e.g. 216x256 chunks instead of 2160x2560 chunks. https://zarr.readthedocs.io/en/stable/tutorial.html#chunk-optimizations

Depending on how much work it would be to implement this, we could also use a tool like rechunker to test whether it has a significant performance impact. What's your impression of this?
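
For what it's worth, a hypothetical sketch of such a rechunker test (the paths, target chunk sizes, and the 4D (c, z, y, x) layout are assumptions for illustration; this is not part of Fractal):

import zarr
from rechunker import rechunk

# Open an existing zarr array (hypothetical path inside a plate) and plan a
# re-chunking to smaller xy chunks; rechunker writes the result to a new store.
source = zarr.open("plate.zarr/B/03/0/0", mode="r")
plan = rechunk(
    source,
    target_chunks=(1, 1, 216, 256),
    max_mem="1GB",
    target_store="plate_rechunked.zarr",
    temp_store="rechunk_tmp.zarr",
)
plan.execute()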

tcompa commented 2 years ago

I just pushed 2043a4351603334f93a4264b29f7b04b404eae9e (still in the dev-newcmd branch), with commit message:

Update yokogawa_to_zarr.py 

* Set trim_excess=True in da.coarsen;
* Make coarsening_factor an (optional) argument;
* Make num_levels an (optional) argument;
* Fix typo in docstring (delete_in type).

Note that, for now, the new parameters (coarsening_factor and num_levels) are not yet passed through fractal_cmd.py. This is something I'll fix soon, and I'll also look at the other ideas (coarsening along z, and arbitrary chunking).

As a test, I played with the two parameters coarsening_factor and num_levels (using regression_test.sh), and I confirm that the size of the result folder varies as it should.

num_levels=1 -> 318 M
num_levels=2, coarsening_factor=2 -> 397 M [expected: 318 M * (1+1/4) = 398 M]
num_levels=3, coarsening_factor=2 -> 418 M [expected: 318 M * (1+1/4+1/16) = 417 M]
num_levels=5, coarsening_factor=2 -> 425 M [expected: 318 M * (1+1/4+1/16+1/64+1/256) = 424 M]
num_levels=5, coarsening_factor=3 -> 359 M [expected: 318 M * (1+1/9+1/81+1/729+1/6561) = 358 M]
num_levels=5, coarsening_factor=4 -> 341 M [expected: 318 M * (1+1/16+1/256+1/4096+1/65536) = 339 M]
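
These expected values are just a geometric series: each additional level adds base_size / coarsening_factor**(2*level), since the coarsening applies to both x and y. A quick check of the arithmetic:

# Quick arithmetic check of the expected sizes listed above.
base = 318  # MB, measured with num_levels=1

def expected_size(num_levels, coarsening_factor):
    # each level l adds base / coarsening_factor**(2*l), because xy coarsening
    # shrinks both dimensions by the same factor
    return base * sum(1 / coarsening_factor ** (2 * lvl) for lvl in range(num_levels))

print(round(expected_size(2, 2)))  # 398
print(round(expected_size(5, 2)))  # 424
print(round(expected_size(5, 3)))  # 358
print(round(expected_size(5, 4)))  # 339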

Note that the original file sizes for this example are

$ du -ch /data/active/fractal-temp/3D/PelkmansLab/CardiacMultiplexing/Cycle1_subset_extras/*_B03_*C0{1,2,3}.png | tail -n1
1.6G    total
tcompa commented 2 years ago

To do before closing this issue:

jluethi commented 2 years ago

Hey @tcompa, this sounds great; I'm looking forward to testing it, especially the increased number of pyramid levels!

A small concern on the file sizes: if the original data is 1.6 GB, I would expect the zarr file to be very similar in size (the original images are already compressed). Also, /data/active/fractal-temp/3D/PelkmansLab/CardiacMultiplexing/Cycle1_subset_extras/ is probably not a good general test set. Cycle1 & Cycle1_subset are good test sets; Cycle1_subset_extras are just spare files and, e.g., have site numbers that don't start at 1. So maybe there is a bit of a parsing issue there?

jluethi commented 2 years ago

New test dataset that would be better to use for this: /data/active/fractal-temp/3D/PelkmansLab/CardiacMultiplexing/Cycle1_testSubset. It contains 2x2 wells and only 10 z stacks (the main part that has cells in the images), so it should run faster than the old test case. Plus, it contains the top-left corner of the Cycle1 test data, renamed so that the site numbers fit a 2x2 scheme.

tcompa commented 2 years ago

3d77935c29b782c1174bba0df5f93a938a5bc6c5 updates the pyramid creation.

tcompa commented 2 years ago

f13c94a1e94ef9dbb9416f338621062628c601e9 updates the way we chunk the arrays at each level of the pyramid, by adding the standard rechunk from dask (@jluethi we haven't looked into rechunker yet).

At the moment, we have hard-coded the chunk size to be

    chunk_size_x = 1280
    chunk_size_y = 1080

for the highest-resolution level, while for the other levels it scales as chunk_size_x / coarsening_factor_xy**level. This means that the chunks are two times smaller than the ones we were using earlier. Of course we can play with these values (Joel was suggesting a 10x reduction, for instance), but at the moment we think it's best not to expose them to the final user.
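
A minimal sketch of this level-dependent rechunking (the array shape and most variable names are assumptions for illustration; this is not the actual yokogawa_to_zarr code):

import dask.array as da
import numpy as np

chunk_size_x = 1280
chunk_size_y = 1080
coarsening_factor_xy = 2
num_levels = 5

# Stand-in (z, y, x) stack: 10 z planes of 2160x2560 pixels.
level = da.zeros((10, 2160, 2560), chunks=(1, 2160, 2560), dtype="uint16")
pyramid = []
for lvl in range(num_levels):
    # chunk size shrinks together with the data at each level
    chunks = (1,
              max(chunk_size_y // coarsening_factor_xy ** lvl, 1),
              max(chunk_size_x // coarsening_factor_xy ** lvl, 1))
    pyramid.append(level.rechunk(chunks))
    if lvl < num_levels - 1:
        # coarsen only along y and x (axes 1 and 2), never along z
        level = da.coarsen(np.min, level,
                           {1: coarsening_factor_xy, 2: coarsening_factor_xy},
                           trim_excess=True)

for lvl, arr in enumerate(pyramid):
    print(lvl, arr.shape, arr.chunksize)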

tcompa commented 2 years ago

Small concern on the file sizes: If the original data is 1.6GB, I would expect the zarr file to be very similar in size (the original images are already compressed). Also, /data/active/fractal-temp/3D/PelkmansLab/CardiacMultiplexing/Cycle1_subset_extras/ is probably not a good general test set.

(from https://github.com/fractal-analytics-platform/mwe_fractal/issues/32#issuecomment-1129692260)

This behavior changed when we changed the data. We are now using /data/active/fractal-temp/3D/PelkmansLab/CardiacMultiplexing/Cycle1_testSubset, where the images have a total size of

$ du -ch /data/active/fractal-temp/3D/PelkmansLab/CardiacMultiplexing/Cycle1_testSubset/*_B03_*C0{1,2,3}.png | tail -n1
678M    total

If I now run yokogawa_to_zarr (with num_levels=1 and coarsening_factor_z=1), the zarr size is 694 M (note that this size also depends on the chunking - more on this later). This is in line with @jluethi's expectation that the original images were already reasonably well compressed.

jluethi commented 2 years ago

Of course we can play with these values (Joel was suggesting a 10x reduction, for instance), but at the moment we thought that it's best not to expose them to the final user.

I don't really know what the optimal chunking would be at the moment; 10x was just to illustrate the idea. I've recently read that chunking so that each chunk is ~1 MB is very reasonable for Zarr, so the chunking you're currently doing may be more appropriate than a 10x reduction :)

And glad to see those file sizes, that sounds reassuring!

I am testing the increased pyramid levels with the new chunking on the large 23-well dataset now; that run takes some 4 h or so.

tcompa commented 2 years ago

Indeed, I'm trying to see what the effect of different chunking is. Before evaluating napari performance as a function of chunk-size, I just looked at the zarr file and its size.

As an example, I'm running fractal_cmd.py with the parameters

                "dims": "2,2",
                "num_levels": 5,
                "coarsening_xy": 3,
                "coarsening_z": 1,

and then I play with the chunk_size_x and chunk_size_y parameters in yokogawa_to_zarr.

In the "extreme" case of a x10 reduction (chunk_size_x=256, chunk_size_y=216), the chunk structure is very complex

Chunks at level 0:
 ((1, 1, 1), (1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 240), (214, 214, 214, 214, 214, 214, 214, 214, 214, 214, 214, 214, 214, 214, 214, 214, 214, 214, 214, 214, 214, 214, 214, 198))
Chunks at level 1:
 ((1, 1, 1), (1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 80), (72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 50))
Chunks at level 2:
 ((1, 1, 1), (1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 16), (24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 16))
Chunks at level 3:
 ((1, 1, 1), (1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 7), (8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 5))
Chunks at level 4:
 ((1, 1, 1), (1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2), (2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1))

and the level sizes in the zarr file are:

998M    B/03/0   # <-- total size
724M    B/03/0/0
102M    B/03/0/1
50M     B/03/0/2
53M     B/03/0/3
70M     B/03/0/4

Notice the weird fact that the sizes of the low-resolution levels increase again (levels 3 and 4 are larger than level 2), presumably because of the per-file overhead of storing so many tiny chunks!

If I now take a less extreme chunking (x2 instead of x10), the chunk structure is

Chunks at level 0:
 ((1, 1, 1), (1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (1440, 1440, 1440), (1024, 1024, 1024, 1024, 1024))
Chunks at level 1:
 ((1, 1, 1), (1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (480, 480, 480), (342, 342, 342, 342, 338))
Chunks at level 2:
 ((1, 1, 1), (1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (160, 160, 160), (114, 114, 114, 114, 112))
Chunks at level 3:
 ((1, 1, 1), (1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (54, 54, 52), (38, 38, 38, 38, 37))
Chunks at level 4:
 ((1, 1, 1), (1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (14, 14, 14, 11), (13, 13, 13, 13, 11))

and the sizes are

783M    B/03/0   # <--- total
690M    B/03/0/0
79M     B/03/0/1
9.8M    B/03/0/2
2.3M    B/03/0/3
3.0M    B/03/0/4

Notice that the level sizes resemble more closely the expected sequence 1, 1/9, 1/9^2, 1/9^3, ...

I'd say that the x10 reduction of chunk size seems too extreme, as it creates a very complex structure and unexpected sizes for the different levels. But this is not yet based on any specific test with napari, nor on more realistic use cases. Also, the "extreme" choice runs considerably slower (116 seconds with a x2 chunk-size reduction, 441 seconds with a x10 chunk-size reduction), and the resulting zarr folder is much heavier to transfer (via scp, for instance).

At the moment, I propose that we stick with an intermediate choice, for instance the x2 reduction which is currently hard-coded. Tests with more realistic datasets or comparisons of napari performance are more than welcome!

jluethi commented 2 years ago

Thanks for the research on file sizes! Chunking may not actually give us much performance gain (at least with the current ome-zarr napari plugin implementation). The x2-chunked, 10-pyramid-level dataset performs much worse for full plates than the non-chunked one with 5 pyramid levels. I suspect it's a mix of the number of files and the number of directories (see also here: https://github.com/fractal-analytics-platform/mwe_fractal/issues/45). I do think chunking may become more valuable performance-wise once some issues of ome-zarr loading have been addressed. But if it is either similar in performance or not much worse, it may be easier to keep the files unchunked. That would also make it easier to know which part belongs to which region of an original image (e.g. important for the illumination correction we're planning to build next). @tcompa Could chunking be a parameter that is "off" by default, i.e. the default chunks are the input images as we had them originally?

jluethi commented 2 years ago

Summarizing our discussion at the call today regarding chunking: @tcompa & @mfranzon already have an improved chunking implementation. Chunks decrease in size, but only up to a limit; beyond that, larger original regions go into the same chunk. This means fewer chunks for higher pyramid levels, while chunks never become too tiny (large overhead, tons of small files => slow copying, slow visualization). Also, we are not going to use chunks as a metric for any of the further processing, as datasets can be chunked very differently. That means it becomes more important again to enable a napari visualization workflow where things are parsed into folders per field of view, not a single field of view per well. See https://github.com/fractal-analytics-platform/mwe_fractal/issues/36 for the discussion; we have a branch, but the ome-zarr-py library currently doesn't handle this well.

tcompa commented 2 years ago

The current version (in the dev-parsl branch) has a very naive chunking strategy. We hard-coded values that correspond to a x2 reduction of the chunk size with respect to the original images:

    chunk_size_x = 256 * 5
    chunk_size_y = 216 * 5

Then we rechunk all levels of the pyramid with this size, meaning that as we go to higher levels (-> lower resolution) there will be fewer branches at each layer of the zarr structure, but the files at the deepest layer (the chunks along the X dimension) will always have a roughly constant size (not depending on the pyramid level).
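
A minimal sketch of this constant-chunk-size scheme, in contrast to the level-scaled chunks sketched further above (again with assumed shapes and a stand-in array; this is not the actual dev-parsl code):

import dask.array as da
import numpy as np

chunk_size_x = 256 * 5   # 1280
chunk_size_y = 216 * 5   # 1080
coarsening_xy = 3
num_levels = 5

# Stand-in (z, y, x) stack covering a 2x2 grid of 2160x2560 sites.
level = da.zeros((10, 2160 * 2, 2560 * 2), chunks=(1, 2160, 2560), dtype="uint16")
pyramid = []
for lvl in range(num_levels):
    # every level gets the *same* xy chunk size, so chunk files stay roughly
    # constant in size and higher levels simply contain fewer of them
    pyramid.append(level.rechunk((1, chunk_size_y, chunk_size_x)))
    if lvl < num_levels - 1:
        level = da.coarsen(np.min, level, {1: coarsening_xy, 2: coarsening_xy},
                           trim_excess=True)

for lvl, arr in enumerate(pyramid):
    print(lvl, arr.shape, arr.chunksize, arr.numblocks)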

This choice produces folders that come close to the ideal size ratio. For instance, with a coarsening factor of 3 we expect a ratio of 9 between two subsequent levels, and we find:

380K    20200812-CardiomyocyteDifferentiation14-Cycle1.zarr/B/03/0/4
1.3M    20200812-CardiomyocyteDifferentiation14-Cycle1.zarr/B/03/0/3
8.7M    20200812-CardiomyocyteDifferentiation14-Cycle1.zarr/B/03/0/2
78M     20200812-CardiomyocyteDifferentiation14-Cycle1.zarr/B/03/0/1
690M    20200812-CardiomyocyteDifferentiation14-Cycle1.zarr/B/03/0/0
tcompa commented 2 years ago

Note that until commit https://github.com/fractal-analytics-platform/mwe_fractal/commit/1d716cb013d603db36fce107cae86a69154f5767 the chunk sizes for X and Y were swapped (I had set chunk_size_x for the Y axis and chunk_size_y for the X axis). This is now fixed.