czbiohub-sf / iohub

Pythonic and parallelizable I/O for N-dimensional imaging data with OME metadata
https://czbiohub-sf.github.io/iohub/
BSD 3-Clause "New" or "Revised" License

identify optimal chunk size for computational imaging, DL, and visualization workflows #33

Open mattersoflight opened 1 year ago

mattersoflight commented 1 year ago

When data is stored in locations with significant I/O latency (an on-premise storage server or the cloud), a large number of small files creates excessive overhead during nightly backups, when transferring data across nodes, and when transferring data over the internet. We need to decide on a default chunk size for the data that will be read and written by DL workflows such as virtual staining, and we need to test how chunk size affects I/O speed on HPC.

@Christianfoley can you write a test for I/O performance (e.g., reading a batch of 2.5D stacks and writing the batch to a zarr store) using the iohub reader and writer, and run the tests on HPC?

@ziw-liu can then evaluate whether the chunk size that is optimal for the DL workflow also works well for recOrder/waveOrder.

Choices to evaluate (assuming a camera that acquires 2048x2048 images):

@JoOkuma how sensitive is the performance of dexp processing steps to chunk size?

JoOkuma commented 1 year ago

@mattersoflight, we've encountered this problem in the past, where we saw a huge performance drop when using a lot of (1, 1, 512, 512) tiles, especially when writing. So we set our default to the largest number of z-slices per chunk that zarr supports given the data type and the Y, X size.
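
For illustration, a minimal sketch of that default, assuming the roughly 2 GiB per-chunk buffer limit of Blosc (zarr's default compressor); the function name is made up:

```python
import numpy as np

# Blosc cannot compress buffers of 2 GiB or more, so cap the chunk
# at the largest number of z-slices that fits under that limit.
BLOSC_MAX_BYTES = 2**31 - 1

def max_z_chunk(shape_zyx, dtype):
    nz, ny, nx = shape_zyx
    bytes_per_slice = np.dtype(dtype).itemsize * ny * nx
    return min(nz, BLOSC_MAX_BYTES // bytes_per_slice)

# e.g. a uint16 stack from a 2048x2048 camera
print(max_z_chunk((512, 2048, 2048), np.uint16))  # -> 255 slices per chunk
```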

However, I've heard that this limitation is not due to having a lot of files, but to having them all in a single directory, so using subfolders should improve things. I never tested this myself, but it would explain why the OME-Zarr spec uses a NestedDirectoryStore (1 reference; more can be found online).

I would benchmark the performance of DirectoryStore vs NestedDirectoryStore. In our use case, a (1, 1, max Y, max X) chunk size would be ideal.
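
A rough sketch of such a comparison with the zarr-python 2.x store classes (paths, array shape, and chunking below are placeholders):

```python
import time

import numpy as np
import zarr

data = np.random.randint(0, 2**16, size=(64, 2048, 2048), dtype=np.uint16)

for store_cls in (zarr.DirectoryStore, zarr.NestedDirectoryStore):
    store = store_cls(f"bench_{store_cls.__name__}.zarr")
    start = time.perf_counter()
    # one full 2048x2048 plane per chunk
    z = zarr.create(
        shape=data.shape,
        chunks=(1, 2048, 2048),
        dtype=data.dtype,
        store=store,
        overwrite=True,
    )
    z[:] = data
    print(store_cls.__name__, f"{time.perf_counter() - start:.2f} s")
```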

JoOkuma commented 1 year ago

Regarding DL, IO is much slower than the forward/backward pass in our models, so we implemented a cache: we store a few 3D volumes in memory and sample patches from them; after a few iterations, we clear the cache and sample a new set of 3D volumes. Since the patches are random, we would rather sample from the in-memory cache than from zarr directly, because there is no guarantee that a patch will align with the chunks. If the chunks are small, each patch ends up reading multiple chunks (slow); if the chunks are large, each patch reads a single chunk, but that is also slow because the chunk itself is large.
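
A minimal sketch of that caching idea (not the actual dexp/training code; class name, array layout, and sizes are assumptions):

```python
import numpy as np
import zarr


class VolumeCache:
    """Keep a few whole 3D volumes in RAM and sample random patches from them."""

    def __init__(self, array: zarr.Array, n_volumes=4, patch=(32, 256, 256), refresh_every=1000):
        self.array, self.n_volumes, self.patch = array, n_volumes, patch
        self.refresh_every, self._count, self._cache = refresh_every, 0, []

    def _refresh(self):
        # one large zarr read per volume instead of one small read per random patch
        t_idx = np.random.choice(self.array.shape[0], self.n_volumes, replace=False)
        self._cache = [self.array[t, 0] for t in t_idx]  # assumes TCZYX, channel 0

    def sample(self):
        if self._count % self.refresh_every == 0:
            self._refresh()
        self._count += 1
        vol = self._cache[np.random.randint(len(self._cache))]
        pz, py, px = self.patch
        z0, y0, x0 = (np.random.randint(s - p + 1) for s, p in zip(vol.shape, self.patch))
        return vol[z0:z0 + pz, y0:y0 + py, x0:x0 + px]
```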

Recently, I went back to sampling patches from the dataset up front, saving them in an auxiliary directory, and using a standard directory dataset to process them. It's much faster and simpler, and, more importantly, I'm sure I'm always using the same training set when benchmarking multiple models, so it's a fair comparison.
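
And a sketch of that pre-extraction step (paths, layout, and patch size are made up), which trades disk space for reproducible, fast reads:

```python
from pathlib import Path

import numpy as np
import zarr


def export_patches(array: zarr.Array, out_dir: Path, n_patches=1000, patch=(32, 256, 256), seed=0):
    """Sample random ZYX patches once and save them as .npy files for a plain directory dataset."""
    rng = np.random.default_rng(seed)  # fixed seed -> identical training set across runs
    out_dir.mkdir(parents=True, exist_ok=True)
    t_max, _, z_max, y_max, x_max = array.shape  # assumes a TCZYX layout
    pz, py, px = patch
    for i in range(n_patches):
        t = rng.integers(t_max)
        z = rng.integers(z_max - pz + 1)
        y = rng.integers(y_max - py + 1)
        x = rng.integers(x_max - px + 1)
        np.save(out_dir / f"patch_{i:05d}.npy", array[t, 0, z:z + pz, y:y + py, x:x + px])
```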

All of this is regarding training. For inference, I think the fastest approach is just to use multiple SLURM jobs, and you don't need to worry that much about IO.

mattersoflight commented 1 year ago

> Recently, I went back to sampling patches from the dataset up front, saving them in an auxiliary directory, and using a standard directory dataset to process them. It's much faster and simpler, and, more importantly, I'm sure I'm always using the same training set when benchmarking multiple models, so it's a fair comparison.

Good point! Considering this, setting the chunk size to the image size turns out to be a good default when data is stored in a nested hierarchy.

@ziw-liu Is re-chunking a given zarr store (that may be chunked by z) easy and efficient?

ziw-liu commented 1 year ago

> Is re-chunking a given zarr store (that may be chunked by z) easy and efficient?

The implementation can be as simple as a few lines, and there are also existing tools for (computational) scalability.
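
For example, a few-line sketch with dask (the source path, array layout, and target chunks are placeholders):

```python
import dask.array as da

# load an existing TCZYX image array from an OME-Zarr store
src = da.from_zarr("existing.zarr/0")
# re-chunk to one full 2048x2048 plane per chunk and write to a new store
dst = src.rechunk((1, 1, 1, 2048, 2048))
dst.to_zarr("rechunked.zarr", overwrite=True)
```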

Performance, however, cannot be guaranteed. Zarr arrays are compressed, so there is no way around reading (and decompressing) every byte of data and then writing it back to disk. I think the limiting factor will likely be either network bandwidth or storage IOPS.

We can take a specific example to test if this will work for us.

ziw-liu commented 1 year ago

For context, the current implementation in #31 allows for any chunk size to be specified and will default to the ZYX stack shape if None is provided:

https://github.com/czbiohub/iohub/blob/278467209e9da20929ade781de5bbe9c19e3940a/iohub/ngff.py#L408-L444

So downstream applications can conveniently modify the chunking behavior regardless of the 'default' we impose. I also think it is better to be explicit about chunking when coding for IO-sensitive tasks.
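
For illustration, a sketch of passing explicit chunks at creation time (the `open_ome_zarr`/`create_image` names and arguments below reflect the later API and are assumptions here; adjust to the actual signature at the linked commit):

```python
import numpy as np
from iohub import open_ome_zarr

data = np.zeros((1, 1, 64, 2048, 2048), dtype=np.uint16)  # TCZYX

with open_ome_zarr("example.zarr", layout="fov", mode="w-", channel_names=["BF"]) as fov:
    # be explicit about chunking for IO-sensitive tasks instead of relying on the default
    fov.create_image("0", data, chunks=(1, 1, 1, 2048, 2048))
```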

mattersoflight commented 1 year ago

The question of chunk size came up again at Donuts & Development, and we agreed that we should measure it. Chunk-size vs. read/write benchmarks will guide both users of the library and its developers. For example, we can test the speed gain with tensorstore #26.

I'd bump this up in priority - right after we have the feature-complete release 0.1.0.

@ziw-liu and @AhmetCanSolak what benchmarks make sense for our HPC environment? Are there some existing benchmarks that should be refactored into iohub?

ziw-liu commented 1 year ago

> what benchmarks make sense for our HPC environment?

Our HPC environment is relatively stable, so I think we should design the tests around typical workloads, parameterizing over dataset types and analysis pipelines. A good starting point may be gathering representative datasets and converting them. A point @AhmetCanSolak brought up is that we also need to decide on benchmarking infrastructure (e.g. asv) so that the benchmarks are automated and reproducible.
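
If we go with asv, a write benchmark could look roughly like this (a sketch only; the file name, array shape, and chunk choices are placeholders):

```python
# benchmarks/bench_chunks.py -- airspeed velocity (asv) style benchmark sketch
import numpy as np
import zarr


class WriteChunks:
    """Time writing a 64x2048x2048 uint16 stack with different chunk shapes."""

    params = [[(1, 512, 512), (1, 2048, 2048), (16, 2048, 2048)]]
    param_names = ["chunks"]

    def setup(self, chunks):
        self.data = np.zeros((64, 2048, 2048), dtype=np.uint16)

    def time_write(self, chunks):
        z = zarr.open(
            "asv_tmp.zarr",
            mode="w",
            shape=self.data.shape,
            chunks=chunks,
            dtype=self.data.dtype,
        )
        z[:] = self.data
```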

AhmetCanSolak commented 1 year ago

I believe the default chunk_size selection is not a dealbreaker for most use cases. It seems fair to expect API consumers to set their chunk size according to their needs. Different pipelines can be benchmarked against varying chunk sizes, as @ziw-liu mentioned, but I don't think those benchmarks need to live in the iohub repository. What @JoOkuma suggests (benchmarking the performance of DirectoryStore vs NestedDirectoryStore) is interesting and can be part of our benchmark suite. Also, I created an issue on benchmarking; please give your input there (#57) regarding what you would like to see in our benchmark suite.