Open mrocklin opened 5 years ago
@jakirkham mentioned the following in a separate conversation:
Does store or to_zarr not work? This is my sense of what people do today.
Could be, I don't actually do this work. So people today don't write many images out? I would expect that for analysis many people would use something like Zarr or HDF as intermediate formats, but that for long time archives, sharing, or publishing people would still want to save to PNG or TIFF or something.
So people today don't write many images out? I would expect that for analysis many people would use something like Zarr or HDF as intermediate formats, but that for long time archives, sharing, or publishing people would still want to save to PNG or TIFF or something
Microscope recording software definitely writes out many images today. This is used as input for analysis and is also archived for long term storage. This may also be the thing that is shared with others.
What users produce is dependent on their analysis. One use case is to produce Regions of Interest, which could live happily in JSON. Another use case is to do some cleanup on this data and ingest it into some sort of centralized database. Other use cases produce Zarr/N5 files or HDF5 files, which may be shared and used for further analysis or could go into long term storage.
Publication/sharing may mean hosting the data with a web server, which means having a robust database to back it is pretty important. It could also mean generating some figures in a paper, which are likely generated outside of the analysis pipeline altogether.
Just FYI. The Satpy project uses Dask to process satellite imagery in a chunk-based fashion. It allows saving results to disk as a GeoTIFF, PNG etc.
I needed this in my science work and came up with this, based on gufuncs:
import dask.array as da
from skimage.io import imsave


def da_imsave(fnames, arr, compute=False):
    """Write arr to a stack of images, treating the last two
    dimensions of arr as the image dimensions.

    Parameters
    ----------
    fnames : str
        A format string like 'myfile{:02d}.png'.
        Must accept arr.ndim - 2 indices.
    arr : dask.array.Array
        Array of at least 2 dimensions to be written to disk as images.
    compute : bool, optional
        Whether to write to disk immediately or return a dask array
        of the indices still to be written.
    """
    # One index range per leading (non-image) dimension, chunked like arr
    indices = [da.arange(n, chunks=c) for n, c in zip(arr.shape[:-2], arr.chunksize[:-2])]
    # Array holding, for every image in the stack, its own leading index
    index_array = da.stack(da.meshgrid(*indices, indexing='ij'), axis=-1).rechunk({-1: -1})

    @da.as_gufunc(signature=f"(i,j),({arr.ndim-2})->({arr.ndim-2})", output_dtypes=int, vectorize=True)
    def saveimg(image, index):
        # Write a single 2D image; the filename is built from its index
        imsave(fnames.format(*index), image.squeeze())
        return index

    res = saveimg(arr, index_array)
    if compute:
        res.compute()
    else:
        return res
Would it be useful to turn this into a pull request, either here or in dask/dask? What would still be needed for that?
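To make the filename convention in the snippet above concrete, here is a minimal sketch in plain Python (no dask required) of how a format string with arr.ndim - 2 placeholders maps onto the leading indices of a stack. The helper name stack_filenames is hypothetical and is not part of dask or dask-image; it only mirrors the assumption made by da_imsave that the last two dimensions are the image dimensions.

```python
from itertools import product


def stack_filenames(pattern, shape):
    """Return {leading index tuple: filename} for an image stack.

    Hypothetical helper for illustration only. Assumes, like da_imsave
    above, that the last two entries of `shape` are the image dimensions
    and that `pattern` has one placeholder per leading dimension.
    """
    if len(shape) < 2:
        raise ValueError("need at least two dimensions (rows, columns)")
    leading = shape[:-2]
    # product() over no ranges yields a single empty tuple, so a plain 2D
    # image maps to exactly one filename.
    return {idx: pattern.format(*idx) for idx in product(*(range(n) for n in leading))}


names = stack_filenames("frame_t{:02d}_z{:02d}.png", (2, 3, 512, 512))
print(len(names))      # 6 files, one per (t, z) pair
print(names[(1, 2)])   # frame_t01_z02.png
```

Each key corresponds to one row of the index array that da_imsave builds, so this is a quick way to check a format string before launching a large write.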
Hi @TAdeJong!
What would be needed is: (a) For us to decide how saving should work in dask-image. This could be a bit of a bottleneck. (b) To make a saving function that is a little more general than your example above. You have a few assumptions that probably wouldn't work for everybody (e.g. that you have a 2D image, that the last two dimensions of the array describe spatial dimensions, etc.). Some of this will depend on the result of the discussion in (a).
Re: comments by @mrocklin and @jakirkham : As I see it:
I think we should prioritise group 1 with a view to extending to groups 2 (and perhaps 3?) down the track.*
*Edit: upon reflection only a small part of this is a plausibly good idea.
Hi @GenevieveBuckley, (a): I was comparing to dask/dask/array/image.py and think it would at least be nice to get similar capabilities for writing out as for reading in. In that sense, I think it might be a good idea to put both reading and writing capabilities in the same place, but beyond that I have no opinion on whether this should be in core dask or in dask-image. (b) I agree that color/multichannel support is desirable, and it is not hard to add to this code (via an explicit switch plus guessing based on the last dimension, i.e. whether it has length 3 or 4). For images, I think that in terms of memory layout it only makes sense if the last 2 (or 3 in the case of RGB(A)) dimensions are the individual images, so I would assume an explicit transpose/swapaxes by the user would be the way to go there, of course in combination with clear documentation and examples.
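The "explicit switch plus guessing" idea for colour images could look roughly like the sketch below. It works on shapes only and follows the assumptions stated in the comment above (channels last; a trailing axis of length 3 or 4 means RGB or RGBA). The function name split_image_dims and the multichannel parameter are hypothetical, not an existing dask or dask-image API.

```python
def split_image_dims(shape, multichannel=None):
    """Split an array shape into (stack dimensions, single-image dimensions).

    multichannel=True/False is the explicit switch; None means guess:
    a trailing axis of length 3 or 4 is taken to be RGB or RGBA.
    """
    if multichannel is None:
        multichannel = len(shape) >= 3 and shape[-1] in (3, 4)
    n_image_dims = 3 if multichannel else 2
    if len(shape) < n_image_dims:
        raise ValueError(f"shape {shape} has too few dimensions")
    return tuple(shape[:-n_image_dims]), tuple(shape[-n_image_dims:])


print(split_image_dims((10, 512, 512)))      # ((10,), (512, 512))
print(split_image_dims((10, 512, 512, 3)))   # ((10,), (512, 512, 3))
# The guess misfires for a grayscale stack whose last axis happens to have
# length 4, which is why the explicit switch is needed:
print(split_image_dims((10, 512, 4), multichannel=False))  # ((10,), (512, 4))
```

The stack dimensions are the ones that would be formatted into the filename, and the image dimensions are what a single call to the image writer would receive.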
Regarding the compressed way to write out data, I wonder if there are any features that would be needed in addition to what dask.array.to_zarr() offers?
I do think there's a place for functionality that saves image files (even if it's a basic functionality) in dask-image. So no replicating functionality that already exists in dask itself (like dask.array.to_zarr()), but we might have something specifically for saving to image-specific formats.
When I say "more than just 2D arrays", I don't only mean that sometimes we have colour channels. As a rough guide, I have to think about:
So we can expect typical data might have anywhere between 2 and 5 dimensions, and there's often a lot of variety in which order we see those dimensions.
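One way to cope with varying dimension order is to have the caller label the axes and derive the permutation that moves the image axes to the end. The sketch below is plain Python; the axis letters (T, C, Z, Y, X) and the function name axes_to_trailing are illustrative assumptions, not an existing dask-image convention.

```python
def axes_to_trailing(axes, image_axes="YX"):
    """Return a permutation that moves `image_axes` to the end.

    `axes` labels each dimension of the array, e.g. "TYXC". The result can
    be passed to `transpose` on a NumPy or dask array.
    """
    if len(set(axes)) != len(axes):
        raise ValueError(f"duplicate axis labels in {axes!r}")
    missing = [a for a in image_axes if a not in axes]
    if missing:
        raise ValueError(f"axes {missing} not found in {axes!r}")
    # Non-image axes keep their relative order and come first
    leading = [i for i, a in enumerate(axes) if a not in image_axes]
    # Image axes follow, in the order requested by the caller
    trailing = [axes.index(a) for a in image_axes]
    return tuple(leading + trailing)


print(axes_to_trailing("YXT"))           # (2, 0, 1): T first, then Y, X
print(axes_to_trailing("TYXC", "YXC"))   # (0, 1, 2, 3): already in order
print(axes_to_trailing("XYZCT"))         # (2, 3, 4, 1, 0)
```

With a permutation like this, a writer could keep the simple "images are the trailing dimensions" assumption internally while still accepting data in whatever order the user has it.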
Has any progress been made on this? I'm working on image processing and one of the smallest image sizes in my current dataset is 81000 by 31000 pixels. There isn't a quick way to save an array of this size as a PNG.
Hi @sumanthratna. No, there hasn't been any activity on this in the last couple of months.
You could try either adapting TAdeJong's script above for your purposes, or look at the Saalfeld lab's N5 library for reading/writing large arrays. It can write to file in parallel, which might help with speed. Caveats: I haven't used this library myself but just chatted to Stephan about it a few months ago; it's still in the early stages so it might not have the features or documentation you need for your project; and image chunks cannot be larger than 2GB which may or may not work for you. Good luck!
Just a note that using to_zarr, as in these examples, should also write this out to disk in parallel.
Related discussion: https://github.com/dask/dask/issues/3487
Hello there,
First of all, I want to thank you all for the library, which made my experiment possible. I am quite familiar with Python and would be happy to implement this method. I suppose it hasn't been implemented yet because there are some difficulties. May I ask what the major issues are?
I think Genevieve's comment above ( https://github.com/dask/dask-image/issues/110#issuecomment-519009634 ) pretty accurately captures the tricky points that would need to be addressed.
I'm really appreciating dask-image so far and an imsave/imwrite to e.g. tiff would make it even better.
I found myself reaching for an imsave function to complement imread. Presumably this would have similar semantics, and would effectively map over the skimage.io.imsave function, or something else in pims.
I don't have a concrete need though, this just came up when writing up an example.