Open jmontoyam opened 3 years ago
Hi @jmontoyam,
you're completely right that dask_image
shouldn't be significantly slower than the "pure dask" example you provide. Cool that you provide code and timings!
I had a look at your example and found that the problem is that dask_image
reads each individual tile using pims.open
on the input file pattern, rather than reading in the relevant file in a more targeted manner. So for each read, pims
instantiates an ImageSequence
, and this is taking very long. These are the relevant lines of the dask_image
code:
I guess for the application you're working on your "pure dask" code does the trick. As you also commented, it's okay that it's a bit slower than the sequential non-dask version, since its primary purpose consists of providing lazy data chunks.
For dask_image
it would be great to fix this though. This issue relates to https://github.com/dask/dask-image/issues/121. I'm writing a PR proposing a solution (will be linked here).
Thank you very much @m-albert!, I am happy to contribute for the first time to this amazing project, even though it is just a tiny bug report :relaxed: ...Please excuse me the following beginner question: why does dask-image use pims for loading the images?...why not use, for instance, a standard delayed skimage.io.imread?
why does dask-image use pims for loading the images?...why not use, for instance, a standard delayed skimage.io.imread?
@jmontoyam That's a good question. Regarding the delayed
part of your question, dask_image
currently uses da.map_blocks
which can lead to cleaner dask graphs than constructing arrays using delayed
(would need to double check this though).
Regarding the skimage.io.imread
part of the question, in this thread https://github.com/dask/dask-image/issues/18 @jakirkham argues that pims
can return information about image dimensions in a lazy manner and supports the most file formats.
Thank you very much for the explanation @m-albert !...By the way, I have checked the #182 , and everything looks good to me, thank you very much for taking the time to address this issue! ;)
Sorry, clicked the wrong button!
I've just merged #182 from @m-albert now.
Since it seems like that fixes @jmontoyam's problem, I'll close this issue now. Feel free to reopen this issue if I'm wrong about that.
Reopening this issue, based on more recent discussions about the performance of imread.
From @m-albert in https://github.com/dask/dask-image/issues/194#issuecomment-791987698
Critically, dask.array.image.imread() does not show the same problem - https://github.com/dask/dask/blob/master/dask/array/image.py
Would we lose much functionality if that became the canonical way to open images? Perhaps we could adapt it to use pims instead of skimage by default.
This one wasn't on the list of imread functions profiled here. I'd say profiling the performance here might be a good next step. If it performs decently, maybe we move in that direction.
Here's a performance comparison:
import skimage.io
import pims
def pimread(fn):
with pims.open(fn) as imgs:
return np.asarray(imgs[0])
%%time
for i in range(10):
dask_image.imread.imread(os.path.join(folder,'im_*.tif')).compute(scheduler='threads')
# 8.7s Nz, Ny, Nx = 1000, 100, 100
%%time
for i in range(10):
dask.array.image.imread(os.path.join(folder,'im_*.tif'), imread=pimread).compute(scheduler='threads')
# 6.8s Nz, Ny, Nx = 1000, 100, 100
%%time
for i in range(10):
dask.array.image.imread(os.path.join(folder,'im_*.tif'), imread=skimage.io.imread).compute(scheduler='threads')
# 5.1s Nz, Ny, Nx = 1000, 100, 100
So dask.array.image.imread
leads to faster computations than dask_image.imread.imread
and skimage.io.imread
is faster than pims.
For array creation, since dask.array.image.imread
reads the first tp from disk for determining image shape and dask_image.imread.imread
uses pims
to determine shape, the latter can be faster in the case of a few huge files. Otherwise dask.array.image.imread
is much faster.
%%time
dask_image.imread.imread(os.path.join(folder,'im_*.tif'))
# 470ms Nz, Ny, Nx = 2, 10000, 10000
# 250ms Nz, Ny, Nx = 10000, 10, 10
dask.array.image.imread(os.path.join(folder,'im_*.tif'))
# 1.05s Nz, Ny, Nx = 2, 10000, 10000
# 50ms Nz, Ny, Nx = 10000, 10, 10
From @jakirkham in https://github.com/dask/dask-image/issues/194#issuecomment-792002652
One could create a hybrid solution. Namely use PIMS just to find
shape
anddtype
info and then useskimage.io.imread
or similar to read in each chunk on workers
From @jakirkham in #194 (comment) One could create a hybrid solution. Namely use PIMS just to find shape and dtype info and then use skimage.io.imread or similar to read in each chunk on workers
That's a good idea. Given the good performance of skimage.io.imread
, probably it makes sense to use it as a default for reading images. Do you think it makes sense to add scikit-image
as a dependency for dask-image
? Alternatively, we could try importing it and use pims
as a fallback:
try:
from skimage.io import imread as imread_func
except (AttributeError, ImportError):
def imread_func(fn):
return np.array(pims.open(fn))
pass
Yeah we are already thinking about having some scikit-image style APIs. So adding scikit-image as a dependency makes sense
cc @jni (in case you have thoughts here 🙂)
Doesn't scikit-image use imageio
to read in images? I think that's the default, although they do sometimes use tifffile, or other image reader plugins.
scikit-image is a very large dependency (mostly because it includes scipy, which I guess is something we already do)
One thing I wish we had was a lightweight, minimal dependency way to install just an image reader to get stuff into a dask array. I'm not sure if there's a really clean way to handle this.
Does it? I got the impression it was more complicated than that. Feel free to contradict me though if that's not accurate (guessing you or Juan would know better)
Yeah we've discussed with imageio before if they could provide the shape and dtype without loading the full image ( https://github.com/imageio/imageio/issues/362 ). This would actually let it take the place of PIMS for Dask Array construction and handle the data loading. Unfortunately that hasn't been solved yet.
From @jakirkham in #194 (comment) One could create a hybrid solution. Namely use PIMS just to find shape and dtype info and then use skimage.io.imread or similar to read in each chunk on workers
That's a good idea. Given the good performance of
skimage.io.imread
, probably it makes sense to use it as a default for reading images. Do you think it makes sense to addscikit-image
as a dependency fordask-image
? Alternatively, we could try importing it and usepims
as a fallback:try: from skimage.io import imread as imread_func except (AttributeError, ImportError): def imread_func(fn): return np.array(pims.open(fn)) pass
One thing that is worth noting here is PIMS does seem to be fairly clever about loading a range of pages from a TIFF, which can be handy if the TIFFs are massive. Loading a smaller range of pages can result in more manageable chunks for Dask. While I have seen this use case in the wild, not everyone does this. I'm guessing it is more a function of the acquisition software people are using. Not saying we should tailor things to that use case, but had forgotten this detail until we started discussing this recently. So wanted to share that context
Does it? I got the impression it was more complicated than that. Feel free to contradict me though if that's not accurate (guessing you or Juan would know better)
I think you're probably right.
One thing that is worth noting here is PIMS does seem to be fairly clever about loading a range of pages from a TIFF, which can be handy if the TIFFs are massive. Loading a smaller range of pages can result in more manageable chunks for Dask. While I have seen this use case in the wild, not everyone does this. I'm guessing it is more a function of the acquisition software people are using. Not saying we should tailor things to that use case, but had forgotten this detail until we started discussing this recently. So wanted to share that context
That's a really important point.
This seems like a good point to mention Nick's earlier question about loading movie files efficiently https://github.com/dask/dask-image/issues/134, which is also an important use case to keep in mind. For what it's worth, I'm collaborating with a group who are working with .mp4
video data so this is relevant for me now too.
Thanks for reminding me about Nick's use case.
If we do care about optimizing both use cases, we could try to detect when chunks are smaller than the original files and handle these with PIMS. Otherwise just load the whole thing into memory with scikit-image (or something else we decide to use here). This could get a little tricky, but think this is doable.
Not sure whether this will be easier to do during graph construction or at image load time. So probably whichever is easier would be the way to go initially and we could change things if there's a performance benefit later
I do have opinions here!
(1) skimage.io will eventually become a thin wrapper around imageio. (2) imageio will start to do smarter things around lazy loading, see https://github.com/imageio/imageio/issues/569, https://github.com/imageio/imageio/pull/574, and the links therein. (3) overall there should be a community-wide effort around this, see https://blog.danallan.com/posts/2020-03-07-reader-protocol/
None of this is particularly helpful re dask-image's present choice, except to say that maybe some/all of the effort in this discussion should go towards those issues rather than towards adding Yet Another way of wrapping wrappers around IO libraries.
Re tifffile, for reading tiffs it always boils down to that (whether you're using imageio or skimage.io), so if you want to do lazy loading of big tiffs I suggest implementing it on top of tifffile directly — it certainly has that capability, no need for PIMS here.
One thing I wish we had was a lightweight, minimal dependency way to install just an image reader to get stuff into a dask array. I'm not sure if there's a really clean way to handle this.
@GenevieveBuckley
pip install imageio
:P Currently, it depends on numpy
and pillow
, but if pillow
is too much (common formats like .jpg, .png, .gif), there is the option to install it using pip install --no-dependencies imageio
and then only install the plugins you actually want/need.
One can also envision a dask plugin/wrapper to allow loading the image straight into a dask_image and avoid a copy. I'm thinking about doing this for pytorch and tensorflow at some point down the line (directly load into pinned memory) because my current lab does a lot of deep learning with images. __array_interface__
may already do this trivially for dask, because imageio returns a numpy array.
For what it's worth, I'm collaborating with a group who are working with .mp4 video data so this is relevant for me now too.
This is also something that can be done with imageio 🤩. We maintain an ffmpeg wrapper (pip install imageio[ffmpeg]
), which can then read videos frame-by-frame or as an image stack via the familiar API. It will get even better once we have https://github.com/imageio/imageio/pull/574 merged. In the future, I hope to replace this with a wrapper around av, because there is little point in duplicating the effort of cleanly wrapping and shipping ffmpeg.
Thoughts and feature requests are (of course) appreciated.
Does it? I got the impression it was more complicated than that.
@jakirkham It will 👼 as soon as I get https://github.com/imageio/imageio/pull/574 merged and find the time to write the wrapper for skimage.
Yeah we've discussed with imageio before if they could provide the shape and dtype without loading the full image [...]
Metadata for images is a never-ending story xD I think the reason there is no clear standard for it yet in imageio is that every format has its own set of metadata, so it is non-trivial to find common ground that we can guarantee to provide for all formats. For me specifically, the user-side is a bit of a black box, because I've not really seen use-cases yet where people actively consume metadata; then again I've only recently joined this particular corner of the internet, so there is a lot I may not know (yet).
@jakirkham are there any disadvantages you see in @FirefoxMetzger's comment?
A quick update/question (to bring this thread back from the dead): I have started looking into adding out
and like
arguments to some of ImageIO's plugins (https://github.com/imageio/imageio/issues/670) to improve I/O speed. In particular, I'd like to bring down the number of copies involved in the overall process. Since I was also thinking of making this work with dask arrays, too, I have two questions:
Regarding like
: Can dask efficiently consume the Buffer Protocol? If no what would be the most efficient/desirable way to construct a dask array from a memoryview? Wrap the buffer into a numpy array and call from_array
on it?
Perhaps more importantly regarding out
: Is it possible to get a writable memoryview
into dask's data? This is me being naive and unknowledgable about dask's internal layout, but what I am hoping for is to have dask create a (C-contiguous) buffer into which I can directly place decoded pixel values. This way, we could avoid allocating intermediary buffers and also reuse existing structures, which will be really useful for big datasets, e.g., when iterating over a video in batches.
Also, if I should not necromance this thread and instead start a new one to discuss this, please let me know and I'll do that :D
Sorry, but I don't really follow. This doesn't seem relevant to Dask AFAICT. It isn't NumPy. Dask Array's are lazy and do not themselves support the Python Buffer Protocol. Individual Dask chunks would be created by asking ImageIO to open a file. Generally Dask Arrays expect NumPy or NumPy-like chunks. Also Dask defers to other libraries to create memory they need to consume.
From the Dask perspective, the really important thing here is Dask needs to first construct a task graph, which contains accurate metadata. Namely the shape
and dtype
. Ideally we would want to load that metadata as fast as possible. This is what users here are raising. They want that metadata loading step to be faster. This is where ImageIO could potentially help.
Ideally we want to avoid loading images (slow) to get that metadata. Reading headers of files would be ideal. At least from Dask's perspective, it doesn't really need to know more than shape
& dtype
. This got mentioned above in this comment ( https://github.com/dask/dask-image/issues/181#issuecomment-797177004 ). Have copied the relevant bit below:
Yeah we've discussed with imageio before if they could provide the shape and dtype without loading the full image ( imageio/imageio#362 ). This would actually let it take the place of PIMS for Dask Array construction and handle the data loading. Unfortunately that hasn't been solved yet.
Dask Array's are lazy and do not themselves support the Python Buffer Protocol. Individual Dask chunks would be created by asking ImageIO to open a file. Generally Dask Arrays expect NumPy or NumPy-like chunks.
Oh, in that case, sorry for the noise. I was under the impression that images are loaded centrally and that chunks are then distributed as needed. Since this is not the case, my mental model is off and this is indeed not too relevant.
At least from Dask's perspective, it doesn't really need to know more than shape & dtype.
I think we can do that. It is possible for the major plugins which cover all commonly used image formats, so that should get us 90% there. I'll look into this once my current round of PRs is merged 👍
No worries. Dask would run the loading step on workers that would use the data directly. In general Dask tries to minimize communication both in its usage model and when distributing tasks to workers as communication can get pretty expensive. While I can imagine use cases that might benefit, unfortunately I don't think Dask is one of them.
That would be incredibly useful! Thank you 😄
More discussion at https://github.com/dask/dask/pull/8385
Dear dask_image community,
I am a new dask_image user. Maybe, due to my beginner level, I am doing something wrong, but I noticed that reading a collection of images using dask_image is much slower that using single-threaded skimage. I have installed the latest dask_image version available in pypi (dask_image version 0.4.0). In the following example, I am reading 398 images, all of them with the same dimension (64x10240, uint16). Taking into account the dimensions and numbers of images, I would expect dask_image to be slighly slower that single-threaded skimage (due to the tiny dask overhead involved in opening this small number of "tiny images"), but instead the performance of dask_image is much slower (around 24x). Then I proceed to implement the image reading function in pure-dask and the performance is much better than the one obtained with dask_image. In the following I will report the benchmarks results (all the following code-snippets load the same data successfully):
import glob import numpy as np import skimage.io import dask_image.imread from dask import delayed import dask.array as da
Single-threaded-skimage baseline
Elapsed time: 510 milliseconds
Using dask_image
Elapsed time: 12.1 seconds
Using pure-dask
Elapsed time: 1.09 seconds
Using dask-image with synchronous scheduler
Elapsed time: 3 seconds
Using dask-image with processes scheduler
Elapsed time: 6.63 seconds
Using dask-image with threads scheduler
Elapsed time: 12 seconds
Environment:
Thank you very much for your all help ;)