dask / dask

Parallel computing with task scheduling
https://dask.org
BSD 3-Clause "New" or "Revised" License

Providing URLs to imread #5691

Open djhoese opened 4 years ago

djhoese commented 4 years ago

The imread function currently performs a glob on the filename given to it, since it expects a glob pattern that matches local files. I have a series of URLs, something that skimage's imread can handle, and would like to load them into one large dask array. From reading the code I can see that this is outside the function's expected use case, but I think it could be updated to handle it. I'm wondering what would be the best way to accomplish this, if it fits with what the original authors intended. Here's the main problem line:

https://github.com/dask/dask/blob/d0daa5bc7e86677b38794d4f9294fcc386f7b067/dask/array/image.py#L48

I'd like to make this accept an iterable, in which case we would not treat it as a glob pattern and would not assume that the images are on the local system at all.

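if isinstance(filename, str):
    # assume a glob pattern matching local files
    filenames = sorted(glob(filename))
else:
    # assume an iterable of URIs that already exist
    filenames = list(filename)
# continue on as normal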

Thoughts? Problems?

djhoese commented 4 years ago

Shoot, this is a problem too (getting mtime info for the token):

https://github.com/dask/dask/blob/d0daa5bc7e86677b38794d4f9294fcc386f7b067/dask/array/image.py#L52
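
One possible way around the mtime problem, sketched with the standard library only: fold the modification time into the token for sources that exist on the local filesystem, and fall back to the bare string for anything else, such as a URL. `source_token` is a hypothetical helper, not part of dask, and `hashlib` stands in for dask's own tokenizing here.

```python
import hashlib
import os


def source_token(sources):
    """Hash a list of image sources into a deterministic hex token.

    Hypothetical helper: local paths contribute their mtime, while
    sources that do not exist locally (e.g. URLs) contribute only
    their text, so os.path.getmtime is never called on them.
    """
    digest = hashlib.sha1()
    for src in sources:
        digest.update(src.encode("utf-8"))
        digest.update(b"\0")  # separator so adjacent entries cannot blur together
        if os.path.exists(src):
            digest.update(repr(os.path.getmtime(src)).encode("utf-8"))
            digest.update(b"\0")
    return digest.hexdigest()
```

The trade-off in this sketch is that a remote image that changes behind the same URL would keep the same token, so the staleness check only protects local files.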

jcrist commented 4 years ago

Updating this to match the other IO routines makes sense to me. I suggest looking at how other IO routines interact with fsspec. The json reader may provide the simplest example of this: https://github.com/dask/dask/blob/d0daa5bc7e86677b38794d4f9294fcc386f7b067/dask/dataframe/io/json.py#L201-L213. Would you be willing to work on a PR?

mrocklin commented 4 years ago

In this case I recommend taking a look at http://image.dask.org/en/latest/, which I think has a more advanced method. cc @jakirkham

djhoese commented 4 years ago

I took a look at dask-image. It seems to use the pims library's open function. I avoided this initially because, based on the quick start, it looked like it only accepted single files or glob patterns, but looking deeper, it should take a list of filenames.

I wanted to use this functionality for a simple side project (a Kaggle competition for a local Python meetup), but I ended up using some parts of scikit-image that aren't dask-friendly as far as I can tell. The computations I'm doing are also per-frame, so having one giant dask array doesn't necessarily benefit me.

That said, I brought this up and bugged you, so I'll try to add the functionality. @jcrist, making the IO functions consistent would be nice. I'm wondering if there is a reason that this imread function was written to use mtime, though. I'm a little worried about breaking this. Regardless, maybe the dask core version of this should stay as-is and I should look at dask-image more.

djhoese commented 4 years ago

So it looks like @mrocklin has run into this already. pims.open doesn't like lists, which makes dask-image break. I was about to file a bug with pims, since their documentation says a list should work, but found this: https://github.com/soft-matter/pims/issues/310

jrbourbeau commented 4 years ago

> I'm wondering if there is a reason that this imread function was written to use mtime though

I believe the last-modified time is included in the array hash to avoid recomputation when imread(files) has already been computed and the files haven't been modified since.
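
The effect described above can be illustrated with a toy cache that assumes nothing about dask's internals: results are keyed on the path plus its last-modified time, so an untouched file is served from the cache and a modified one is read again. `cached_read`, `_cache`, and `reads` are hypothetical names used only for this sketch.

```python
import os

_cache = {}  # maps (path, mtime) to file contents
reads = []   # records every real read from disk, for illustration


def cached_read(path):
    """Return the file's bytes, re-reading only if its mtime has changed."""
    key = (path, os.path.getmtime(path))
    if key not in _cache:
        reads.append(path)
        with open(path, "rb") as f:
            _cache[key] = f.read()
    return _cache[key]
```

Calling `cached_read` twice on an unchanged file appends to `reads` only once. After the file is rewritten and its mtime moves, the key changes and the next call reads from disk again.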

djhoese commented 4 years ago

Right, but how often do image files that are being analyzed get modified? If this is common, or something that people are worried about, should this kind of check be adopted by all I/O reading functions?

jakirkham commented 4 years ago

If there are other things you need from dask-image's imread, issues on that repo would be welcome. 🙂