constantinpape / cluster_tools

Distributed segmentation for bio-image-analysis
MIT License

Support for cloud-based datastores? #23

Open j6k4m8 opened 2 years ago

j6k4m8 commented 2 years ago

This looks like a super powerful tool, and I'm looking forward to using it! I'd love to implement an API abstraction for cloud datastores like BossDB or CloudVolume so that one could, in theory, generate peta-scale segmentations without having to download the data and reformat it into n5/hdf5.

These datastores tend to have client-side libraries that support numpy-like indexing, e.g.:

# Import intern (pip install intern)
from intern import array

# Save a cutout to a numpy array in ZYX order:
em = array("bossdb://microns/minnie65_8x8x40/em")
data = em[19000:19016, 56298:57322, 79190:80214]

My understanding is that this should be a simple drop-in replacement for ws_path and ws_key if we had a class that looked something like this:

from intern import array

class BossDBAdapterFile:
    """Minimal file-like wrapper so a BossDB address can stand in for an n5/hdf5 file."""

    def __init__(self, filepath: str):
        # `filepath` is a bossdb:// address rather than a path on disk.
        self.array = array(filepath)

    def __getitem__(self, groupname: str):
        # Mimic f[key] on an n5/hdf5 container: whatever key is requested,
        # hand back the dataset-like intern array.
        return self.array

    ...

(I've surely forgotten a few key APIs and some of the organization, but this is the gist.)
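To illustrate the intended use, a purely hypothetical snippet mirroring how an n5/hdf5 container is indexed today (this is not an existing cluster_tools API, just the adapter sketched above):

# Hypothetical usage of the BossDBAdapterFile sketched above.
f = BossDBAdapterFile("bossdb://microns/minnie65_8x8x40/em")
ds = f["em"]                        # any key returns the dataset-like intern array
block = ds[0:16, 0:1024, 0:1024]    # numpy-style slicing served from the cloud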

Is this something that you imagine is feasible? Desirable? My hypothesis is that this would be pretty straightforward and open up a ton of cloud-scale capability, but I may be misunderstanding. Maybe there's a better place to plug in here than "pretending" to be an n5 file?

constantinpape commented 2 years ago

Hi Jordan :).

Supporting BossDB or CloudVolume should indeed be relatively straightforward and would be a great addition here. I am using open_file from elf (another of my libraries, which wraps most of the "custom" algorithms used here) internally to open n5, hdf5 and zarr files; it also implements read-only support for some other file formats.

So the clean way would be to extend open_file such that it can return a wrapper around the cloud stores that enables read and (if necessary) write access. The extensions for open_file are implemented here. Note that open_file currently dispatches just on the file extension, see here. But it would be totally ok to add a check beforehand that tests whether the input is a URL (or whatever address you would pass for the cloud store) and then returns the appropriate wrapper if it is.
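To make this concrete, a rough sketch of what such a check and wrapper could look like. The names BossDBFile, _is_cloud_address and open_file_with_cloud_support are made up for illustration (they are not existing elf functions), only read access via intern is covered, and the real change would of course go into elf's open_file dispatch itself rather than wrapping it from the outside:

from intern import array
from elf.io import open_file


class BossDBFile:
    """Read-only, file-like wrapper around a bossdb:// address."""

    def __init__(self, address: str, mode: str = "r"):
        if mode != "r":
            raise NotImplementedError("Only read access is sketched here.")
        self.address = address
        self._array = array(address)

    def __getitem__(self, key):
        # Any key resolves to the single dataset behind the address.
        return self._array

    # Support use as a context manager, like h5py.File / z5py.File.
    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False


def _is_cloud_address(path) -> bool:
    # Hypothetical pre-check that could run before the extension-based dispatch.
    return isinstance(path, str) and path.startswith("bossdb://")


def open_file_with_cloud_support(path, mode="r"):
    # Sketch of the combined dispatch: cloud addresses go to the wrapper,
    # everything else falls back to elf's open_file for n5/hdf5/zarr.
    if _is_cloud_address(path):
        return BossDBFile(path, mode=mode)
    return open_file(path, mode=mode)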

j6k4m8 commented 2 years ago

Hi hi :)

Super super awesome! In that case I'll start retrofitting open_file — do you prefer I open a draft PR into elf so you can keep an eye on progress and make sure I'm not going totally off the deep end? Happy to close this issue in the meantime, or leave it open in pursuit of eventually getting cloud datasets running through these workflows!

constantinpape commented 2 years ago

do you prefer I open a draft PR into elf so you can keep an eye on progress and make sure I'm not going totally off the deep end?

Sure, feel free to open a draft PR and ping me in there for feedback.

Happy to close this issue in the meantime, or leave it open in pursuit of eventually getting cloud datasets running through these workflows!

Yeah, let's keep this issue open and discuss integration within cluster_tools once we can open the cloud stores in elf. I'm sure a couple more things will come up here.

j6k4m8 commented 2 years ago

Starting here! https://github.com/constantinpape/elf/pull/41