NERC-CEH / plankton_ml

A project for image processing and analysis pipelines for plankton sampling
GNU General Public License v3.0

Confusion about the use of dask/xarray in this project #25

Closed jmarshrossney closed 3 weeks ago

jmarshrossney commented 2 months ago

I am somewhat aware of the 'headline' uses of dask and xarray but have never used them myself, so forgive me for being a little confused.

First, I think I'm correct in saying that at present there is nothing implemented that would suffer at all from using pandas/numpy instead. So my simple mind is thinking: why not just use the simpler tools until the point where they're no longer enough?

I note that the majority of current dask/xarray use originates from using intake-xarray to load the data, which in turn comes from scivision. So I don't know whether that's the only reason it's there, or whether I'm just not understanding what the near future looks like and how dask/xarray will help.

So I have some questions related to xarray...

  1. Is the idea to include useful metadata in the .tif files?
  2. If 1 is correct, is this the main reason to use xarray so that metadata can be 'carried around' with the image?
  3. If 2 is correct, is this actually useful to us? Can we not just construct a pytorch Dataset that serves the same purpose?
  4. If 1 is incorrect, what other reasons are there to use xarray?

and related to dask...

  5. Is the idea to eventually lazily load a large dataset of images from the s3 bucket using dask?
  6. If 5 is correct, what is the context in which we would do this?
  7. If 5 is correct, are we sure that's doable given that images are all different sizes?
  8. If the answer to 6 is 'training', do we have any idea of the scale of dataset we are looking at? My naive guess would be that downloading batches from the remote storage at computation time is going to be a bottleneck that we would only tolerate if we physically couldn't fit the dataset into local memory.
  9. If the answer to 6 is 'evaluation', wouldn't eventual users be evaluating images individually or in small batches? In which case how does dask help?

To be clear, I would expect dask/xarray to potentially make life much simpler down the line; I just want to get up to speed with others in understanding precisely how.

@mattjbr123 your wisdom would be very gratefully received!

metazool commented 1 month ago

It's a good series of questions to which I only have under-articulated answers!

The inclusion of dask/xarray comes via intake, which is part of the same ecosystem (it all came through scivision's integration with intake-xarray). We're not making use of dask/xarray in a way that would gain the benefits (e.g. by spreading the image preprocessing across a dask cluster to really scale it up), so you're right to call it out as adding surplus complexity.
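For concreteness, here's a made-up sketch of what that "spread the preprocessing across a dask cluster" pattern could look like - none of the paths or helper functions below exist in this repo:

```python
# Hypothetical sketch only: placeholder paths and a toy preprocessing step.
import dask.bag as db
from dask.distributed import Client
from skimage.io import imread
from skimage.transform import resize


def preprocess_image(path):
    """Load one image and apply a toy preprocessing step."""
    image = imread(path)
    return resize(image, (256, 256), anti_aliasing=True)


if __name__ == "__main__":
    client = Client()  # local cluster by default; could point at a remote scheduler
    paths = ["images/plankton_001.tif", "images/plankton_002.tif"]  # placeholder paths
    processed = db.from_sequence(paths).map(preprocess_image).compute()
    print(f"preprocessed {len(processed)} images")
```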

And also spot on about arbitrary image sizes meaning that we're passing them through the model one at a time, rather than in batches. (I don't think they're completely arbitrary - more likely a small set of sizes - but I'm not certain. This starts to veer into project/dataset-specific needs again, if we're trying to take a radical approach to prioritising any technical work by its reuse value.)

mattjbr123 commented 1 month ago

Hey y'all, I can attempt to answer some of these questions from my own experience of using xarray/dask.

The original idea behind xarray, as far as I understand it, was to enable metadata-aware calculations on netCDF data, the most basic example being selecting out datapoints based on coordinates such as time/lat/lon instead of relying on the manually constructed indices you would have to use with the original netcdf4 Python library. It's expanded a whole heap since then, with support and addons for different data types, processing backends such as dask, and convenience functions.
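As a minimal (entirely made-up) illustration of that coordinate-based selection:

```python
# Toy data: labelled selection instead of hand-computed integer indices.
import numpy as np
import pandas as pd
import xarray as xr

temps = xr.DataArray(
    np.random.rand(3, 4, 5),
    dims=("time", "lat", "lon"),
    coords={
        "time": pd.date_range("2024-01-01", periods=3),
        "lat": [50.0, 51.0, 52.0, 53.0],
        "lon": [-4.0, -3.0, -2.0, -1.0, 0.0],
    },
)

# Select by labels rather than working out positional indices yourself
subset = temps.sel(time="2024-01-02", lat=52.0, lon=slice(-3.0, -1.0))
```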

My 'journey' with it has been similar: I started off using it as just a more convenient way of processing netCDF data, then when the data got too big to fit in memory I started exploring/using its dask capability. You are able to use dask naively, and given that xarray invokes it automatically in many contexts, this is often what happens. I saw 'dask can help you process larger-than-memory data' and used this off-the-shelf xarray implementation to reduce larger-than-memory data to something that fit in memory or could be saved to disk. This worked fine for simple workflows, but for anything complicated you really need to start thinking carefully about how to implement the parallelism dask offers, or you end up with a workflow that in the best case takes far longer to complete than it needs to, and in the worst case grinds to a halt completely.
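In code terms, that "off-the-shelf" pattern is roughly the following - the file name and variable are placeholders, and it assumes dask is installed:

```python
import xarray as xr

# Open lazily in chunks; nothing is read into memory yet
ds = xr.open_dataset("big_dataset.nc", chunks={"time": 100})

# Build up a computation that reduces the data to something small; still lazy
monthly_mean = ds["temperature"].groupby("time.month").mean()

# Only now does dask actually read the chunks and do the work
result = monthly_mean.compute()
```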

I learnt this the hard way when, among other things, trying to rechunk data for my object storage work (which I eventually found a better solution for). Certain rechunking operations were a particular fail case of dask: for reasons that are now lost to me (perhaps when a full reshuffle of the data is required), the way dask constructs the task graph (or 'DAG', I believe, though I've always found the term unnecessarily jargony) meant that it tried to read the whole dataset into memory at once, defeating the point of using it. There are many a pangeo discourse forum and GitHub issue thread on this, which I can probably dig out if of interest. I believe this issue has largely been fixed now, but I'd learnt my lesson not to trust the automatic task graph that xarray-with-dask constructs if you don't manually step in.
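As a toy illustration of the kind of rechunk that implies a full shuffle (the array and chunk sizes are made up):

```python
import dask.array as da

x = da.random.random((100_000, 10_000), chunks=(10_000, 10_000))  # chunked along rows
y = x.rechunk((100_000, 100))  # rechunked along columns

# Every output chunk now depends on a slice of every input chunk, so the task
# graph becomes an all-to-all shuffle. Computing y can hold far more in memory
# than you'd expect; for large on-disk datasets the pangeo "rechunker" package
# exists to do this kind of operation with a bounded memory footprint.
```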

I think the TL;DR of all that is that you shouldn't just think 'oh my data is now bigger than memory, I'll invoke the dask backend of xarray and hope for the best'. It might work, but there's a good chance it won't.

So, to actually answer your questions:

  1. Is the idea to include useful metadata in the .tif files?

Yes

  2. If 1 is correct, is this the main reason to use xarray so that metadata can be 'carried around' with the image?

Yes
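For illustration, carrying metadata around with an image in xarray looks roughly like this - the attribute names and values are invented, not what the project uses:

```python
import numpy as np
import xarray as xr

pixels = np.zeros((600, 800, 3), dtype=np.uint8)  # stand-in for a decoded .tif
image = xr.DataArray(
    pixels,
    dims=("y", "x", "channel"),
    attrs={
        "sample_id": "ABC-123",                 # invented metadata fields
        "capture_time": "2024-05-01T10:30:00",
        "instrument": "FlowCam",
    },
)

print(image.attrs["sample_id"])  # the metadata travels with the array
```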

  3. If 2 is correct, is this actually useful to us? Can we not just construct a pytorch Dataset that serves the same purpose?

Probably, if you don't mind doing it and are not (and won't be) making use of any of the other features of xarray.
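If you did go that route, a rough sketch of such a Dataset might look like the following - the file layout and column names are made up:

```python
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset
from torchvision.transforms.functional import to_tensor


class PlanktonImageDataset(Dataset):
    """Serve (image_tensor, metadata) pairs; layout and column names are hypothetical."""

    def __init__(self, metadata_csv: str, image_dir: str):
        self.metadata = pd.read_csv(metadata_csv)  # assumed: one row per image
        self.image_dir = image_dir

    def __len__(self) -> int:
        return len(self.metadata)

    def __getitem__(self, idx: int):
        row = self.metadata.iloc[idx]
        with Image.open(f"{self.image_dir}/{row['filename']}") as img:
            image = to_tensor(img.convert("RGB"))  # float tensor, C x H x W
        return image, row.to_dict()  # metadata rides along with the image
```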

  5. Is the idea to eventually lazily load a large dataset of images from the s3 bucket using dask?

This is a good use case for xarray/dask. I have some small example notebooks somewhere on datalabs which I could share with you if useful.
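As a hedged sketch of that lazy-loading pattern using dask.delayed with s3fs - the bucket name is a placeholder and this isn't how the project currently reads its data:

```python
import dask
import numpy as np
import s3fs
from PIL import Image

fs = s3fs.S3FileSystem(anon=True)
paths = fs.glob("example-plankton-bucket/images/*.tif")  # placeholder bucket


@dask.delayed
def load(path):
    with fs.open(path, "rb") as f:
        return np.array(Image.open(f))  # downloaded and decoded only when computed


lazy_images = [load(p) for p in paths]        # nothing downloaded yet
first_batch = dask.compute(*lazy_images[:8])  # fetch and decode just these eight
```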

  6. If 5 is correct, what is the context in which we would do this?

For me, and speaking generally, it's at minimum to 'allow' researchers to access and analyse large datasets easily, and ideally to provide a semi-standardised workflow, with some boilerplate code to abstract away some of the complications of accessing and analysing large datasets. That's sort of my aim with my work in FDRI anyway.

  7. If 5 is correct, are we sure that's doable given that images are all different sizes?

Knowing very little about this project I'm not certain, but my hunch would be that it's still possible, though it would require some thought to chunk/structure the data sensibly.

  8. If the answer to 6 is 'training', do we have any idea of the scale of dataset we are looking at? My naive guess would be that downloading batches from the remote storage at computation time is going to be a bottleneck that we would only tolerate if we physically couldn't fit the dataset into local memory.
  9. If the answer to 6 is 'evaluation', wouldn't eventual users be evaluating images individually or in small batches? In which case how does dask help?

This is ultimately the trade-off with S3, right: tolerate the longer latency of having the data stored remotely instead of on local disk, in exchange for the scalability/parallelism and the ability to store and analyse gargantuan datasets. As you're hinting at, it's not always the appropriate solution. At the extreme end, for analysing data that fits perfectly fine in memory, it's definitely overkill.

Hope some of that semi-structured ramble was useful!!

jmarshrossney commented 1 month ago

Thank you both!

So it comes down to whether we are mainly using this project as a way to develop a more general pipeline that could be dropped in elsewhere with minimal modification, or whether we are focused on reaching the best solution for the specific problem posed by the plankton project?

Edit: that was intended to read as a question rather than a statement! (added ?)

mattjbr123 commented 1 month ago

Sounds about right - I'm interested to see where you take this, so keep me posted :)

metazool commented 4 weeks ago

Thank you both for the engaged and helpful commentary here!

This small experimental project definitely doesn't have "scaling up big" as one of its intrinsic aims

Following on from this comment, I've now completely removed the xarray/dask/intake components that came along wholesale when we tried reusing work from scivision, in #36.

There's less to it and it's more readable now.

In my previous gig, the folks from Nvidia/MS were heavily pushing Triton Inference Server. I never dug into it, but it might be worth visiting for a future effort to "BioCLIP all the things" or for some of the LLM track of work that @matthewcoole is on. It might equally be "buy more GPUs from us" marketing material; this isn't a recommendation!

Planning to close this once #36 is completed