Workflow ideas - Githubissues

jrbourbeau commented 1 year ago

Note: I'm moving over the list of proposed workflows from the roadmap to this repo. I'll continue to iterate a bit on this issue

Data loading and cleaning

Dask is often used to schlep data from one format to another, cleaning or manipulating it along the way. This occurs in both dataframe and array use cases. There are lots of possible configurations here, but we’ll focus on just a few to start.

[ ] https://github.com/coiled/coiled-runtime/issues/726

Exploratory Analysis

This is where most of our demos live today. Load a dataset, fool around, make some pretty charts.

[x] Uber/Lyft, perform various simple dataframe computations and find novel results
[ ] https://github.com/coiled/coiled-runtime/issues/770
[ ] ~~RenRe~~ punting during our first pass over workflows

Embarrassingly parallel ✅

The matplotlib-arXiv notebook is a good example we have today of embarrassingly parallel workflows. This is “Dask as a big for loop”. It also shows cloud data access and processes 3TB of real data.

[x] https://github.com/coiled/coiled-runtime/pull/724
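
A minimal sketch of the "big for loop" pattern, assuming a hypothetical `process` function and made-up inputs (this is not the code from the matplotlib-arXiv notebook):

```python
import dask


def process(item):
    # Hypothetical stand-in for per-item work, such as
    # downloading and scanning a single file.
    return len(item.split())


items = ["a b c", "d e", "f g h i"]  # made-up inputs

# Serial equivalent: results = [process(x) for x in items]
tasks = [dask.delayed(process)(x) for x in items]

# The tasks are independent, so Dask can run them in parallel.
# dask.compute returns a tuple with one result per task.
results = dask.compute(*tasks)
print(results)
```

On a cluster the same idea is often written with `client.map(process, items)` from `dask.distributed`, followed by `client.gather`.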

Imaging

There is a surprisingly large community of people using Dask for bio-medical imaging. This includes applications like fMRI brain scans, and very high resolution microscopy (3d movies at micro resolution of cells). These folks often want to load in data, apply image processing filters across that data using map_overlap, and then visually explore the result. They want this processing done with human-in-the-loop systems.

[ ] https://github.com/coiled/coiled-runtime/issues/751
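
To illustrate the `map_overlap` step, here is a small sketch that applies a 3x3 mean filter to a chunked array. The filter, the random "image", and the chunk sizes are assumptions chosen for illustration; real imaging workflows would typically use filters from a library such as scipy or dask-image.

```python
import numpy as np
import dask.array as da


def mean_filter_3x3(block):
    # Average each pixel with its 8 neighbours using shifted copies.
    # np.roll wraps around at the block edges, but those edge pixels
    # belong to the halo that map_overlap trims away afterwards.
    total = np.zeros_like(block, dtype=float)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            total += np.roll(block, shift=(dy, dx), axis=(0, 1))
    return total / 9.0


rng = np.random.default_rng(0)
image = rng.random((8, 8))                    # made-up stand-in for an image
dimage = da.from_array(image, chunks=(4, 4))  # four 4x4 chunks

# depth=1 shares a one-pixel halo between neighbouring chunks so the
# filter sees correct neighbours across chunk boundaries.
filtered = dimage.map_overlap(
    mean_filter_3x3, depth=1, boundary="reflect", dtype=float
).compute()
print(filtered.shape)
```

Away from the outer image border, the chunked result should match the same filter applied to the whole array at once, which is the point of sharing the halo.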

XGBoost

Probably our most common application in ML, folks want to load data into a dask dataframe and then hand off to XGBoost’s Dask integration, possibly with GPUs. They also want to do this with Hyper-Parameter-Optimization.

We already have Guido’s work here at https://github.com/coiled/dask-xgboost-nyctaxi . Maybe we want to extend it with GPUs or cost analysis.

[ ] Train on a large dataset
[ ] Train on a large dataset with HPO with Optuna
[ ] Add GPUs

PyTorch + HyperParameter Optimization

We have Optuna. We use it above for XGBoost but we should also show how to use it in more vanilla settings with a model that can be trained on a single machine, presumably a GPU. Let’s use PyTorch for this.

Train some PyTorch GPU model that fits on a single GPU with Optuna for HPO on a cluster

[ ] https://github.com/coiled/coiled-runtime/issues/759

mrocklin commented 1 year ago

@guillaumeeb we're trying to add some real-world examples to our benchmark suite (as opposed to the more toy examples there today) that are reflective of common Dask workloads. We're looking for examples roughly like the ones listed above. Ideally we'd find examples that are between 20 and 200 lines of code in terms of complexity. Looking at the list above, can you think of good examples that you've run across while engaging with users?

mrocklin commented 1 year ago

I suspect that @ncclementi has a notebook already for RenRe. Naty can you point to that if you still have it? Did we get clearance to use it in a public setting from them?

ncclementi commented 1 year ago

@mrocklin JD (RenRe) and I collaborated on creating a synthetic dataset that represented the original one. The synthetic data is not public, but it's on the oss-s3. I can make it public if we want.

The code for creating the data and a replication of their workflow (an imbalanced join) are in this repo: https://github.com/coiled/imbalanced-join

I'm happy to chat with whoever will be taking the lead on this to bring them up to speed and facilitate whatever they need.

mrocklin commented 1 year ago

My recollection is that the original RenRe workflow was more than just this one join. It was lots of things. Do we still have that? Is it possible to make that public?

ncclementi commented 1 year ago

@mrocklin we do have those in a private repo that has multiple things (happy to walk you through what's in there, I'm available this afternoon). When I talked to JD, they mentioned their main issue was the joins shown in the notebook.

mrocklin commented 1 year ago

Can you point to the repository?

mrocklin commented 1 year ago

For PyTorch + Optuna + GPUs, doing a web search yields not-terrible results. Here is an example (but I'm confident that there are better ones).

@jacobtomlinson @mmccarty @quasiben I don't suppose you all have any interest in finding something here? My guess is that this is much easier for you all (or someone around you) than it is for me personally.

jrbourbeau commented 1 year ago

Moving the HPO conversation into a standalone issue https://github.com/coiled/coiled-runtime/issues/759

guillaumeeb commented 1 year ago

Hi there,

On image processing, there is a complex use case that has not gotten an answer yet: https://dask.discourse.group/t/parallelize-or-map-chunks-of-arrays-with-different-sizes-shapes-and-number-of-blocks/1663.

Another small example on this topic: https://dask.discourse.group/t/upscaling-an-image-with-dask-image-leads-to-blurry-result/1631/3.

Dask for reading and processing videos: https://dask.discourse.group/t/performing-hog-matrices-on-pims-chunks-through-imageio/570.

I was hoping to find some nice Dataframe + ML workflows, but in the end these kinds of topics give only very basic "toy" examples. So after browsing Discourse and Stack Overflow for 20 minutes, I've given up.

coiled / benchmarks

Workflow ideas #725
