Quansight-Labs / uarray

A general dispatch and override mechanism for Python.
https://uarray.org/
BSD 3-Clause "New" or "Revised" License

Upstream uarray integration #241

Closed · eric-czech closed this issue 4 years ago

eric-czech commented 4 years ago

I noticed a thread between @hameerabbasi and some Xarray folks in https://github.com/pydata/xarray/issues/1938#issuecomment-510944897 and was curious if you guys would be willing to talk a little bit about the state of uarray integration in other PyData projects. It looks like this was all pretty nascent then, and I was disappointed to see that there aren't any more open issues about working it into Xarray (or so it seems).

I like the idea a lot and I'm trying to understand how best to think about its usage by scikit developers. More specifically, we work in statistical genetics and are coordinating with a few other developers in the space to help think about the next generation of tools like scikit-allel. A big question for me in the context of uarray is how backends would interoperate between projects if they attempt to address similar problems.

For example, a minority of the functionality we need will be covered directly by the numpy API (e.g. row-wise/column-wise summary statistics), but the majority of it, or at least the harder, more interesting functionality, will involve fairly bespoke algorithms that are specific to the domain and can only dispatch partly through a numpy API, via something like unumpy. What we will need is a way to define how these algorithms work based on the underlying array type, and we know that the implementations will be quite different when using Dask, CuPy, or in-memory arrays. I imagine we will need our own DaskBackend, CuPyBackend, etc. implementations, and though I noticed several warnings from you guys on not building backends that depend on other backends, this seems like an exception. In other words, our GeneticsDaskBackend would need to force use of the unumpy DaskBackend. Did you guys envision this working differently, or am I on the right track?

I was also wondering if you knew of projects specific to some scientific domain that build on uarray/unumpy to support dispatching. I suppose it's early for that, but I wanted to ask because it would be great to have a model to follow if one exists, rather than working through some prototypes myself.

Thanks!

hameerabbasi commented 4 years ago

was curious if you guys would be willing to talk a little bit about the state of uarray integration in other PyData projects.

Hopefully I'll be able to say more soon, but as of now, the only use is as a scipy.fft backend system which is adopted/implemented by CuPy and MKL, among others.
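
For concreteness, using one of those backends looks roughly like this (CuPy's cupyx.scipy.fft module acts as the backend object here; this mirrors the documented usage pattern):

import cupy as cp
import cupyx.scipy.fft as cufft
import scipy.fft

x = cp.random.rand(1024)
with scipy.fft.set_backend(cufft):
    y = scipy.fft.fft(x)  # dispatches to CuPy's FFT, returns a CuPy array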

For example, a minority of the functionality we need will be covered directly by the numpy API (e.g. row-wise/column-wise summary statistics), but the majority of it, or at least the harder, more interesting functionality, will involve fairly bespoke algorithms that are specific to the domain and can only dispatch partly through a numpy API, via something like unumpy. What we will need is a way to define how these algorithms work based on the underlying array type, and we know that the implementations will be quite different when using Dask, CuPy, or in-memory arrays.

Sounds like a reasonable request.

I imagine we will need our own DaskBackend, CuPyBackend, etc. implementations, and though I noticed several warnings from you guys on not building backends that depend on other backends, this seems like an exception. In other words, our GeneticsDaskBackend would need to force use of the unumpy DaskBackend. Did you guys envision this working differently, or am I on the right track?

What I'd suggest, more or less, is that you define your own multimethods; there are docs on doing that in the uarray documentation. What can then be done on the side of the backend providers is to provide a register_implementation function of sorts, through which you'd register your implementation with a backend.
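
Roughly, defining a multimethod looks like this (just a sketch; genetics_algo_1, its arguments, and the "genetics" domain are placeholders for whatever you'd use):

import numpy as np
import uarray as ua

def _algo_replacer(args, kwargs, dispatchables):
    # Swap the (possibly converted) arrays back into the call's arguments
    def replacer(arr, mask, **kw):
        return (dispatchables[0], dispatchables[1]), kw
    return replacer(*args, **kwargs)

def _algo_extractor(arr, mask):
    # Mark which arguments are dispatchable arrays
    return (ua.Dispatchable(arr, np.ndarray),
            ua.Dispatchable(mask, np.ndarray))

genetics_algo_1 = ua.generate_multimethod(
    _algo_extractor, _algo_replacer, domain="genetics"
)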

The reason I don't suggest otherwise is, what if the user does the following in their code:

with ua.set_backend(CupyBackend), ua.set_backend(SKAllelDaskBackend):
    pass

It'd almost certainly be bad for performance: the SKAllelDaskBackend would force Dask internally even though the user asked for CuPy, so arrays could end up being coerced back and forth between the two.

The other (in my mind, suboptimal) approach is that we write something like a determine_backend function (see #198), which will then return the right backend and allow one to specialize based on it.

I was also wondering if you knew of projects specific to some scientific domain that build on uarray/unumpy to support dispatching. I suppose it's early for that, but I wanted to ask because it would be great to have a model to follow if one exists, rather than working through some prototypes myself.

It is too early for that, currently. One I know of is udiff, an in-progress autograd library built on top of unumpy, also under Quansight-Labs.

hameerabbasi commented 4 years ago

If you have any follow-up questions don't be afraid to ask.

eric-czech commented 4 years ago

Hopefully I'll be able to say more soon, but as of now, the only use is as a scipy.fft

Nice! I'll check it out.

What can then be done on the side of the backend providers is to provide a register_implementation function of sorts

I think I see what you mean there, but I believe I'm looking for something else. For example, I'd like a user to be able to do this:

import genetics_api

ds = genetics_api.read_genetic_data_file(...) # -> xr.Dataset

# I think this needs to activate our SKAllelDaskBackend backend and the one 
# used by unumpy --  it may be a better idea to force use of 
# ua.set_global_backend / ua.set_backend directly but I'm not sure
genetics_api.set_global_backend('dask') 

# or with genetics_api.set_backend('dask'):

# This function would now do stuff using the unumpy API in a lot of places,
# but it will also need to use the dask API directly for things like `map_overlap`
# (we don't want to override any of the functionality already in unumpy.DaskBackend)
ds_res = genetics_api.run_analysis(ds) 

What I hope that makes clearer is that we wouldn't want to add implementations of existing things in the Dask API. We know that when the user wants "Dask mode", all of our methods will need to use things that exist only in the Dask API alongside many that can easily be dispatched to Dask via unumpy. I'm thinking something like:

# in genetics_api.py:

def run_analysis(ds: xr.Dataset) -> xr.Dataset:
    new_var_1 = genetics_algo_1(ds['data_var'])
    new_var_2 = ... # some other genetics function
    return xr.Dataset({'var1': new_var_1, ...})

# This would have different backends and work on array arguments
genetics_algo_1 = ua.generate_multimethod(...)

# Somewhere as part of the SKAllelDaskBackend (downstream from 
# xr.DataArray -> da.Array coercion):

def genetics_algo_1(arr: DaskArray, mask: DuckArray) -> DaskArray:
    import dask.array as da
    import unumpy

    # Do stuff you can only do with the Dask API and that is super critical
    # for performance
    arr = arr.rechunk(...)
    res = da.overlap.map_overlap(arr, ...)

    # Also do stuff that would coerce to Dask and return Dask results,
    # but is covered by the np API
    # * This could be a Dask * NumPy multiplication because mask is only 1D, but I
    #   definitely want the result to remain a Dask array (so I need
    #   unumpy.DaskBackend on)
    res = unumpy.multiply(res, mask)

    return res

Since we would already need to be using the Dask API directly, it seems like it would be a good idea to force use of the unumpy Dask backend as well there, no? In the example above, I suppose it would also make sense to keep using the Dask API for things like the res * mask step, but I'm trying to think of how we could use project-specific APIs only where absolutely necessary.
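
Concretely, something like this wrapper is what I'm imagining (just a sketch, and I'm guessing at the import path for unumpy's Dask backend):

import uarray as ua
from unumpy.dask_backend import DaskBackend  # guessing at this import path

def with_unumpy_dask(impl):
    # Ensure unumpy calls inside our Dask implementation stay on Dask
    def wrapped(*args, **kwargs):
        with ua.set_backend(DaskBackend):
            return impl(*args, **kwargs)
    return wrapped

genetics_algo_1_dask = with_unumpy_dask(genetics_algo_1_dask)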

Thanks for the time @hameerabbasi.

hameerabbasi commented 4 years ago

@eric-czech Yes, perhaps I can explain a bit more concretely what I meant. Ideally, you would do the following in this case (the run_analysis method is unchanged):

# Note the "numpy." prefix -- It means any NumPy backend would also dispatch to your function.
genetics_algo_1 = ua.generate_multimethod(..., domain="numpy.sk_allel")

def genetics_algo_1_dask(arr, mask):
    import dask.array as da
    import unumpy as np

    # Do stuff you can only do with Dask
    arr = arr.rechunk(...)
    res = da.overlap.map_overlap(arr, ...)

    # Do generic stuff
    return np.multiply(res, mask)

# This would say to the Dask backend -- "I have my own API, but I'd like to specialize a function in it for Dask"
# Dask would need to provide this, but it's easy to do.
DaskBackend.register(genetics_algo_1, genetics_algo_1_dask)
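
For illustration, such a hook could sit on top of uarray's __ua_function__ protocol, something like this (none of these names exist in Dask today):

_implementations = {}

class DaskBackend:
    # "numpy" also covers sub-domains such as "numpy.sk_allel"
    __ua_domain__ = "numpy"

    @staticmethod
    def register(multimethod, implementation):
        _implementations[multimethod] = implementation

    @staticmethod
    def __ua_function__(method, args, kwargs):
        # Use a registered specialization if one exists; otherwise signal
        # that this backend doesn't handle the method
        if method in _implementations:
            return _implementations[method](*args, **kwargs)
        return NotImplemented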

Whereas, in the second case that I mentioned, it's a bit different:

def genetics_algo_1(arr, mask):
    import unumpy as np
    # `determine_backend` is an upcoming feature (see #198)
    if isinstance(np.determine_backend(arr), DaskBackend):
        return genetics_algo_1_dask(arr, mask)
    # ... otherwise fall back to a generic implementation

def genetics_algo_1_dask(arr, mask):
    import dask.array as da
    import unumpy as np

    # Do stuff you can only do with Dask
    arr = arr.rechunk(...)
    res = da.overlap.map_overlap(arr, ...)

    # Do generic stuff
    return np.multiply(res, mask)

hameerabbasi commented 4 years ago

This way, just setting the DaskBackend would be enough to dispatch to your function when it's available, and you do not need wrappers.
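
In other words, the end-user code would just be (again guessing at the unumpy import path):

import uarray as ua
from unumpy.dask_backend import DaskBackend  # guessing at this import path
import genetics_api

with ua.set_backend(DaskBackend):
    # genetics_algo_1 now dispatches to your registered Dask implementation
    ds_res = genetics_api.run_analysis(ds)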

eric-czech commented 4 years ago

Ah I see now, thanks! I'll go that route instead.