Closed · eric-czech closed this issue 4 years ago
> ...was curious if you guys would be willing to talk a little bit about the state of uarray integration in other PyData projects.
Hopefully I'll be able to say more soon, but as of now, the only use is as a scipy.fft backend system, which is adopted/implemented by CuPy and MKL, among others.
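For anyone who hasn't seen that system, the dispatch looks roughly like this sketch; the CuPy side is the documented cupyx.scipy.fft backend (mkl_fft exposes a similar one), and the example is illustrative rather than taken from this thread:

import cupy as cp
import cupyx.scipy.fft as cufft
import scipy.fft

x = cp.random.random(2048)

# Inside this block, scipy.fft multimethods dispatch to CuPy's FFT and the
# result stays on the GPU as a cupy.ndarray.
with scipy.fft.set_backend(cufft):
    X = scipy.fft.fft(x)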
> For example, a minority of the functionality we need will be covered directly by the numpy API (e.g. row-wise/column-wise summary statistics), but the majority of it, or at least the harder, more interesting functionality, will involve fairly bespoke algorithms that are specific to the domain and can only dispatch partly through a numpy API, via something like unumpy. What we will need is a way to define how these algorithms work based on the underlying array type, and we know that the implementations will be quite different when using Dask, CuPy, or in-memory arrays.
Sounds like a reasonable request.
> I imagine we will need our own DaskBackend, CuPyBackend, etc. implementations, and though I noticed several warnings from you guys on not building backends that depend on other backends, this seems like an exception. In other words, our GeneticsDaskBackend would need to force use of the unumpy DaskBackend. Did you guys envision this working differently or am I on the right track?
What I'd suggest, more or less, is that you define your own multimethods. There are docs on doing that here. What can then be done on the side of the backend providers is to provide a register_implementation function of sorts, through which you'd register your implementation to backends.
The reason I don't suggest otherwise is this: what if the user does the following in their code?
with ua.set_backend(CupyBackend), ua.set_backend(SKAllelDaskBackend):
    pass
It'd almost certainly be bad for performance.
The other (in my mind, suboptimal) approach is that we write something like a determine_backend function (see #198), which will then return the right backend and allow one to specialize based on it.
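For illustration, a bare-bones multimethod for your domain might look something like the sketch below (hypothetical names and domain; the extractor marks which arguments are arrays, the replacer puts the possibly-coerced arrays back into the call):

import numpy as np
import uarray as ua

def _genetics_algo_1_extractor(arr, mask):
    # Mark the array arguments so backends can coerce/dispatch on them
    return (ua.Dispatchable(arr, np.ndarray), ua.Dispatchable(mask, np.ndarray))

def _genetics_algo_1_replacer(args, kwargs, dispatchables):
    # Put the (possibly coerced) arrays back; assumes both are passed positionally
    return (dispatchables[0], dispatchables[1]), kwargs

genetics_algo_1 = ua.generate_multimethod(
    _genetics_algo_1_extractor,
    _genetics_algo_1_replacer,
    domain="sk_allel",
)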
> I was also wondering if you knew of projects specific to some scientific domain that build on uarray/unumpy to support dispatching. I suppose it's early for that, but I wanted to ask because it would be great to have a model to follow if one exists, rather than working through some prototypes myself.
It is too early for that, currently. One I know of is udiff, an in-progress autograd library built on top of unumpy, also under Quansight-Labs.
If you have any follow-up questions don't be afraid to ask.
> Hopefully I'll be able to say more soon, but as of now, the only use is as a scipy.fft ...
Nice! I'll check it out.
> What can then be done on the side of the backend providers is to provide a register_implementation function of sorts
I think I see what you mean there, but I believe I'm looking for something else. For example, I'd like a user to be able to do this:
import genetics_api
ds = genetics_api.read_genetic_data_file(...) # -> xr.Dataset
# I think this needs to activate our SKAllelDaskBackend backend and the one
# used by unumpy -- it may be a better idea to force use of
# ua.set_global_backend / ua.set_backend directly but I'm not sure
genetics_api.set_global_backend('dask')
# or with genetics_api.set_backend('dask'):
# This function would now do stuff using the unumpy API in a lot places,
# but it will also need to use the dask API directly for things like `map_overlap`
# (we don't want to override any of the functionality already in unumpy.DaskBackend)
ds_res = genetics_api.run_analysis(ds)
What I hope that makes clearer is that we wouldn't want to add implementations of existing things in the Dask API. We know that when the user wants "Dask mode", all of our methods will need to use things that exist only in the Dask API, alongside many that can easily be dispatched to Dask via unumpy. I'm thinking something like:
# in genetics_api.py:
def run_analysis(ds: xr.Dataset) -> xr.Dataset:
    new_var_1 = genetics_algo_1(ds['data_var'])
    new_var_2 = ...  # some other genetics function
    return xr.Dataset({'var1': new_var_1, ...})

# This would have different backends and work on array arguments
genetics_algo_1 = ua.generate_multimethod(...)

# Somewhere as part of the SKAllelDaskBackend (downstream from
# xr.DataArray -> da.Array coercion):
def genetics_algo_1(arr: DaskArray, mask: DuckArray) -> DaskArray:
    import dask.array as da
    import unumpy

    # Do stuff you can only do with the Dask API and that is super critical for performance
    arr = arr.rechunk(...)
    res = arr.map_overlap(...)

    # Also do stuff that would coerce to Dask and return Dask results,
    # but is covered by the np API
    # * This could be a Dask * NumPy multiplication because mask is only 1D, but I
    #   definitely want the result to remain a Dask array (so I need unumpy.DaskBackend on)
    res = unumpy.multiply(res, mask)
    return res
Since we would already need to be using the Dask API directly, it seems like it would be a good idea to force use of the unumpy Dask backend as well there, no? In that example above, I suppose it would make sense to also keep using the Dask API for things like the res * mask step, but I'm trying to think of how we could use project-specific APIs only where absolutely necessary.
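To make that concrete, here is a rough sketch of what I mean by "forcing" the unumpy Dask backend from inside our own backend. All names are hypothetical, and it assumes unumpy exposes its Dask backend as unumpy.dask_backend.DaskBackend (the actual import path may differ) and that the plain uarray __ua_domain__/__ua_function__ protocol is used:

import uarray as ua

class GeneticsDaskBackend:
    __ua_domain__ = "sk_allel"
    _implementations = {}  # multimethod -> Dask-specific implementation

    @classmethod
    def register_implementation(cls, multimethod, impl):
        cls._implementations[multimethod] = impl

    @classmethod
    def __ua_function__(cls, method, args, kwargs):
        from unumpy.dask_backend import DaskBackend  # assumed import path
        impl = cls._implementations.get(method)
        if impl is None:
            return NotImplemented
        # Any unumpy call made inside the implementation now also goes to Dask
        with ua.set_backend(DaskBackend):
            return impl(*args, **kwargs)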
Thanks for the time @hameerabbasi.
@eric-czech Yes, perhaps I can explain a bit more concretely what I meant. You would do the following, ideally, in this case (the run_analysis method is unchanged):
# Note the "numpy." prefix -- It means any NumPy backend would also dispatch to your function.
genetics_algo_1 = ua.generate_multimethod(..., domain="numpy.sk_allel")
def genetics_algo_1_dask(arr, mask):
    import dask.array as da
    import unumpy as np

    # Do stuff you can only do with Dask
    arr = arr.rechunk(...)
    res = arr.map_overlap(...)

    # Do generic stuff
    return np.multiply(res, mask)
# This would say to the Dask backend -- "I have my own API, but I'd like to specialize a function in it for Dask"
# Dask would need to provide this, but it's easy to do.
DaskBackend.register(genetics_algo_1, genetics_algo_1_dask)
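With a registration hook like that in place, usage would just be an ordinary set_backend. A sketch (DaskBackend being unumpy's Dask backend, and darr/mask placeholder inputs):

import uarray as ua
import dask.array as da
import numpy as np

darr = da.random.random((1_000_000,), chunks=100_000)
mask = np.ones(1_000_000)

with ua.set_backend(DaskBackend):
    result = genetics_algo_1(darr, mask)  # routed to genetics_algo_1_dask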
Whereas, in the second case that I mentioned, it's a bit different:
def genetics_algo_1(arr, mask):
    import unumpy as np

    # `determine_backend` is an upcoming feature
    if isinstance(np.determine_backend(arr), DaskBackend):
        return genetics_algo_1_dask(arr, mask)
    # ... otherwise fall back to the generic implementation

def genetics_algo_1_dask(arr, mask):
    import dask.array as da
    import unumpy as np

    # Do stuff you can only do with Dask
    arr = arr.rechunk(...)
    res = arr.map_overlap(...)

    # Do generic stuff
    return np.multiply(res, mask)
This way, just setting the DaskBackend would be enough to dispatch to your function when it's available, and you do not need wrappers.
Ah I see now, thanks! I'll go that route instead.
I noticed a thread between @hameerabbasi and some Xarray folks in https://github.com/pydata/xarray/issues/1938#issuecomment-510944897 and was curious if you guys would be willing to talk a little bit about the state of uarray integration in other PyData projects. It looks like this was all pretty nascent then, and I was disappointed to see that there aren't any more open issues about working it into Xarray (or so it seems).
I like the idea a lot and I'm trying to understand how best to think about its usage by scikit developers. More specifically, we work in statistical genetics and are coordinating with a few other developers in the space to help think about the next generation of tools like scikit-allel. A big question for me in the context of uarray is how the Backends would inter-operate between projects if they attempt to address similar problems.
For example, a minority of the functionality we need will be covered directly by the numpy API (e.g. row-wise/column-wise summary statistics), but the majority of it, or at least the harder, more interesting functionality, will involve fairly bespoke algorithms that are specific to the domain and can only dispatch partly through a numpy API, via something like unumpy. What we will need is a way to define how these algorithms work based on the underlying array type, and we know that the implementations will be quite different when using Dask, CuPy, or in-memory arrays. I imagine we will need our own DaskBackend, CuPyBackend, etc. implementations, and though I noticed several warnings from you guys on not building backends that depend on other backends, this seems like an exception. In other words, our GeneticsDaskBackend would need to force use of the unumpy DaskBackend. Did you guys envision this working differently or am I on the right track?
I was also wondering if you knew of projects specific to some scientific domain that build on uarray/unumpy to support dispatching. I suppose it's early for that, but I wanted to ask because it would be great to have a model to follow if one exists, rather than working through some prototypes myself.
Thanks!