cupy / cupy

NumPy & SciPy for GPU
https://cupy.dev
MIT License

[Tracker] Implement all `scipy.*` APIs in CuPy #6324

Open kmaehashi opened 2 years ago

kmaehashi commented 2 years ago

Implement GPU version of scipy.* functions in cupyx.scipy.* namespace.

This is a tracker issue that summarizes the implementation status of each SciPy public module in CuPy. See the comparison table for details.

Legends

List of Modules

Note: Several modules in SciPy, such as scipy.cluster.hierarchy, accept CuPy ndarrays as inputs through the Array API standard.

Starter Task

If you are new to CuPy and wondering where to start, here are some good first things to try. They are relatively simple and independent.

Steps to Contribute

  1. Fork and star :star: the CuPy repository :wink:

  2. Pick a function you want to work on from any of the modules listed above. You can find the function in the SciPy API Reference to understand what should be implemented. If you want to work on new modules (🥚), first discuss with us on this issue.

  3. Implement a function in your branch. If you need help, join Gitter or just ask for help in this issue.

  4. Don't forget to write test code!

  5. Build CuPy and run tests to confirm that the function runs fine: `pip install -e . && pytest tests/cupyx_tests/scipy_tests/PATH_TO_YOUR_TEST`. See the Contribution Guide for details.

  6. Submit a pull-request to the main branch.

Note that you will need a GPU environment to develop CuPy.

See also: NumPy API Tracker Issue #6078

leofang commented 2 years ago

cc: @cjnolet @quasiben

rgommers commented 2 years ago

This looks great, thanks @kmaehashi!

Miscellaneous routines (scipy.misc)

Note that we're quite likely deprecating this module in SciPy 1.9.0 (see https://github.com/scipy/scipy/issues/15608), so I suggest not working on that.

kmaehashi commented 2 years ago

Thanks for the info, @rgommers! Updated the list.

cjnolet commented 2 years ago

@kmaehashi, RAFT contains hierarchical clustering code as well: https://github.com/cupy/cupy/issues/3434.

RAFT also contains functions for computing contingency table and sampling from various distributions. Wondering if those might be useful here as well.

kmaehashi commented 2 years ago

Thanks @cjnolet! Added a link to the issue.

Pranavchiku commented 1 year ago

This is brilliant and interesting! I am Pranav, a developer @lfortran; we are trying to compile SciPy using LFortran and are very close, having almost got scipy.special.specfun working. I wish to ask: are we (CuPy contributors) translating the Fortran implementations present in SciPy to make them use CuPy APIs?

kmaehashi commented 1 year ago

Hi @Pranavchiku! Yes, we are rewriting SciPy code into CUDA C/C++. You'll get the idea by exploring our code base under cupyx/scipy: https://github.com/cupy/cupy/tree/main/cupyx/scipy/special

Pranavchiku commented 1 year ago

Note that you will need a GPU environment to develop CuPy.

Do we need a discrete GPU, or do integrated GPUs work? By integrated I mean the Apple GPU available with Mac M1, M2, etc. chips.

kmaehashi commented 1 year ago

Discrete GPUs are required; currently we support CUDA and HIP (for AMD). https://docs.cupy.dev/en/latest/overview.html

Zhyrek commented 1 year ago

Hey, I need non-batched BLAS functions for what I'm working on, so I will be working on implementing those in the near future. Just wondering if there are any thoughts on best practices for implementing these.

Roughly how I imagine using these functions:

```py
@jit.rawkernel(device=True)
def square(a):
    matrix2 = cupyx.jit.local_memory(float, (2, 2))  # I'll also try implementing multi-dimensional local/shared arrays
    matrix2[0][0] = a
    matrix2[1][1] = a
    return cupyx.scipy.linalg.det(matrix2)  # can be used in device functions

@jit.rawkernel()
def elementwise_2_x_squared(x, size):  # computes 2*x^2, toy example which uses linalg functions
    tid = jit.grid(1)
    ntid = jit.gridsize(1)
    matrix = cupyx.jit.local_memory(float, (2, 2))
    for i in range(tid, size, ntid):
        stored_val = x[i]
        x[i] = square(stored_val)
        matrix[0][0] = stored_val
        matrix[1][1] = stored_val
        x[i] += cupyx.scipy.linalg.det(matrix)  # can be used inside raw kernel

c = cp.array([1., 2.])
elementwise_2_x_squared((256,), (256,), (c, 2))
print(c)  # should return [2., 8.]
```

The main use case I have is roughly: I have a large array of scalar values (could be hundreds of millions). For each value, I need to generate a (potentially not small) matrix, solve it, and then use the result to compute a scalar output value. For batched code, I would need to store all of these matrices at once (potentially hundreds of GB of memory, which is not feasible), while for non-batched, dynamically generated matrices, the memory requirement would be just the scalar values.
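The memory tradeoff described above can be sketched on the CPU with NumPy (a minimal sketch; the shapes and the determinant reduction are illustrative stand-ins, the batched route materializes all N matrices at once, while the streamed route, which device-side non-batched linalg would enable, keeps only one small matrix live at a time):

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 1000, 4                      # N scalar inputs, one k-by-k matrix per input
vals = rng.normal(size=N)

# Batched: materialize all N matrices -> O(N * k * k) memory.
A = np.eye(k) + vals[:, None, None] * np.eye(k)    # shape (N, k, k)
batched_out = np.linalg.det(A)                     # one batched call

# Streamed: build each matrix on the fly -> O(k * k) memory.
streamed_out = np.empty(N)
tmp = np.empty((k, k))
for i, v in enumerate(vals):
    np.copyto(tmp, np.eye(k) * (1.0 + v))
    streamed_out[i] = np.linalg.det(tmp)

assert np.allclose(batched_out, streamed_out)
```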

rcremese commented 1 year ago

Hello, I'm quite new to the open-source community but I would like to contribute to this project. Is anybody working on the integration of the scipy.constants module?

ev-br commented 7 months ago

It seems possible to mark scipy.interpolate, scipy.signal and scipy.signal.windows as :shipit: now?

ev-br commented 7 months ago

Also, I wonder if the comparison table (super-useful BTW!), https://docs.cupy.dev/en/stable/reference/comparison.html#scipy-cupy-apis, is updated automatically on a release, or is it manual?

Would be great to remove SciPy-deprecated items (signal.cmplx_sort, the window functions in the scipy.signal namespace, linalg.tri{l,u}, etc.)

Pranavchiku commented 7 months ago

Hey all, I've always wanted to contribute here, and I have my first PR https://github.com/cupy/cupy/pull/8305 opened that ports scipy.stats.kstat. I do not have access to GPUs, hence have not tested locally; I followed boxcox_llf and it was fun and easy :)

kmaehashi commented 7 months ago

It seems possible to mark scipy.interpolate, scipy.signal and scipy.signal.windows as :shipit: now?

Good catch! Updated the list :shipit:

Also, I wonder if the comparison table (super-useful BTW!), https://docs.cupy.dev/en/stable/reference/comparison.html#scipy-cupy-apis, is updated automatically on a release, or is it manual?

The latest page is updated automatically for each commit merged into the main branch. The stable page is updated automatically for each stable (v13.*) release.

Would be great to remove SciPy-deprecated items (signal.cmplx_sort, the window functions in the scipy.signal namespace, linalg.tri{l,u}, etc.)

The page is generated with this tool. Deprecated items are currently listed in the footnotes section to describe why we don't implement them. https://github.com/cupy/cupy/blob/1aa70f0de8c78afcfa503837d9baee3d80af4bc8/docs/source/_comparison_generator.py#L206-L207

kmaehashi commented 7 months ago

Hey all, I've always wanted to contribute here, and I have my first PR #8305 opened that ports scipy.stats.kstat. I do not have access to GPUs, hence have not tested locally; I followed boxcox_llf and it was fun and easy :)

Hi @Pranavchiku, thank you for your interest in contributing! We, however, would like to kindly ask all contributors to build and test the branch before submitting a pull request. I understand that there are situations where access to GPUs is difficult but in that case please consider using offerings such as Google Colab.

ev-br commented 7 months ago

Thanks @kmaehashi! A couple of comments, now that I've looked at this tracker a bit more:

kmaehashi commented 6 months ago

Thanks for the suggestions @ev-br!

  • With KDTree and Delaunay implemented, maybe scipy.spatial is fair to be marked as :shipit:, too. Submodules (spatial.transform and spatial.distance) might be a separate story; I don't have an opinion.

Absolutely, marked :shipit:!

  • scipy.linalg.{blas, lapack} are unlikely to be implemented at all, correct?

I can't really say they're unlikely to be implemented at all, as we have the non-public (undocumented) cupyx.lapack module. I think we can say they're considered low-priority, though.

  • even more so for scipy.linalg.cython_{blas,lapack}. Maybe even add an "out of scope" type symbol, ❌ or some such?

Sounds great, marked ❌.

  • scipy.io --- is there an appetite for ever adding functionality for reading from MATLAB to CuPy? I suppose using scipy.io followed by cupy.asarray is the way to go?

I began to wonder if I could delete it from the list, as the audience is very limited compared to NumPy's I/O routines (e.g., cupy.load, which we implemented by wrapping numpy.load).
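For reference, the scipy.io-followed-by-cupy.asarray pattern mentioned above can be sketched like this (a minimal round-trip done in memory; the cp.asarray line is the assumed GPU step, commented out so the snippet runs without a GPU):

```python
import io
import numpy as np
from scipy.io import savemat, loadmat

# Round-trip a MATLAB-format file through an in-memory buffer.
buf = io.BytesIO()
savemat(buf, {"A": np.arange(6.0).reshape(2, 3)})
buf.seek(0)
A_host = loadmat(buf)["A"]          # NumPy array on the host

# On a GPU machine: A_dev = cp.asarray(A_host)  # copies host -> device
assert A_host.shape == (2, 3)
```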

  • Several SciPy subpackages support CuPy array inputs via Array API, e.g. scipy.cluster. It might be nice to show these too here?

Nice, added a link to the Array API support docs!

craigwarner-ufastro commented 3 months ago

@kmaehashi hi, first time contributor here. I've just implemented a vectorized version of scipy.optimize.bvls (bounded-variable least squares, called through scipy.optimize.lsq_linear with method='bvls') for an astronomical data pipeline. lsq_linear solves the least squares problem

        minimize 0.5 * ||A x - b||**2
        subject to lb <= x <= ub

where A has dimensions (i, j), b has dimension (i), and x has dimension (j). My implementation vectorizes this to solve N identically shaped problems by making A => (N, i, j), b => (N, i), and x => (N, j), with a speed gain obtained once N reaches at least a few hundred.

I would love to contribute this to CuPy but wanted to check first, since you don't seem to have a cupyx.scipy.optimize module, which is where I think it should go to mirror SciPy. Note I have not implemented the trf or lsmr algorithms within lsq_linear, so these currently raise NotImplementedError; only BVLS is implemented. Cheers!
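For context, a CPU-side sketch of the problem being vectorized, using SciPy's existing lsq_linear with method='bvls' in a plain loop over the batch (the shapes are illustrative; the proposed CuPy version would accept A of shape (N, i, j) directly instead of looping):

```python
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(0)
N, i, j = 4, 5, 3            # N independent problems, each A is (i, j)
A = rng.normal(size=(N, i, j))
b = rng.normal(size=(N, i))
lb, ub = -1.0, 1.0           # box constraints lb <= x <= ub

# Solve each problem with SciPy's BVLS; a vectorized GPU kernel
# would solve all N problems at once instead of looping.
x = np.empty((N, j))
for k in range(N):
    res = lsq_linear(A[k], b[k], bounds=(lb, ub), method='bvls')
    x[k] = res.x

assert x.shape == (N, j)
```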

jakirkham commented 3 months ago

There is some discussion about cupyx.scipy.optimize in issue: https://github.com/cupy/cupy/issues/6112

dbolser commented 2 months ago

So... CuPy may be the wrong solution for my needs, but I have an array of floats of length ~500,000 with a variable number (0.00001% to 10%) of 'affected' vs. 'normal' samples (about 10k of them). I want to run a permutation test on this large array, and it's taking minutes for as few as 1k permutations. I get the feeling I should sub-sample the normals and then permute them, but how many sub-samples do I take, etc.?

Long story short, I'd like to implement scipy.stats.permutation_test in CuPy! Since you said to ask here first... here I am ;-)

Perhaps I can get away with just using CuPy arrays and passing 'native' scipy.stats.permutation_test a cupy function as the callable statistic? e.g. cupy.ndarray.mean?

I guess I should try that first 😂

Several mentions of RAFT above. I can't figure out if that's useful for me or not.

mdhaber commented 2 months ago

Most SciPy stats functions are being made array API compatible, so CuPy does not need to re-implement them. There are a few that we might ask for CuPy to implement because they cannot be implemented in terms of array API operations and common special functions, but permutation_test is probably not one of them.

Perhaps I can get away with just using CuPy arrays and passing 'native' scipy.stats.permutation_test a cupy function as the callable statistic

You'll need to wrap it so it returns a NumPy array, but yes - if computing the statistic function is the bottleneck, that will probably speed things up.
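A minimal sketch of such a wrapped, vectorized statistic (shown CPU-only so it runs anywhere; on a GPU the means would be computed with cupy and the result converted back to NumPy with .get() before returning):

```python
import numpy as np
from scipy.stats import permutation_test

def mean_diff(x, y, axis=0):
    # On GPU: compute with cupy, then return result.get() so
    # permutation_test receives a NumPy array back.
    return np.mean(x, axis=axis) - np.mean(y, axis=axis)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=50)
y = rng.normal(0.5, 1.0, size=50)

res = permutation_test(
    (x, y),
    statistic=mean_diff,
    permutation_type="independent",
    vectorized=True,
    n_resamples=500,
    alternative="two-sided",
)
print(res.pvalue)
```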

dbolser commented 1 month ago

Perhaps I can get away with just using CuPy arrays and passing 'native' scipy.stats.permutation_test a cupy function as the callable statistic

You'll need to wrap it so it returns a NumPy array, but yes - if computing the statistic function is the bottleneck, that will probably speed things up.

Maybe I misunderstood what you meant by 'wrap'... If I try to pass CuPy arrays into permutation_test, it immediately complains:

TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.

Seems there is no speed improvement over NumPy on simple 'wrapped' statistics:

```py
import cupy as cp
import numpy as np
from scipy.stats import permutation_test

def mean_ind_cp(x: np.ndarray, y: np.ndarray, axis=0):
    cp_x = cp.array(x)
    cp_y = cp.array(y)
    return cp.mean(cp_x, axis=axis) - cp.mean(cp_y, axis=axis)

def compare_ttest_permu(rp_scores_np: np.ndarray, rn_scores_np: np.ndarray) -> None:
    for n_resamples in [10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000]:
        print(f"PERM : {n_resamples}")

        res = permutation_test(
            (rp_scores_np, rn_scores_np),
            statistic=mean_ind_cp,
            permutation_type="independent",
            n_resamples=n_resamples,
            alternative="two-sided",
            vectorized=True,
        )
```
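For completeness, the wrapping mdhaber described means converting the statistic's result back to NumPy before returning it. A minimal sketch (duck-typed via hasattr so it also runs CPU-only without CuPy; CuPy arrays expose .get() to copy device to host):

```python
import numpy as np

def to_numpy(a):
    # CuPy arrays have .get() to copy device -> host;
    # NumPy values pass through unchanged.
    return a.get() if hasattr(a, "get") else np.asarray(a)

def mean_ind(x, y, axis=0):
    # With CuPy inputs these means would run on the GPU; the
    # to_numpy call hands a NumPy result back to permutation_test.
    diff = np.mean(x, axis=axis) - np.mean(y, axis=axis)
    return to_numpy(diff)

out = mean_ind(np.array([1.0, 2.0, 3.0]), np.array([0.0, 1.0, 2.0]))
print(out)  # → 1.0
```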