scanpy_funcs: Use multi-target regression

ahendriksen commented 1 year ago

cuML will gain support for multi-target regression in version 22.12. This speeds up the code significantly since the regression no longer has to be performed inside a for loop. As described in https://github.com/rapidsai/cuml/pull/4988, the speedup can be up to 50x.

This PR adds support for using the multi-target regression. To prevent running out of memory when the input is too large (more than 100000 rows), the existing iterative code path is used.

ahendriksen commented 1 year ago

For a representative use case, this PR speeds up the regression step by a factor of ~20.

import cupy as cp
from cuml.linear_model import LinearRegression
from time import perf_counter as timer
from contextlib import contextmanager

@contextmanager
def bench(name):
    # Code to acquire resource, e.g.:
    start = timer()
    yield
    duration = timer() - start
    print(f"{name}: {duration:0.2f} seconds")

n_cells = 91_000
n_genes = 5_000

normalized = cp.random.normal(size=(n_cells, n_genes))
n_counts = cp.random.normal(size=n_cells)
percent_mito =  cp.random.normal(size=n_cells)

with bench("current regress_out"):
    regress_out(normalized, n_counts, percent_mito, verbose=True) # current

with bench("new regress_out"):
    regress_out(normalized, n_counts, percent_mito, verbose=True) # PR

Results on a Volta V100 GPU:

Regressed 0 out of 5000
[... snip ...]
Regressed 4500 out of 5000

current regress_out: 16.75 seconds
new regress_out:      0.73 seconds

Intron7 commented 1 year ago

Dear Allard, great work this works amazingly well. Do you think it would also work if we created a chunck_size parameter to run like 100 genes at a time even for lager than 100k datasets?

Intron7 commented 1 year ago

So I implement this chunking already for my repo rapids_singlecell and it works really well. I'm still figuring out the details of the syntax, because its still pretty clunky I could write something similar for @ahendriksen PR.

cjnolet commented 1 year ago

I could write something similar for @ahendriksen PR.

@Intron7 if you have time to do that, it would certainly be really useful to show off.

Intron7 commented 1 year ago

@cjnolet @ahendriksen done.

I created a PR for @ahendriksen branch with the batching update.

I have to test tomorrow if we have to change the code for the multigpu notebook to set the batchsize argument there to None. The default is now a batchsize of 100. But i can change that to whatever you want.

ahendriksen commented 1 year ago

Thank you for adding the batching update @Intron7!

@ahendriksen this repository hasn't been updated for RAPIDS 22.12 yet. Can you update the environment yaml files in conda/environment/*.yaml and the Dockerfile and verify the notebooks still run successfully?

I have tested that the code / notebook in this PR runs successfully on RAPIDS 22.12. I have not tested if all other notebooks continue to run successfully.

Intron7 commented 1 year ago

@ahendriksen @cjnolet Perfect. I think this in a hugh improvement since this was one of the slowest parts of the analysis

cjnolet commented 1 year ago

I have to test tomorrow if we have to change the code for the multigpu notebook to set the batchsize argument there to None. The default is now a batchsize of 100. But i can change that to whatever you want.

@Intron7 were you able to check this? Does it run successfully for you? If so, I'm okay merging this and I can update the conda yaml file in a follow-on PR.

Intron7 commented 1 year ago

I found one error within the UVM 1million notebook. Its a memory error with cupy and the cubin hash in the scale function. The standard scaler function works but is a lot slower.

---------------------------------------------------------------------------
CUDADriverError                           Traceback (most recent call last)
File <timed exec>:1

File cupy/_core/core.pyx:1112, in cupy._core.core._ndarray_base.mean()

File cupy/_core/core.pyx:1121, in cupy._core.core._ndarray_base.mean()

File cupy/_core/_routines_statistics.pyx:148, in cupy._core._routines_statistics._ndarray_mean()

File cupy/_core/_kernel.pyx:1310, in cupy._core._kernel.ufunc.__call__()

File cupy/_core/_kernel.pyx:1337, in cupy._core._kernel.ufunc._get_ufunc_kernel()

File cupy/_core/_kernel.pyx:1029, in cupy._core._kernel._get_ufunc_kernel()

File cupy/_core/_kernel.pyx:72, in cupy._core._kernel._get_simple_elementwise_kernel()

File cupy/_core/core.pyx:2232, in cupy._core.core.compile_with_cache()

File ~/conda/envs/rapids-22.12/lib/python3.9/site-packages/cupy/cuda/compiler.py:493, in _compile_module_with_cache(source, options, arch, cache_dir, extra_source, backend, enable_cooperative_groups, name_expressions, log_stream, jitify)
    489     return _compile_with_cache_hip(
    490         source, options, arch, cache_dir, extra_source, backend,
    491         name_expressions, log_stream, cache_in_memory)
    492 else:
--> 493     return _compile_with_cache_cuda(
    494         source, options, arch, cache_dir, extra_source, backend,
    495         enable_cooperative_groups, name_expressions, log_stream,
    496         cache_in_memory, jitify)

File ~/conda/envs/rapids-22.12/lib/python3.9/site-packages/cupy/cuda/compiler.py:562, in _compile_with_cache_cuda(source, options, arch, cache_dir, extra_source, backend, enable_cooperative_groups, name_expressions, log_stream, cache_in_memory, jitify)
    560             cubin_hash = _hash_hexdigest(cubin).encode('ascii')
    561             if hash == cubin_hash:
--> 562                 mod.load(cubin)
    563                 return mod
    564 else:
    565     # Enforce compiling -- the resulting kernel will be cached elsewhere,
    566     # so we do nothing

File cupy/cuda/function.pyx:264, in cupy.cuda.function.Module.load()

File cupy/cuda/function.pyx:266, in cupy.cuda.function.Module.load()

File cupy_backends/cuda/api/driver.pyx:210, in cupy_backends.cuda.api.driver.moduleLoadData()

File cupy_backends/cuda/api/driver.pyx:60, in cupy_backends.cuda.api.driver.check_status()

CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

Intron7 commented 1 year ago

@cjnolet The Multi_GPU notebook runs perfectly fine with the default of running 100 genes at once during regress_out.

cjnolet commented 1 year ago

@Intron7 what hardware did you use to test the UVM notebook? I've found that illegal memory error can commonly happen when 1) precision overflow for an indexing type (like using an int for nnz in sparse data) or 2) when the memory is oversubscribed by too much.

Intron7 commented 1 year ago

@cjnolet I ran it on a node with an 80GB A100 PCIe and on one of the 80GB A100s in the dgx both the same error.

cjnolet commented 1 year ago

Thanks for verifying that, @Intron7! I'll try and take a little deeper look as well

Intron7 commented 1 year ago

Thanks for verifying that, @Intron7! I'll try and take a little deeper look as well

It could also be our badly setup GPUs. So if this notebooks work on your end I think its us.

NVIDIA-Genomics-Research / rapids-single-cell-examples

scanpy_funcs: Use multi-target regression #104