Switch from numba compilation to vectorized code

aarmey commented 11 months ago

NumPy is very good at vectorizing large operations, so that they are extremely high performance. Consequently, it is quite common for Python code to actually be faster than numba counterparts if expressions can be adjusted to avoid loops.

I noticed that the code spends most of its time in pacmap_grad(), and that this does not leverage multiple CPU cores. Switching this to vectorized Python code ends up much faster for large datasets.

hyhuang00 commented 11 months ago

Hi Prof. Meyer, thank you so much for contributing to PaCMAP! This is indeed a very interesting and important observation. Before merging, I will run some tests on local machines and quantify the improvements. This merge request will be updated once profiling results are retrieved!

hyhuang00 commented 11 months ago

It seems like the latest commit on placing back numba compile for pacmap_grad() may lead to error. I'm currently commenting it out and running some large scale experiments to check the performance.

Pasting the output below:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/users/hh219/FastPaCMAP/PaCMAP/source/pacmap/__init__.py", line 12, in <module>
    from .pacmap_numpy import PaCMAP as PaCMAPNumpy
  File "/home/users/hh219/FastPaCMAP/PaCMAP/source/pacmap/pacmap_numpy.py", line 233, in <module>
    def pacmap_grad(Y, pair_neighbors, pair_MN, pair_FP, w_neighbors, w_MN, w_FP):
  File "/home/users/hh219/mambaforge/envs/fastviz/lib/python3.10/site-packages/numba/core/decorators.py", line 219, in wrapper
    disp.compile(sig)
  File "/home/users/hh219/mambaforge/envs/fastviz/lib/python3.10/site-packages/numba/core/dispatcher.py", line 965, in compile
    cres = self._compiler.compile(args, return_type)
  File "/home/users/hh219/mambaforge/envs/fastviz/lib/python3.10/site-packages/numba/core/dispatcher.py", line 129, in compile
    raise retval
  File "/home/users/hh219/mambaforge/envs/fastviz/lib/python3.10/site-packages/numba/core/dispatcher.py", line 139, in _compile_cached
    retval = self._compile_core(args, return_type)
  File "/home/users/hh219/mambaforge/envs/fastviz/lib/python3.10/site-packages/numba/core/dispatcher.py", line 152, in _compile_core
    cres = compiler.compile_extra(self.targetdescr.typing_context,
  File "/home/users/hh219/mambaforge/envs/fastviz/lib/python3.10/site-packages/numba/core/compiler.py", line 716, in compile_extra
    return pipeline.compile_extra(func)
  File "/home/users/hh219/mambaforge/envs/fastviz/lib/python3.10/site-packages/numba/core/compiler.py", line 452, in compile_extra
    return self._compile_bytecode()
  File "/home/users/hh219/mambaforge/envs/fastviz/lib/python3.10/site-packages/numba/core/compiler.py", line 520, in _compile_bytecode
    return self._compile_core()
  File "/home/users/hh219/mambaforge/envs/fastviz/lib/python3.10/site-packages/numba/core/compiler.py", line 499, in _compile_core
    raise e
  File "/home/users/hh219/mambaforge/envs/fastviz/lib/python3.10/site-packages/numba/core/compiler.py", line 486, in _compile_core
    pm.run(self.state)
  File "/home/users/hh219/mambaforge/envs/fastviz/lib/python3.10/site-packages/numba/core/compiler_machinery.py", line 368, in run
    raise patched_exception
  File "/home/users/hh219/mambaforge/envs/fastviz/lib/python3.10/site-packages/numba/core/compiler_machinery.py", line 356, in run
    self._runPass(idx, pass_inst, state)
  File "/home/users/hh219/mambaforge/envs/fastviz/lib/python3.10/site-packages/numba/core/compiler_lock.py", line 35, in _acquire_compile_lock
    return func(*args, **kwargs)
  File "/home/users/hh219/mambaforge/envs/fastviz/lib/python3.10/site-packages/numba/core/compiler_machinery.py", line 311, in _runPass
    mutated |= check(pss.run_pass, internal_state)
  File "/home/users/hh219/mambaforge/envs/fastviz/lib/python3.10/site-packages/numba/core/compiler_machinery.py", line 273, in check
    mangled = func(compiler_state)
  File "/home/users/hh219/mambaforge/envs/fastviz/lib/python3.10/site-packages/numba/core/typed_passes.py", line 105, in run_pass
    typemap, return_type, calltypes, errs = type_inference_stage(
  File "/home/users/hh219/mambaforge/envs/fastviz/lib/python3.10/site-packages/numba/core/typed_passes.py", line 83, in type_inference_stage
    errs = infer.propagate(raise_errors=raise_errors)
  File "/home/users/hh219/mambaforge/envs/fastviz/lib/python3.10/site-packages/numba/core/typeinfer.py", line 1086, in propagate
    raise errors[0]
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<built-in function getitem>) found for signature:

 >>> getitem(array(float64, 1d, C), Tuple(slice<a:b>, none))

There are 22 candidate implementations:
      - Of which 20 did not match due to:
      Overload of function 'getitem': File: <numerous>: Line N/A.
        With argument(s): '(array(float64, 1d, C), Tuple(slice<a:b>, none))':
       No match.
      - Of which 2 did not match due to:
      Overload in function 'GetItemBuffer.generic': File: numba/core/typing/arraydecl.py: Line 166.
        With argument(s): '(array(float64, 1d, C), Tuple(slice<a:b>, none))':
       Rejected as the implementation raised a specific error:
         NumbaTypeError: unsupported array index type none in Tuple(slice<a:b>, none)
  raised from /home/users/hh219/mambaforge/envs/fastviz/lib/python3.10/site-packages/numba/core/typing/arraydecl.py:72

During: typing of intrinsic-call at /home/users/hh219/FastPaCMAP/PaCMAP/source/pacmap/pacmap_numpy.py (244)
During: typing of static-get-item at /home/users/hh219/FastPaCMAP/PaCMAP/source/pacmap/pacmap_numpy.py (244)

File "../../../../../home/users/hh219/FastPaCMAP/PaCMAP/source/pacmap/pacmap_numpy.py", line 244:
def pacmap_grad(Y, pair_neighbors, pair_MN, pair_FP, w_neighbors, w_MN, w_FP):
    <source elided>

    grad[pair_neighbors[:, 0]] += w1[:, np.newaxis] * y_ij

hyhuang00 commented 11 months ago

In my experiments, the implementation with numpy will make the algorithm about ~3.5 times slower, on both small and large datasets. I guess there are some differences between the test benches we were using, so I'm posting some information on my end.

I'm using a machine with Intel Xeon Ice Lake Gold 5317 Processors, 3.0GHz 18MB Cache (64GB RAM - 12 cores). Here are the versions of the most important dependencies:

numba                     0.56.0
numpy                     1.22.4  
cython                    0.29.32
scikit-learn              1.1.2

aarmey commented 11 months ago

Thanks for trying this out, @hyhuang00! I never actually timed my code, I was only looking at the profiling, and so I think I ended up with a misleading impression of the difference based on how the sampling data was collected. When I time my version vs. master my timing agrees with your observation.

Sorry for the trouble! I still think there should be a way to speed up this function since it only uses one CPU core. However, I'll more carefully benchmark my code before providing a new version.

hyhuang00 commented 11 months ago

No worries! Again, thank you so much for contributing to PaCMAP. I didn't notice that the numba compiled code was not using multiple cores as intended, and I think there will be ways to make this algorithm even faster.

YingfanWang / PaCMAP

Switch from numba compilation to vectorized code #63