benfred / implicit

Fast Python Collaborative Filtering for Implicit Feedback Datasets
https://benfred.github.io/implicit/
MIT License
3.57k stars 612 forks source link

Performance regression with Cython 3.0.0 #678

Closed benfred closed 1 year ago

benfred commented 1 year ago

Cython 3.0.0 has recently been released - and we're seeing this cause a large performance regression over Cython 0.29.21

Running the movielens 100k example (python examples/movielens.py --variant=100k) and the training component takes 121s when compiled with Cython 3.0.0, but only takes 1.9s with Cython 0.29.21 (63x slower). The inference performance seems to be roughly similar here.

Running the ALS unittests (pytest tests/als_test.py ) with Cython 3.0.0 takes 589s where it only takes 28s with Cython 0.29.21 (21x slower).

benfred commented 1 year ago

My current theory is that this is because because of nogil functions that aren't marked noexcept (like in https://cython.readthedocs.io/en/latest/src/userguide/migrating_to_cy30.html#exception-values-and-noexcept - )

benfred commented 1 year ago

Looks like this is from the noexcept issue.

Adding a line with #cython: legacy_implicit_noexcept=True to implicit/cpu/_als.pyx causes the training time on ml100k to drop down to be the same as with Cython v0.29.

It does seem that marking each of the functions as noexcept nogil isn't sufficient (for instance, by applying this change):

--- a/implicit/cpu/_als.pyx
+++ b/implicit/cpu/_als.pyx
@@ -15,40 +15,40 @@ from libc.string cimport memcpy, memset

 # lapack/blas wrappers for cython fused types
 cdef inline void axpy(int * n, floating * da, floating * dx, int * incx, floating * dy,
-                      int * incy) nogil:
+                      int * incy) noexcept nogil:
     if floating is double:
         cython_blas.daxpy(n, da, dx, incx, dy, incy)
     else:
         cython_blas.saxpy(n, da, dx, incx, dy, incy)

 cdef inline void symv(char *uplo, int *n, floating *alpha, floating *a, int *lda, floating *x,
-                      int *incx, floating *beta, floating *y, int *incy) nogil:
+                      int *incx, floating *beta, floating *y, int *incy) noexcept nogil:
     if floating is double:
         cython_blas.dsymv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)
     else:
         cython_blas.ssymv(uplo, n, alpha, a, lda, x, incx, beta, y, incy)

-cdef inline floating dot(int *n, floating *sx, int *incx, floating *sy, int *incy) nogil:
+cdef inline floating dot(int *n, floating *sx, int *incx, floating *sy, int *incy) noexcept nogil:
     if floating is double:
         return cython_blas.ddot(n, sx, incx, sy, incy)
     else:
         return cython_blas.sdot(n, sx, incx, sy, incy)

-cdef inline void scal(int *n, floating *sa, floating *sx, int *incx) nogil:
+cdef inline void scal(int *n, floating *sa, floating *sx, int *incx) noexcept nogil:
     if floating is double:
         cython_blas.dscal(n, sa, sx, incx)
     else:
         cython_blas.sscal(n, sa, sx, incx)

 cdef inline void posv(char * u, int * n, int * nrhs, floating * a, int * lda, floating * b,
-                      int * ldb, int * info) nogil:
+                      int * ldb, int * info) noexcept nogil:
     if floating is double:
         cython_lapack.dposv(u, n, nrhs, a, lda, b, ldb, info)
     else:
         cython_lapack.sposv(u, n, nrhs, a, lda, b, ldb, info)

 cdef inline void gesv(int * n, int * nrhs, floating * a, int * lda, int * piv, floating * b,
-                      int * ldb, int * info) nogil:
+                      int * ldb, int * info) noexcept nogil:
     if floating is double:
         cython_lapack.dgesv(n, nrhs, a, lda, piv, b, ldb, info)
     else:

Applying this change only reduced training time on the ml100k dataset from 121s to 63s.

However there were a bunch of warnings like

warning: scipy/linalg/cython_blas.pxd:20:72: Implicit noexcept declaration is deprecated. Function declaration should contain 'noexcept' keyword.

From the cython_blas / cython_lapack functions we are using . This is probably because I'm compiling with scipy v1.10.1 - which doesn't have this fix applied https://github.com/scikit-learn/scikit-learn/issues/25609.