CNugteren / CLBlast

Tuned OpenCL BLAS
Apache License 2.0
1.05k stars 205 forks source link

Banded matrices required buffer size calculated incorrectly (GBMV, HBMV, SBMV & TBMV) #538

Open BadgerKing7 opened 5 months ago

BadgerKing7 commented 5 months ago

GBMV, HBMV and SBMV use MatVec directly, passing their own buffers into the function, unlike TBMV which passes a scratch buffer for just vector X into MatVec. This means that for all 4 of these operations TestMatrixA is applied directly on the user provided buffer for the A matrix.

The required buffer size of all A matrices is calculated as (ld * (two - 1) + one + offset) * sizeof(T) which for GBMV, HBMV, SBMV and TBMV called with row major matrices expands to (a_ld * (m - 1) + kl + ku + 1 + a_offset) * sizeof(T). However, the BLAS interface defines the size of matrix A for all 4 routines as a_ld * n which means that user buffers can lead to kInsufficientMemoryA errors even though they hold enough memory according to the interface spec.

For HBMV, SBMV and TBMV this issue is somewhat hidden by passing the value of n into MatVec's m parameter so the error is only encountered when (kl + ku + 1) > a_ld.

CNugteren commented 4 months ago

Thank you for reporting this issue, and apologies for the late reply. I think I understand most of what you are saying, but I'm not really sure what you mean with the BLAS interface and the interface spec. Do you refer to something inside CLBlast, or something like the Netlib BLAS specification or so (e.g. this page)?

BTW, I'm open to reviewing fixes for this issue if you have something in mind.

BadgerKing7 commented 4 months ago

Do you refer to something inside CLBlast, or something like the Netlib BLAS specification or so (e.g. this page)?

I apologize for the lack of clarity, I was referring to the Netlib documentation to which you linked. On that page, the relevant snippet is: image

BTW, I'm open to reviewing fixes for this issue if you have something in mind.

I will be looking for a fix and opening a pull request when I have something worth reviewing, but I cannot promise a timeframe since doing it without affecting other level 2 operations seems to be harder than appears at first look because the MatVec function is the one that needs changing. I have tried only changing input validation logic to allow for buffers of strictly a_ld * n to pass but doing this results in incorrect outputs afterwards. I will update when I have a way for you to reproduce this behavior.

CNugteren commented 3 months ago

seems to be harder than appears at first look because the MatVec function is the one that needs changing

What about adding an extra argument to optionally disable the matrix A check in MatVec and simply add an extra check in the relevant routines? Something along this patch:

diff --git a/src/routines/level2/xgbmv.cpp b/src/routines/level2/xgbmv.cpp
index e80b9a9..c37ba10 100644
--- a/src/routines/level2/xgbmv.cpp
+++ b/src/routines/level2/xgbmv.cpp
@@ -42,6 +42,9 @@ void Xgbmv<T>::DoGbmv(const Layout layout, const Transpose a_transpose,
   auto kl_real = (rotated) ? ku : kl;
   auto ku_real = (rotated) ? kl : ku;

+  // The matrix A has different constraints compared to what is normally tested in MatVec below
+  TestMatrixBanded(n, kl, ku, a_buffer, a_offset, a_ld);
+
   // Runs the generic matrix-vector multiplication, disabling the use of fast vectorized kernels.
   // The specific hermitian matrix-accesses are implemented in the kernel guarded by the
   // ROUTINE_GBMV define.
@@ -52,7 +55,8 @@ void Xgbmv<T>::DoGbmv(const Layout layout, const Transpose a_transpose,
          x_buffer, x_offset, x_inc, beta,
          y_buffer, y_offset, y_inc,
          fast_kernels, fast_kernels,
-         0, false, kl_real, ku_real);
+         0, false, kl_real, ku_real,
+         /*do_test_matrix_a=*/false);
 }

 // =================================================================================================
diff --git a/src/routines/level2/xgemv.cpp b/src/routines/level2/xgemv.cpp
index 63dab9f..9a5761c 100644
--- a/src/routines/level2/xgemv.cpp
+++ b/src/routines/level2/xgemv.cpp
@@ -64,7 +64,7 @@ void Xgemv<T>::MatVec(const Layout layout, const Transpose a_transpose,
                       const Buffer<T> &y_buffer, const size_t y_offset, const size_t y_inc,
                       bool fast_kernel, bool fast_kernel_rot,
                       const size_t parameter, const bool packed,
-                      const size_t kl, const size_t ku) {
+                      const size_t kl, const size_t ku, const bool do_test_matrix_a) {

   // Makes sure all dimensions are larger than zero
   if (m == 0 || n == 0) { throw BLASError(StatusCode::kInvalidDimension); }
@@ -92,7 +92,7 @@ void Xgemv<T>::MatVec(const Layout layout, const Transpose a_transpose,

   // Tests the matrix and the vectors for validity
   if (packed) { TestMatrixAP(n, a_buffer, a_offset); }
-  else { TestMatrixA(a_one, a_two, a_buffer, a_offset, a_ld); }
+  else if (do_test_matrix_a) { TestMatrixA(a_one, a_two, a_buffer, a_offset, a_ld); }
   TestVectorX(n_real, x_buffer, x_offset, x_inc);
   TestVectorY(m_real, y_buffer, y_offset, y_inc);

diff --git a/src/routines/level2/xgemv.hpp b/src/routines/level2/xgemv.hpp
index 1e1fa72..1ba5dde 100644
--- a/src/routines/level2/xgemv.hpp
+++ b/src/routines/level2/xgemv.hpp
@@ -46,7 +46,7 @@ class Xgemv: public Routine {
               const Buffer<T> &y_buffer, const size_t y_offset, const size_t y_inc,
               bool fast_kernel, bool fast_kernel_rot,
               const size_t parameter, const bool packed,
-              const size_t kl, const size_t ku);
+              const size_t kl, const size_t ku, const bool do_test_matrix_a = true);
 };

 // =================================================================================================
diff --git a/src/utilities/buffer_test.hpp b/src/utilities/buffer_test.hpp
index 4a2a2c9..9097a21 100644
--- a/src/utilities/buffer_test.hpp
+++ b/src/utilities/buffer_test.hpp
@@ -65,6 +65,17 @@ void TestMatrixAP(const size_t n, const Buffer<T> &buffer, const size_t offset)
   } catch (const Error<std::runtime_error> &e) { throw BLASError(StatusCode::kInvalidMatrixA, e.what()); }
 }

+// Tests matrix 'A' (for banded matrix-vector computations) for validity
+template <typename T>
+void TestMatrixBanded(const size_t n, const size_t kl, const size_t ku, const Buffer<T> &buffer,
+                      const size_t offset, const size_t ld, const bool test_lead_dim = true) {
+  if (test_lead_dim && ld < kl + ku) { throw BLASError(StatusCode::kInvalidLeadDimA); }
+  try {
+    const auto required_size = (ld * n + offset) * sizeof(T);
+    if (buffer.GetSize() < required_size) { throw BLASError(StatusCode::kInsufficientMemoryA); }
+  } catch (const Error<std::runtime_error> &e) { throw BLASError(StatusCode::kInvalidMatrixA, e.what()); }
+}
+
 // =================================================================================================

 // Tests vector 'X' for validity
andreimatraguna commented 2 weeks ago

Hi CNugteren, this is a good idea, I implemented what you proposed, can you check if its all good? commit SHA 864cb778bca89279d927ae8c71b969baaeb04535

CNugteren commented 2 weeks ago

Can you make a PR to this repository? And then also ask @BadgerKing7 for a review.

andreimatraguna commented 2 weeks ago

CNugteren, can u give me access to publish the branch? (I asked @BadgerKing7, his waiting to review my changes after I create the PR)

CNugteren commented 2 weeks ago

The normal flow to make a PR to a repository you don't have write permissions to (such as CLBlast) is to fork the repository, make your changes (possibly on a branch), push your changes, and then make a PR to this repository from your fork. If you need help, I'm sure there's a guide about this somewhere on GitHub.

andreimatraguna commented 2 weeks ago

@BadgerKing7 can u look at this PR https://github.com/CNugteren/CLBlast/pull/559