Efficiently generate submatrices of dense sketching operators

rileyjmurray commented 1 year ago

Context

LSKGE3 accepts input that defines a sketching operator $S_0$. It lets you apply a submatrix of $S_0$ if you'd like: $B = \mathrm{submat}(S_0) * A$, where the particular submatrix of $S_0$ is determined by other function arguments.

Example: you have "metadata" that defines $S_0$, but you only want to compute $S A$ where $S = S_0[100:200, :]$.
From a user's perspective, there is no requirement that the entire sketching operator is generated explicitly in memory. Whether or not that happens is an implementation detail.

The current implementation is naive. You can see the naivety by looking first at the line

https://github.com/BallisticLA/RandBLAS/blob/68116d6825ee89d7b0af911dfb0d39234e3d98dd/RandBLAS/dense.hh#L503

which generates the entire sketching operator $S_0$, even though we only end up needing $\mathrm{submat}(S_0)$.
The next place that one can see the naivety is how we call GEMM. Specifically, at line

https://github.com/BallisticLA/RandBLAS/blob/68116d6825ee89d7b0af911dfb0d39234e3d98dd/RandBLAS/dense.hh#L543

we have to pass a pointer to the $(0, 0)$ location of $\mathrm{submat}(S_0)$ as a submatrix within $S_0$.

Project / Task

Improve our capabilities of generating submatrices of sketching operators.

Consider a sketching operator $S_0$. As an example, suppose that based on your mathematical application at hand, you imagine or conceptualize $S_0$ as a block matrix, partitioned as

    S_0 = \begin{bmatrix} (S_0)_{00} & (S_0)_{01} \\\ (S_0)_{10} & (S_0)_{11} \end{bmatrix}.

Suppose further that you want to apply

\mathrm{submat}(S_0) = (S_0)_{11}

to a data matrix $A$. The modifications to LSKGE3 in this project would let us generate and apply $(S0){11}$ without generating any other block in $S_0$.

Doing this correctly will require careful implementation and thorough tests.
- A serial implementation would not be too hard.
- A proper implementation would need to allow for OpenMP parallelism.
- Our current implementation of generating sketching operators uses OpenMP parallelism.
Any "vaguely reasonable" implementation will have big efficiency gains over our current method.
Once we can do this correctly, we modify the LSKGE3 implementation to use it, rather than the naive method.

rileyjmurray commented 1 year ago

@kaiwenhe7 I've merged the big PR (#50) that I mentioned in our meeting. Please update your fork's main branch so that it's even with the official RandBLAS main branch. Then create a new branch on your fork where you'll conduct this work.

The first step is to add your code as a new function in RandBLAS/dense.hh. Then you'll change the implementation of the function defined here: https://github.com/BallisticLA/RandBLAS/blob/d21aac75cbb79f4fedf768ace1d1f034134eda4f/RandBLAS/dense.hh#L208-L214 so that it appropriately (and hopefully, trivially) falls back on your new function. You'll have completed the first step when you verify that these changes haven't broken existing RandBLAS tests. Let me know if you have problems with this.

Once you have existing tests passing, you'll want to update RandBLAS tests so that they include the test cases you showed me this morning. I'll specifically want you to update this file to include (1) a new class that extends ::testing::Test and (2) several new TEST_F code blocks which use that class. You can look at the existing code in that file and look up GoogleTest documentation to get a sense of how things work. If you run into trouble with something that seems like it should be simple, let me know, and I can try to help.

You can (and probably should) open a work-in-progress PR on the official RandBLAS repo as soon as you start working on the first step above. Once your PR completes both of the steps above, we can merge it. At that point, I'll modify the RandBLAS functions for dense sketching to actually take advantage of the new functionality you'll have added.

rileyjmurray commented 1 year ago

@kaiwenhe7, I'm pasting your emailed message here and responding.

I just wanted a bit more clarification on the point regarding the reimplementation of fill_rmat. Would we need to change fill_rmat so that it calls the submatrix function? Looking at the implementation for fill_rmat, it would be more efficient to just keep the implementation than calling my function due to the submatrix function doing extra computation for bookkeeping.

My suggestion is that fill_rmat should call your new submatrix function. I suspect that the efficiency losses will be minimal. Even if they're noticeable, we still want to keep things to a single function. It's very important that we can confidently define the details of RandBLAS' behavior.

kaiwenhe7 commented 1 year ago

@rileyjmurray I have made all the changes that you have requested and I made a pull request. Let me know if you have any questions for concerns.

rileyjmurray commented 1 year ago

@kaiwenhe7's PR #51 made major progress on this. Up next I'll modify dense sketching functions to take advantage of this new functionality.

rileyjmurray commented 1 year ago

Resolved by #52.

BallisticLA / RandBLAS

Efficiently generate submatrices of dense sketching operators #36

Context

Project / Task