BallisticLA / RandBLAS

A header-only C++ library for sketching in randomized linear algebra
https://randblas.readthedocs.io/en/latest/

Adding rows or columns to dense sketching operators #53

Open rileyjmurray opened 1 year ago

rileyjmurray commented 1 year ago

Edit: Per Kaiwen's comment, this is actually quite easy. The main things users might still need are (1) populating `S.next_state` immediately when `S` is constructed and (2) creating an API that doesn't require users to call fill_dense_submat. Besides those things, we'd want to update the web documentation. Right now there are relevant docs on "generating your N-th sketching operator, for N > 1". It would be good to adapt the (very long!) content of my original comment below into proper web docs.

Original content

RandBLAS should make it easy to implicitly extend an initial sketching operator $S_1$ into an augmented operator $S = [S_1; S_2]$ or $S = [S_1, S_2]$. There are four scenarios that a user can find themselves in where this can be done without generating $S$ from scratch. I've provided them for completeness below.

Remark 1. In all four scenarios, the idea is that we should be able to use the `next_state` of $S_1$ as the `seed_state` of $S_2$. Our current implementation of `DenseSkOp` doesn't set `S.next_state` until the user calls fill_dense on `S`. So at the very least, we need to change our code so that `S.next_state` is set the moment that `S` is constructed.
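To make the chaining concrete, here is a minimal sketch of the intended pattern. It assumes a `DenseSkOp` constructor that takes a distribution and a seed state, and that `next_state` is valid immediately after construction (i.e., after the change proposed above); exact constructor signatures are assumptions based on the current docs.

```cpp
#include <RandBLAS.hh>
#include <cstdint>

int main() {
    int64_t d1 = 8, d2 = 4, m = 32;

    // Seed-chaining pattern (API details assumed, not exact).
    RandBLAS::RNGState seed_state(0);
    RandBLAS::DenseDist D1(d1, m);
    RandBLAS::DenseSkOp<double> S1(D1, seed_state);

    // Use the state where S1's generator leaves off to seed S2.
    // This requires S1.next_state to be set at construction time,
    // which is exactly the change proposed in this issue.
    RandBLAS::DenseDist D2(d2, m);
    RandBLAS::DenseSkOp<double> S2(D2, S1.next_state);
    return 0;
}
```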

Remark 2. The four scenarios below show two reasons why you'd want to extend a sketching operator: you might be trying to improve statistical quality by increasing sketch size (Scenarios 1 and 4), or you might be incorporating new data into a sketch of fixed size (Scenarios 2 and 3). The unifying perspective on Scenarios 1 and 4 is that they both add long-axis vectors to the sketching operator. The unifying perspective on Scenarios 2 and 3 is that they both add short-axis vectors to the sketching operator. `DenseDist` objects have a `major_axis` member, which states whether operators sampled from that distribution are short-axis or long-axis major. So when a user specifies the major axis for a sketching operator, they're effectively declaring whether they might later want to improve the statistical quality of a sketch (long-axis) or update a sketch to incorporate more data (short-axis).

Remark 3. Although this post is extremely long, it isn't prescriptive about how our API should expose these kinds of operations. It might be that we only introduce one or two helper functions. Maybe the main thing will be to illustrate these operations with example code and mention them in our documentation. It's hard to say! I'm opening this issue to help @kaiwenhe7 get started on this topic.

Scenario 1

Suppose $S_1$ is a wide $d_1 \times m$ row-major `DenseSkOp` with seed $c$. It's easy to generate a $d_2 \times m$ row-major `DenseSkOp` $S_2$ in such a way that $S = [S_1; S_2]$ is the same as the $(d_1 + d_2) \times m$ row-major `DenseSkOp` with seed $c$.

This scenario arises if we have a fixed data matrix $A$, an initial sketch $B_1 = S_1 A$, and we decide we want a larger sketch for statistical reasons. The updated sketch $B = S A$ can be expressed as

$$
B = \begin{bmatrix} S_1 A \\ S_2 A \end{bmatrix}.
$$

If $(S_1, S_2, S)$ satisfy the assumptions above, then the final sketch $B = SA$ will be the same regardless of whether we computed the sketch in one step or two steps. This is desirable for benchmarking and debugging RandNLA algorithms where the sketch size is a tuning parameter.
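Here is a hedged end-to-end sketch of this scenario. It assumes `fill_dense(S)` materializes `S` into a member buffer `S.buff`, and that a wide row-major operator is filled one row at a time; both assumptions are consistent with the discussion in this thread but not guaranteed by the API.

```cpp
#include <RandBLAS.hh>
#include <cassert>
#include <cstdint>
#include <cstring>

int main() {
    int64_t d1 = 8, d2 = 4, m = 32;
    uint32_t c = 0;  // shared seed

    // One-shot: the (d1 + d2) x m row-major operator with seed c.
    RandBLAS::DenseDist D(d1 + d2, m);
    RandBLAS::DenseSkOp<double> S(D, RandBLAS::RNGState(c));
    RandBLAS::fill_dense(S);

    // Two-step: S1 with seed c, then S2 seeded with S1.next_state.
    // (S1.next_state is set here even under current semantics, since
    // fill_dense(S1) has already been called.)
    RandBLAS::DenseDist D1(d1, m), D2(d2, m);
    RandBLAS::DenseSkOp<double> S1(D1, RandBLAS::RNGState(c));
    RandBLAS::fill_dense(S1);
    RandBLAS::DenseSkOp<double> S2(D2, S1.next_state);
    RandBLAS::fill_dense(S2);

    // Row-major storage means stacking [S1; S2] is buffer concatenation,
    // so the one-shot and two-step operators should match entry-wise.
    assert(std::memcmp(S.buff, S1.buff, d1 * m * sizeof(double)) == 0);
    assert(std::memcmp(S.buff + d1 * m, S2.buff, d2 * m * sizeof(double)) == 0);
    return 0;
}
```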

Scenario 2

Suppose $S_1$ is a wide $d \times m_1$ column-major `DenseSkOp` with seed $c$. It's easy to generate a $d \times m_2$ column-major `DenseSkOp` $S_2$ so that $S = [S_1, S_2]$ is the same as the $d \times (m_1 + m_2)$ column-major `DenseSkOp` with seed $c$.

This situation arises if we have an initial data matrix $A_1$, an initial sketch $B_1 = S_1 A_1$, and then a new matrix $A_2$ arrives in such a way that we want a sketch of $A = [A_1; A_2]$. To compute $B = SA$, we update $B_1$ according to the formula

$$
B = \begin{bmatrix} S_1 & S_2 \end{bmatrix} \begin{bmatrix} A_1 \\ A_2 \end{bmatrix} = B_1 + S_2 A_2.
$$

If $(S_1, S_2, S)$ satisfy the assumptions above, then $B$ will be the same as though we started with all of $A$ from the very beginning. This is useful for benchmarking and debugging RandNLA algorithms that involve operating on data matrices that increase in size over some number of iterations.
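The update itself is just a GEMM with `beta = 1`. Below is a minimal sketch with the multiply written as an explicit loop, so as not to guess at RandBLAS's multiplication API; in real code this would be a single call to the library's sketching routine with `beta = 1` so that $B_1$ is updated in place. The member name `buff` and the column-major fill order are assumptions consistent with this thread.

```cpp
#include <RandBLAS.hh>
#include <cstdint>
#include <vector>

// Scenario 2 update: B = B1 + S2 * A2.
// S2 is d x m2 (column-major, as assumed above), A2 is m2 x n, and
// B is d x n column-major, holding B1 on entry. S2 is assumed to have
// been populated via fill_dense(S2), with S2.buff as its backing buffer.
void update_sketch(const RandBLAS::DenseSkOp<double> &S2,
                   const std::vector<double> &A2,
                   std::vector<double> &B,
                   int64_t d, int64_t m2, int64_t n) {
    for (int64_t j = 0; j < n; ++j)
        for (int64_t k = 0; k < m2; ++k)
            for (int64_t i = 0; i < d; ++i)
                B[i + j*d] += S2.buff[i + k*d] * A2[k + j*m2];
}
```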

Scenario 3

Let $S_1$ be a tall $n \times d_1$ column-major `DenseSkOp` with seed $c$. We can easily generate an $n \times d_2$ column-major `DenseSkOp` $S_2$ so that $S = [S_1, S_2]$ is the same as the $n \times (d_1 + d_2)$ column-major `DenseSkOp` with seed $c$.

This arises when we have a fixed data matrix $A$, an initial sketch $B_1 = A S_1$, and we decide we want a larger sketch for statistical reasons. The updated sketch $B = A S$ can be expressed as

$$
B = \begin{bmatrix} A S_1 & A S_2 \end{bmatrix}.
$$

If $(S_1, S_2, S)$ satisfy the assumptions above, then the final sketch $B$ will be the same regardless of whether we computed the sketch in one step or two steps. This is desirable for benchmarking and debugging RandNLA algorithms where the sketch size is a tuning parameter.
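As in Scenario 1, the two-step result matches the one-shot sketch. A nice practical detail: if the sketch is stored column-major, appending the new columns $A S_2$ to $B_1$ is just buffer concatenation. A minimal sketch of that bookkeeping (storage layout assumed):

```cpp
#include <vector>

// Append the new sketch columns B2 = A * S2 to the existing sketch
// B1 = A * S1. Both are stored column-major with the same number of
// rows, so "appending columns" is concatenating the underlying buffers.
void append_sketch_cols(std::vector<double> &B,         // grows from k x d1 to k x (d1 + d2)
                        const std::vector<double> &B2)  // k x d2
{
    B.insert(B.end(), B2.begin(), B2.end());
}
```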

Scenario 4

Suppose $S_1$ is a tall $n_1 \times d$ row-major `DenseSkOp` with seed $c$. It's easy to generate an $n_2 \times d$ row-major `DenseSkOp` $S_2$ in such a way that $S = [S_1; S_2]$ is the same as the $(n_1 + n_2) \times d$ row-major `DenseSkOp` with seed $c$.

This situation arises if we have an initial data matrix $A_1$, an initial sketch $B_1 = A_1 S_1$, and then a new matrix $A_2$ arrives in such a way that we want a sketch of $A = [A_1, A_2]$. To compute $B = AS$, we update $B_1$ according to the formula

$$
B = \begin{bmatrix} A_1 & A_2 \end{bmatrix} \begin{bmatrix} S_1 \\ S_2 \end{bmatrix} = B_1 + A_2 S_2.
$$

If $(S_1, S_2, S)$ satisfy the assumptions above, then $B$ will be the same as though we started with all of $A$ from the very beginning. This is useful for benchmarking and debugging RandNLA algorithms that involve operating on data matrices that increase in size over some number of iterations.

kaiwenhe7 commented 1 year ago

@rileyjmurray I was thinking about this, and given the fill_submat function, it should be trivial to extend the sketching operator. It should only require another call to fill_submat with the correct ptr.

For example, take Scenario 1. Given a data matrix $A$ of dimension $m \times n$, I first generate a wide row-major $d_1 \times m$ sketching operator $S_1$ with seed $s$ and compute an initial sketch $B_1 = S_1 A$. If I then want a larger sketch $S = [S_1; S_2]$, it should take just one call to fill_submat for $S_2$, where the ptr is $d_1 \cdot m$, the dimension is $d_2 \times m$, and the seed is $s$.

However, it is very nontrivial if we are given a row-major sketching operator and want to extend the number of columns, or vice versa. Luckily, it seems like none of the scenarios require this.
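A sketch of what Kaiwen describes, with a deliberately hypothetical `fill_dense_submat`-style call (the real function's argument list may well differ); the essential point is only the linear offset `ptr = d1 * m` into the counter-based random stream.

```cpp
#include <RandBLAS.hh>
#include <cstdint>
#include <vector>

int main() {
    int64_t d1 = 8, d2 = 4, m = 32;
    uint32_t s = 0;

    // S = [S1; S2] is (d1 + d2) x m, row-major, seed s. S1 consumes the
    // first d1*m entries of the random stream, so S2's block starts at
    // linear offset d1*m.
    std::vector<double> S2_buff(d2 * m);
    int64_t ptr = d1 * m;  // skip the entries belonging to S1

    // Hypothetical call; argument names and order are illustrative only:
    // fill_dense_submat(m,               // columns of the full operator
    //                   S2_buff.data(),  // destination buffer
    //                   d2, m,           // dimensions of the block
    //                   ptr,             // linear offset into the stream
    //                   RandBLAS::RNGState(s));
    return 0;
}
```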

rileyjmurray commented 1 year ago

@kaiwenhe7 -- you're absolutely right! I can't believe I didn't think about that. I guess one of the key questions, then, is what convenience functions we might use to expose this fact in the API.

We'd also want to get the analogous functionality for sparse sketching operators. I'm cleaning that API up now. Once my pending PR is merged, you can work on that.

There's one other thing that stands out in generating submatrices of dense sketching operators: the tests for generating subsets of rows are extremely expensive (they take almost 10x as long as the rest of the test suite). It would be helpful if you could look into why that's happening and report back.

kaiwenhe7 commented 1 year ago

@rileyjmurray Regarding the testing time, the tests I have loop through a lot of different submatrices. For example, the TEST_F for the col_wise test generates 1900 submatrices of dimension 40 by 100.

rileyjmurray commented 1 year ago

@kaiwenhe7, sure, but row_wise takes much longer than the others. Here is the timing data from my M1 Pro MacBook:

```
(rla310) Rileys-MacBook-Pro-2:RandBLAS-build riley$ ctest
Test project /Users/riley/BALLISTIC_RNLA/randla/RandBLAS-build
        Start   1: TestDenseMoments.Gaussian
  1/115 Test   #1: TestDenseMoments.Gaussian .................................   Passed    0.13 sec
        Start   2: TestDenseMoments.Uniform
  2/115 Test   #2: TestDenseMoments.Uniform ..................................   Passed    0.10 sec
        Start   3: TestSubmatGeneration.col_wise
  3/115 Test   #3: TestSubmatGeneration.col_wise .............................   Passed    0.10 sec
        Start   4: TestSubmatGeneration.row_wise
  4/115 Test   #4: TestSubmatGeneration.row_wise .............................   Passed    1.66 sec
        Start   5: TestSubmatGeneration.diag
  5/115 Test   #5: TestSubmatGeneration.diag .................................   Passed    0.11 sec
```

We see that TestSubmatGeneration.row_wise took 1.66 seconds, which is 16 times longer than TestSubmatGeneration.col_wise. Why is the test for row_wise so much more expensive? Are there more edge cases you need to consider?

kaiwenhe7 commented 1 year ago

@rileyjmurray Oh, sorry! test_rowwise_smat_gen goes along the row! So for a submatrix dimension of 40 by 100 and a full random matrix dimension of 100 by 2000, the col_wise test generates 60 submatrices and the row_wise test generates 1900 submatrices. So we would expect the row_wise test to be about 32 times slower than the col_wise test... which also doesn't make sense, since row_wise seems to be only 16 times slower. There is a suspicious factor of 2 missing that I will look into.

rileyjmurray commented 1 year ago

@kaiwenhe7, can you change the row_wise tests so that fewer matrices are tested by default? You can keep the general functionality in there, but I think it's sufficient to test row_wise where the starting column index is in {0, 1, 2, 3, 4, 1896, 1897, 1898, 1899} or is one of 50 contiguous values between 4 and 1896 (you can choose those somewhat arbitrarily, but the choice should be deterministic).
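For concreteness, a deterministic index set like the one described above could be built as follows (illustrative only; the function name and the particular middle run are made up):

```cpp
#include <cstdint>
#include <vector>

// Starting column indices for the row_wise test: the first few positions,
// a fixed contiguous run of 50 in the middle, and the last few positions.
std::vector<int64_t> row_wise_start_cols() {
    std::vector<int64_t> idx;
    for (int64_t i = 0; i <= 4; ++i)       idx.push_back(i);  // 0..4
    for (int64_t i = 900; i < 950; ++i)    idx.push_back(i);  // arbitrary fixed run
    for (int64_t i = 1896; i <= 1899; ++i) idx.push_back(i);  // tail
    return idx;
}
```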

What you currently have is more like a torture test. Later this summer we can set up a separate torture-test suite. I want the "main" test suite to be cheap to run because we're currently developing rapidly and potentially breaking things. For example, if I compile with the address sanitizer, the submatrix generation tests take more than 15 seconds. That's agonizingly long when interactively debugging unrelated tests.

Edit: I wouldn't stress about the missing factor of 2 in runtime. Correctness tests pass and that's what we care about.

kaiwenhe7 commented 1 year ago

@rileyjmurray I changed both the row_wise and col_wise tests so that they use 50 contiguous values between 4 and 1896 and between 4 and 100, respectively. I made a new pull request for this change.

Regarding the submatrix extension feature, could you clarify what you mean by convenience functions in your earlier comment?

rileyjmurray commented 1 year ago

Thanks, @kaiwenhe7. I've merged that PR.

In order to do what you're suggesting, the user would need to call fill_dense_submat or fill_dense_submat_impl. Both of these functions are about populating buffers. However, the spirit of the RandBLAS API is to hide the implementation details of sketching operators. This is partly because we want the APIs for dense sketching operators and sparse sketching operators to be nearly identical.

In this context, hiding implementation details at least means not insisting that the user provide their own buffer. They should instead be able to perform all of these tasks with DenseSkOp objects.
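One hypothetical shape for such a convenience function, working purely with DenseSkOp objects; the name, signature, and member accesses (`dist`, `n_cols`, `next_state`) are placeholders based on this thread, not actual RandBLAS API.

```cpp
#include <RandBLAS.hh>
#include <cstdint>

// Hypothetical helper: extend a wide operator S1 by d2 extra rows without
// the caller ever touching a buffer. The only real requirement is that
// S1.next_state is valid when this is called.
template <typename T>
RandBLAS::DenseSkOp<T> augment_rows(const RandBLAS::DenseSkOp<T> &S1, int64_t d2) {
    RandBLAS::DenseDist D2(d2, S1.dist.n_cols);      // member names assumed
    return RandBLAS::DenseSkOp<T>(D2, S1.next_state);
}
```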