inversion performance improvements

sleepybishop commented 3 years ago

Some ideas to improve performance of precode matrix inversion:

During "First Phase", you will find that r == 2 the vast majority of the time, optimizing for this and r <= 3 will improve performance.
Given enough overhead during decoding (eg. rank(gf2_matrix) == L), inversion can stay in gf2 and avoid expensive hdpc operations completely.
With some care, you can work only on the dense submatrix U from "Second Phase" and onwards.
If you are caching the schedule of operations or "prebuilt plan", you can leverage it during "Third Phase" of the initial computation to replay the required operations from the plan to create a sparse U upper instead of matrix multiplication as the rfc implies.

Hopefully these are helpful insights.

cberner commented 3 years ago

Thanks for the suggestions! I spent a while optimizing the r == 1 and r == 2 cases, if you see anything specific that looks like it could be improved though I'd love to know!

Ah yes, decoding is an area that I haven't focused on optimizing very much.

With some care, you can work only on the dense submatrix U from "Second Phase" and onwards. Interesting, that sounds like it could improve encoding performance. Can you tell me more about how that works?

sleepybishop commented 3 years ago

With some care, you can work only on the dense submatrix U from "Second Phase" and onwards. Interesting, that sounds like it could improve encoding performance. Can you tell me more about how that works?

After "First Phase", the i x i matrix in the upper left of A is lower triangular, the operations to turn it into the identity matrix are recorded into the plan but not applied to columns outside of U since it's a given that the i x i submatrix will become an identity matrix.

In order to create the submatrix I_u, the columns 0:i within rows i:L need to be zeroed out, again these changes need not be applied to A since it's a given that they will become zero, instead just the columns within U are modified and the operations required are recorded in the plan.

After this all that is left to do is back solving U_upper using I_u, these operations will not create fill-in in A because all columns to the left are zero.

cberner commented 3 years ago

Ah yes, I think I understand your suggestion.

In order to create the submatrix I_u, the columns 0:i within rows i:L need to be zeroed out, again these changes need not be applied to A since it's a given that they will become zero, instead just the columns within U are modified and the operations required are recorded in the plan.

I do this, by ignoring all columns left of i: https://github.com/cberner/raptorq/blob/95b6b5ae9100d2af9518f76450dba93bcad79902/src/pi_solver.rs#L665

After this all that is left to do is back solving U_upper using I_u, these operations will not create fill-in in A because all columns to the left are zero.

Oh yes, I think I see how this works, and I don't have this optimization. Do you know why the RFC recommends performing the Third Phase by doing this?

the matrix X is multiplied with the submatrix of A consisting of the first i rows of A.  
After this operation, the submatrix of A consisting of the
intersection of the first i rows and columns equals to X, whereas the
matrix U_upper is transformed to a sparse form.

I followed the RFC (https://github.com/cberner/raptorq/blob/95b6b5ae9100d2af9518f76450dba93bcad79902/src/pi_solver.rs#L695), but I see how your suggestion of just continuing gaussian elimination would be simpler

edit If I understand your suggestion correctly, you're saying the 3rd, 4th, and 5th phases can all be combined by apply backwards elimination of U_upper using I_u, after the second phase

sleepybishop commented 3 years ago

Do you know why the RFC recommends performing the Third Phase by doing this?

Because converting the i x i submatrix into the identity matrix makes U_upper pretty dense, making back solving slow.

However like i mentioned in my initial comment:

If you are caching the schedule of operations or "prebuilt plan", you can leverage it during "Third Phase" of the initial computation to replay the required operations from the plan to create a sparse U upper instead of matrix multiplication as the rfc implies.

Instead of maintaining the matrix X you can reverse the operations in the plan that involved converting the i x i matrix from LT to identity, then back solve a much sparser matrix and then reapply those same operations to get back to an identity matrix.

This effectively combines 3rd, 4th, and 5th phases as you surmised but without requiring the X matrix at all.

cberner commented 3 years ago

Got it, right. Thanks for explaining that! Hopefully I'll have time, this weekend, to try implementing this and see how much it improves performance :)

cberner commented 3 years ago

@sleepybishop, I started on this today, but after reading through the RFC & my implementation again I'm not sure how to replay the required operations from the plan to create a sparse U upper instead of matrix multiplication. The operations in the plan include row add operations, whereas X is constructed only from swapping rows & cols. Can you explain how you do that step?

sleepybishop commented 3 years ago

Sure, after "First Phase", the i x i matrix is X and was created purely from row and column swaps.

From here on out each operation that eliminates a column in the LT part of X is stored in the plan as a row add operation, only the columns within U need to actually be computed as the end result is that X will be an identity matrix.
After all columns in the LT part of X are gone, mark the position in the list of operations in the plan and continue solving for I_u.
Now U_upper needs to zeroed out but the previous process has created a lot of fill-in, the rfc says multiply A * X which returns A back to the state it was after First Phase, but that is equivalent to undoing the the operations in the plan previously computed.
Reapply the each operation from the marked position down to the first operation, (eg. if a row[a] was added to row[b], then add row[a] to row[b] again) since row[a] ^ row[b] ^ row[a] = row[b], but again only make the changes in the columns in U because the columns in X are not required, now U_upper is back to the state it was is after First Phase, sparse, but I_u remains and backsolving will be much faster. Note that when replaying an operation, it is recorded into the plan again.
After U_upper is zeroed out, 'X' needs to be returned to the identity matrix but the steps to do that are already in the plan so play them again from beginning to the marked position, with the difference being that U will not change since all rows in U_upper are zero. So all that's being accomplished is adding more operations to the plan.

These steps avoid the need to track changes to the X matrix and a having to multiply A*X while resulting in the same steps applied to U. Hopefully it makes a bit more sense now.

cberner commented 3 years ago

Ah yes, I think I understand now. Thanks for explaining it in detail! I'll try it out

cberner commented 3 years ago

Just implemented this and got a nice ~15% speedup: https://github.com/cberner/raptorq/pull/87 . Thanks for the help!

Do you know of any tricks for optimizing the graph selection for r=2 in the first phase? I've already optimized the data structures quite heavily, but that's still the most expensive processing step. I was thinking maybe I could cache the results, but the elimination process of the first phase seems to frequently create new rows with r=2 which invalidates the cached graph.

sleepybishop commented 3 years ago

Just implemented this and got a nice ~15% speedup: #87 . Thanks for the help!

Good Work!

Do you know of any tricks for optimizing the graph selection for r=2 in the first phase? I've already optimized the data structures quite heavily, but that's still the most expensive processing step. I was thinking maybe I could cache the results, but the elimination process of the first phase seems to frequently create new rows with r=2 which invalidates the cached graph.

One optimization is to use the transpose of A as an index to find the rows that need the count of ones adjusted when a column is moved into U. This can be reused in Fourth Phase to speed up backsolving.

Beyond that, it is possible to prune and grow the graph as needed without recomputing it, however graph maintenance is still relatively expensive.

cberner commented 3 years ago

Ah yep, I already keep an index of values by column to speed up those adjustments.

Ya, I already tried doing graph maintenance for each operation, but that turned out to be slower than just recomputing for each r=2 selection.

cberner commented 3 years ago

@sleepybishop one other question, do you know if implementing this part of the RFC: Furthermore, the row operations required for the HDPC rows may be performed for all such rows in one process, by using the algorithm described in Section 5.3.3.3 significantly improves efficiency relative to computing the matrix product MT * GAMMA and materializing the HDPC rows?

After the latest round of optimizations, essentially all of encoding time is spent in three areas, the last of which can't really be optimized further: 1) Computing MT * GAMMA to generate the HDPC rows 2) First phase, primarily constructing the graph for r=2 selection and then sparse row elimination 3) Performing the symbol additions to generate the actual intermediate symbols (this is already heavily optimized with SIMD)

sleepybishop commented 3 years ago

yes, computing G_HDPC via MT * GAMMA is costly both in memory and cpu because GAMMA is a rather large matrix. So computing it piecemeal is much faster, furthermore when decoding if you have enough repair rows you can avoid computing G_HDPC entirely.

Given enough overhead during decoding (eg. rank(gf2_matrix) == L), inversion can stay in gf2 and avoid expensive hdpc operations completely.

cberner commented 3 years ago

Cool, thanks. Ya, I'd like to implement that decoding optimization at some point. Right now I'm trying to finish up optimizing the encode path

sleepybishop commented 3 years ago

First phase, primarily constructing the graph for r=2 selection and then sparse row elimination

the rfc says the intermediate symbols can be calculated via C = (A^^-1)*D and provides the steps detailed in the phases as a relatively efficient means of achieving that. If you are OK with not following the steps in First Phase to the letter, simpler pivoting strategies like Markowitz pivoting can be used at the expense of some extra row xor ops during back solving.

cberner commented 3 years ago

I'm not familiar with Markowitz pivoting, do you have any pointers to how it works? And is the tradeoff that the plan compilation is faster, but the resulting plan will have more symbol ops?

sleepybishop commented 3 years ago

I'm not familiar with Markowitz pivoting, do you have any pointers to how it works?

Sure, see this paper, page 52/section 3.2.

And is the tradeoff that the plan compilation is faster, but the resulting plan will have more symbol ops?

Yep, that's the idea, a suboptimal strategy will create more fill-in, so it's a tradeoff between the performance of graph methods and the performace of xor row ops, which you've already optimized heavily with SIMD.

cberner commented 3 years ago

Cool, thanks! I may give that a try. I'm still thinking through some potential data structure changes to speed up the graph calculation. I feel like there might be a significant speedup to be had, but am not sure yet

sleepybishop commented 3 years ago

With graph methods the thing to exploit is that typically one component will be really large compared to the rest.

cberner commented 3 years ago

Yep, I already leverage that by selecting an edge from the longest cycle during the r=2 selection step

cberner commented 3 years ago

I optimized the graph substep using a union-find data structure: https://github.com/cberner/raptorq/pull/95 It's now a much smaller fraction of the time

sleepybishop commented 3 years ago

Cool, did you get a significant boost in encoding throughput? What's the largest bottleneck now?

cberner commented 3 years ago

Ya, about 10% on large symbol counts. I just released it in 1.6.2. It's spread out over a fair number of things now, but the sparse FMAs are the largest one. Another large piece is generating the HDPC rows of the constraint matrix. Here's a profile: flamegraph.zip

sleepybishop commented 3 years ago

What do you mean sparse FMA's? I thought you had refactored to only use dense ops on U matrix?

Optimizing the generation of G_HDPC is straight forward but you have to do it piece by piece instead of via matrix multiplication because GAMMA matrix is huge: K' + S x K' + S.

Use Section 5.7.2 from the rfc as a reference.

Start with an empty G_HDPC matrix of size H x K' + S
For each row i, set the last column to alpha ^^ i
Starting from the second to last column j, set each column in every row to alpha * j+1
- Now you just have to recover what the multiplication with matrix MT would have done, namely add 1 to G_HDPC[i,j] where i = Rand[j+1,6,H] and i = (Rand[j+1,6,H] + Rand[j+1,7,H-1] + 1) % H from section 5.3.3.3 of the rfc.

You can work this out yourself by working backwards from a G_HDPC matrix generated via MTxGAMMA.

cberner commented 3 years ago

I removed most of the sparse additions, but the ones in phase 1 are still there. They were a relatively small amount of time before. They're next up on my list of things to optimize now that the graph data structure is more efficient.

Ah thanks for the tip on generating the HDPC matrix! I'll give that a go

cberner commented 3 years ago

Ah, actually the sparse additions in phase 1 weren't taking much time. I removed them and it improved throughput by ~2%. The profile shows 4.1% in SparseBinaryMatrix::add_assign_row, but seems that it didn't get the full speedup

cberner commented 3 years ago

I tried the HDPC generation optimization you suggested, but wasn't able to make it work. Perhaps I implemented it wrong? https://github.com/cberner/raptorq/pull/101

For this part "set each column in every row to alpha * j+1", do you mean alpha ^ (j + 1) or alpha * (j + 1)? It seems to me that it should actually be alpha ^ (i + j) because the last column of MT is alpha ^ i

I wasn't able to re-derive this part either: "namely add 1 to G_HDPC[i,j]". It seems like you need to add the j'th row in GAMMA to the i'th row of G_HDPC

sleepybishop commented 3 years ago

You're almost there, the order of operations if off. the first pass is simply:

for i in 0..H {
  result[i][Kprime + S - 1] = Octet::alpha(i).byte();
}

because the remaining steps depend on the last column.

For this part "set each column in every row to alpha j+1", do you mean alpha ^ (j + 1) or alpha (j + 1)? It seems to me that it should actually be alpha ^ (i + j) because the last column of MT is alpha ^ i

it's alpha * j + 1, If you look at a generated GAMMA the values double every index.

So then it's just a matter of shuffling your logic around a bit:


for j in (0..=(Kprime + S - 2)).rev() { // edited <- loop should count down
  // fill in column j
  for i in 0..H {
    result[i][j] = (Octet::alpha(1) * Octet::new(result[i][j+1])).byte()
  }
  // recover the changes by multiplication with MT
  let rand6 = rand((j + 1) as u32, 6u32, H as u32) as usize;
  let rand7 = rand((j + 1) as u32, 7u32, (H - 1) as u32) as usize;
  let i1 = rand6;
  let i2 = (rand6 + rand7 + 1) % H;
  result[i1][j] ^= 1;
  result[i2][j] ^= 1;
}

cberner commented 3 years ago

Oh yes, of course, compute it recursively, very clever! Thanks, I'm adding this in

cberner commented 3 years ago

Nice. Another ~10% speedup from that HDPC optimization :)

Looks like the next thing I need to do is revisit some of the matrix methods. Things like iterating through a column of values and swapping column indices are some of the most expensive ops now

jblestang commented 5 months ago

Any reason to not merge the current performance speedup?

cberner commented 5 months ago

It's already released. Not sure which you're referring to?

Sent from my phone

On Sun, Feb 25, 2024, 4:39 AM jblestang @.***> wrote:

Any reason to not merge the current performance speedup?

— Reply to this email directly, view it on GitHub https://github.com/cberner/raptorq/issues/84#issuecomment-1962925496, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGNXQHTB5HZBF6ICUWODETYVMWITAVCNFSM4TVHO4S2U5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TCOJWGI4TENJUHE3A . You are receiving this because you commented.Message ID: @.***>

jblestang commented 5 months ago

The issue is still opened and I was wondering why.

cberner / raptorq

inversion performance improvements #84

Some ideas to improve performance of precode matrix inversion: