brendanjmeade / celeri

Next generation earthquake cycle kinematics
BSD 3-Clause "New" or "Revised" License

Playing with hmatrix approximation of tde_matrix. Also sparsity. #64

Closed tbenthompson closed 2 years ago

tbenthompson commented 2 years ago

It works just fine for approximation of the TDE matrix. I just did the minimal test to see what the compression ratio would be. For a tolerance of 1e-4, we get a 5x compression, or, with a tolerance of 1e-6, a 3x compression. These aren't very impressive numbers, but the Cascadia example here is not the best demonstration since it's not very large. The linear scaling means that for a problem with 10x as many source triangles and 10x as many observation points, you'll see ~30-50x compression with the H-matrix implementation.
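
As a rough back-of-the-envelope on that scaling claim (illustrative arithmetic with made-up sizes, not measurements from this thread):

# Dense entry count grows with the product of sources and observations,
# while H-matrix storage grows roughly linearly in their sum, so the
# compression ratio itself should improve ~10x for a 10x/10x problem.
n_src, n_obs = 1_000, 2_000              # hypothetical Cascadia-scale sizes
dense_small = (3 * n_obs) * (3 * n_src)  # 3 displacement x 3 slip components
hmat_small = dense_small / 5             # ~5x compression at tol=1e-4
dense_big = 100 * dense_small            # 10x sources and 10x observations
hmat_big = 10 * hmat_small               # near-linear growth in stored entries
print(dense_big / hmat_big)              # ~50x compression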

But this actually has me wondering whether your TDE matrices are dense in the first place. Are you calculating TDE influence for every observation even in the global model? It seems like the spherical/projection error would dominate any displacement numbers once you got more than a few thousand km from the source. It seems fine to just assume those coefficients are exactly zero. So, actually, the TDE component for a global block model should be naturally sparse and you can probably accelerate computations with it a huge amount without doing any H-matrix stuff at all. Have you thought about this at all? In order to take advantage of this sparsity, it would probably be necessary to use iterative least squares solvers, but that's easily available with scipy.
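
To make the sparsity idea concrete, here is a minimal sketch of a distance-cutoff sparsification (hypothetical function and array names, not celeri code):

import numpy as np
import scipy.sparse

def sparsify_by_distance(tde_matrix, src_centers, obs_points, cutoff_km):
    # Zero the influence coefficients for observation/source pairs farther
    # apart than cutoff_km, then store the result in a sparse format.
    # tde_matrix: (n_obs, n_src); src_centers: (n_src, 3); obs_points: (n_obs, 3)
    dist = np.linalg.norm(obs_points[:, None, :] - src_centers[None, :, :], axis=2)
    return scipy.sparse.csr_matrix(np.where(dist <= cutoff_km, tde_matrix, 0.0))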

brendanjmeade commented 2 years ago

It works just fine for approximation of the TDE matrix. I just did the minimal test to see what the compression ratio would be. For a tolerance of 1e-4, we get a 5x compression, or, with a tolerance of 1e-6, a 3x compression. These aren't very impressive numbers, but the Cascadia example here is not the best demonstration since it's not very large. The linear scaling means that for a problem with 10x as many source triangles and 10x as many observation points, you'll see ~30-50x compression with the H-matrix implementation.

I can't believe you got this implemented so quickly and it works! This is super interesting and I'll try it out tomorrow. The estimates for compression are really important, as we're often memory limited for larger problems.

But this actually has me wondering whether your TDE matrices are dense in the first place. Are you calculating TDE influence for every observation even in the global model? It seems like the spherical/projection error would dominate any displacement numbers once you got more than a few thousand km from the source. It seems fine to just assume those coefficients are exactly zero. So, actually, the TDE component for a global block model should be naturally sparse and you can probably accelerate computations with it a huge amount without doing any H-matrix stuff at all. Have you thought about this at all?

You are right that we do all TDE to station combinations, even when they are on the other side of the Earth. There may not be a need for that at the mm/yr level. Jack and I did discuss this years back but struggled to come up with a method for something like a distance cutoff. This was complicated because it's not just TDE to station distance that matters, but also the slip rate. 500 km might be reasonable for a slowly slipping area, but not for a fast slipping one. We never got past that part.

In order to take advantage of this sparsity, it would probably be necessary to use iterative least squares solvers, but that's easily available with scipy.

While I'm still not certain about how to do a distance cutoff thoughtfully, I should experiment with an iterative solver for the block model solve. That'll be interesting to try because the full block model has TDEs, block rotations, slip rate constraints, block rotation constraints and, in general, there are way more TDEs than anything else. Maybe I can just start with https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.gmres.html which should work for the dense case that I have right now?

Brendan

tbenthompson commented 2 years ago

I can't believe you got this implemented so quickly and it works! I took a lot of shortcuts: 1) I didn't do an actual inversion, but I have no reason to think it wouldn't work. Probably another 20 lines of code or so to set up an iterative linear solve. 2) I didn't do adaptive cross approximation (ACA) and instead just constructed the H-matrix blocks using the SVD. ACA allows constructing the H-matrix without having computed all the entries already.
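
For reference, the SVD shortcut can be this small (a sketch assuming a precomputed dense far-field block; not the notebook's exact code):

import numpy as np

def compress_block(block, tol):
    # Truncated SVD of one far-field block: keep singular values above
    # tol relative to the largest, then store two skinny factors.
    U, S, Vt = np.linalg.svd(block, full_matrices=False)
    rank = max(int(np.sum(S > tol * S[0])), 1)
    return U[:, :rank] * S[:rank], Vt[:rank]

# A matrix-vector product with the compressed block is two skinny matmuls:
# US, Vt = compress_block(block, 1e-4); y = US @ (Vt @ x)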

Also, I had written this page before which is basically exactly what's needed: https://tbenthompson.com/book/tdes/hmatrix.html

500 km might be reasonable for a slowly slipping area, but not for a fast slipping one. We never got past that part.

Hmm, it seems like at a bare minimum, you could choose a cutoff distance that would be appropriate for the fast slipping parts of the world: 5000km? Then, it would apply everywhere else too and just be conservative in the slow slipping regions. I wouldn't be surprised if that still cuts your total matrix entry count by 5x or 10x.

With BIE/BEM methods, we could do a comparison to see the threshold distance where the half-space approximation is no longer valid. Because I'm guessing that the distance where the half-space numbers are nonsense is much closer than the distance where the half-space numbers become super small (1000 vs 5000 km?).

There's also going to be a threshold at which the constant material properties assumption is going to be violated. Once you get 1000+ km away, the relevant zone of elastic crust will also be 1000+ km deep. I don't know how much this would matter.

The combination of incorrect material properties and incorrect curvature has me guessing that something around 2000 km would be the right number. But as I mentioned, this is testable.

Another question here is how much global block model inversion results change as a result of changing this cut-off threshold. How sensitive are the final results to far-field influence coefficients?

brendanjmeade commented 2 years ago

I didn't do an actual inversion, but I have no reason to think it wouldn't work. Probably another 20 lines of code or so to set up an iterative linear solve.

Ok, I'm really getting interested in the big idea of just using sparse representations and H-matrix approaches for the whole problem! This would be much more natural in so many ways, especially with TDE boundary conditions. So a few questions:

1 - I'm very confident about H-matrices and iterative solvers for the TDE-only problem, but I really want to test an iterative solver on the block model problem as it currently stands, with a dense TDE matrix and sparse matrices for slip rate constraints, etc. Do you have a recommended solver to try for this? I tried GMRES last night but realized it needs square matrices, and the one that we have is over-determined.

2 - Is it hard to do the H-matrix solve?

3 - Can I merge this commit so that it's easy for me to explore this?

I didn't do adaptive cross approximation (ACA) and instead just constructed the H-matrix blocks using the SVD. ACA allows constructing the H-matrix without having computed all the entries already.

ACA seems particularly relevant for the blocks problem because we never want to store the big linear operator in memory.

Also, I had written this page before which is basically exactly what's needed: https://tbenthompson.com/book/tdes/hmatrix.html

Is the call to approx_block(approx[0][0], approx[0][1], 0.01, 1e-4, verbose=True) where the model solve happens?

Hmm, it seems like at a bare minimum, you could choose a cutoff distance that would be appropriate for the fast slipping parts of the world: 5000km? Then, it would apply everywhere else too and just be conservative in the slow slipping regions. I wouldn't be surprised if that still cuts your total matrix entry count by 5x or 10x.

With BIE/BEM methods, we could do a comparison to see the threshold distance where the half-space approximation is no longer valid. Because I'm guessing that the distance where the half-space numbers are nonsense is much closer than the distance where the half-space numbers become super small (1000 vs 5000 km?).

There's also going to be a threshold at which the constant material properties assumption is going to be violated. Once you get 1000+ km away, the relevant zone of elastic crust will also be 1000+ km deep. I don't know how much this would matter.

The combination of incorrect material properties and incorrect curvature has me guessing that something around 2000 km would be the right number. But as I mentioned, this is testable.

I totally agree with all of this. I think at a practical level we could start with 1000-2000 km. The projections start to get bad after that, and your point about material properties is very well taken.

Another question here is how much global block model inversion results change as a result of changing this cut-off threshold. How sensitive are the final results to far-field influence coefficients?

Definitely testable even with just the Cascadia model!

tbenthompson commented 2 years ago

Ok, I'm really getting interested in the big idea of just using sparse representations and H-matrix approaches for the whole problem!

Awesome! I'm happy to help out though I've been getting a bit busier lately.

Do you have a recommended solver to try for this?

Yeah, I'd recommend lsmr or lsqr: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.lsmr.html
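
Minimal usage looks something like this (toy data; lsmr accepts dense arrays, scipy sparse matrices, or LinearOperators):

import numpy as np
import scipy.sparse.linalg

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 80))     # overdetermined: more rows than columns
b = rng.standard_normal(500)
x = scipy.sparse.linalg.lsmr(A, b)[0]  # first return value is the least-squares solution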

2 - Is it hard to do the H-matrix solve?

If we're doing an iterative solve, then no, it's not hard to do an H-matrix solve. A direct solve is hard, but it's something that I'd love to implement!

3 - Can I merge this commit so that it's easy for me to explore this?

Sure! I think it'll mess up your notebook a little bit but you can recover it.

ACA seems particularly relevant for the blocks problem because we never want to store the big linear operator in memory.

It's actually more of a speed than a memory thing. Even without ACA, you can construct each far-field block and then do an SVD on it and then throw away the original far-field block. So, there's only one dense block in memory at a time which is probably 100kb or something.

Is the call to approx_block(approx[0][0], approx[0][1], 0.01, 1e-4, verbose=True) where the model solve happens?

That is the step that constructs the far-field blocks. It's using ACA to construct approximate far-field blocks without constructing the original matrix. If you're reading through and have time, I'd love if you noted down the more confusing parts. That section is not the best writing/explanation and a few parts seem underexplained. There's also this section further down, 5.7, that uses cutde instead to compute the ACA approximate far-field blocks much faster: https://tbenthompson.com/book/tdes/hmatrix.html#faster-approximate-blocks-on-gpus-with-cutde

I think the initial list of steps is useful though. The approx_blocks call is step 4. "Solving" would be step 5, probably using an iterative method.

Copied from the intro on that page: To build an H-matrix implementation, we need to:

  1. build a tree structure in order to determine which groups of elements are far away from each other.
  2. traverse that tree to split the matrix blocks into near-field and far-field blocks.
  3. compute the exact matrix for each near-field block.
  4. compute the approximate matrix using the ACA algorithm for each far-field block.
  5. perform fast matrix-vector products with both the near-field exact matrices and the far-field approximate matrices (step 5 is sketched just below).
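
As a sketch of what step 5 boils down to (hypothetical data layout and names, not the notebook's actual structures):

import numpy as np

def hmatrix_matvec(n_rows, near_blocks, far_blocks, x):
    # near_blocks: (row_idx, col_idx, dense_block) triples, computed exactly.
    # far_blocks: (row_idx, col_idx, U, V) tuples with block ~ U @ V from ACA.
    y = np.zeros(n_rows)
    for rows, cols, block in near_blocks:
        y[rows] += block @ x[cols]
    for rows, cols, U, V in far_blocks:
        y[rows] += U @ (V @ x[cols])  # two skinny products instead of one dense one
    return y
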
brendanjmeade commented 2 years ago

Thanks @tbenthompson! Lots of comments below:

Awesome! I'm happy to help out though I've been getting a bit busier lately.

Much appreciated, and me too with the new semester approaching.

Yeah, I'd recommend lsmr or lsqr: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.lsmr.html

Thanks for the pointer! Good news: if I cast estimation.operator to a sparse CSR matrix and solve with lsmr, it runs and gives me a solution for block motions that looks pretty ok. The thing I still need to figure out is how to do a weighted inversion here, because we always use a weighting matrix. Specifically, for the fully dense weighted case I do:

# Weighted normal equations with diagonal weights stored as a vector:
# state = (X^T W X)^{-1} X^T W y, with covariance (X^T W X)^{-1}
estimation.state_covariance_matrix = np.linalg.inv(estimation.operator.T * estimation.weighting_vector @ estimation.operator)
estimation.state_vector = estimation.state_covariance_matrix @ estimation.operator.T * estimation.weighting_vector @ estimation.data_vector

and for the sparse lsmr case I can do

estimation.state_vector_unweighted_sparse_lsqr = scipy.sparse.linalg.lsqr(csr_matrix(estimation.operator), estimation.data_vector)[0]  # lsqr returns a tuple; the solution is the first element

but this is not weighted, and that causes the TDE contribution to be negligible because the magnitudes of the matrix entries for the TDEs are so much smaller than for the block rotations and slip rate constraints. Our weighting matrix is purely diagonal, and that's why I just use a weighting vector in the dense direct case. I'm betting there's just a pre-multiplication step I can do to make the iterative lsqr/lsmr weighted?

If we're doing an iterative solve, then no, it's not hard to do an H-matrix solve. A direct solve is hard, but it's something that I'd love to implement!

If we can get the iterative solves to be weighted, then maybe we don't need the direct solve? Is there an example of the ~20 lines that are needed to do the solve with the H-matrix approach? That would let us directly see how well it compares with the dense direct solve.

Sure! I think it'll mess up your notebook a little bit but you can recover it.

I merged it and it runs just fine. Seeing it in the TDE context is really helping me to understand what's going on.

It's actually more of a speed than a memory thing. Even without ACA, you can construct each far-field block and then do an SVD on it and then throw away the original far-field block. So, there's only one dense block in memory at a time which is probably 100kb or something.

This sounds like where the magic happens. Would this apply to the smoothing matrices too? Do they even exist in the H-matrix formulation?

That is the step that constructs the far-field blocks. It's using ACA to construct approximate far-field blocks without constructing the original matrix. If you're reading through and have time, I'd love if you noted down the more confusing parts. That section is not the best writing/explanation and a few parts seem underexplained. There's also this section further down, 5.7, that uses cutde instead to compute the ACA approximate far-field blocks much faster: https://tbenthompson.com/book/tdes/hmatrix.html#faster-approximate-blocks-on-gpus-with-cutde

I'll read this and add notes early next week!

I think the initial list of steps is useful though. The approx_blocks call is step 4. "Solving" would be step 5, probably using an iterative method.

So are the approximate far-field entries stored in the tuple block? I'll have to update that variable name because block is already used for the block dataframe!

tbenthompson commented 2 years ago

Our weighting matrix is purely diagonal, and that's why I just use a weighting vector in the dense direct case.

Exactly. You can just multiply the design matrix and the data vector by the weights before running the iterative solve. The wikipedia page on weighted least squares is actually decent: https://en.wikipedia.org/wiki/Weighted_least_squares

[image: weighted least squares equations from the linked Wikipedia page]
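
For reference, this is the standard weighted least squares result (textbook material reconstructed here, not copied from the image):

\hat{\beta} = (X^{\top} W X)^{-1} X^{\top} W y, \qquad W = \operatorname{diag}(w)

With diagonal weights this is exactly the ordinary least squares solution of the rescaled system

X' = \operatorname{diag}(\sqrt{w})\, X, \qquad y' = \operatorname{diag}(\sqrt{w})\, y

so an iterative solver only ever needs X' and y'. Note the square root: premultiplying by \operatorname{diag}(w) itself would solve the X^{\top} W^{2} X system instead.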

Is there an example of the ~20 lines that are needed to do the solve with the H-matrix approach?

I can put this together! I think the best approach here would be for you to create a problem/notebook that is already solving a real block model problem iteratively and then I'll go in and modify the notebook to do the iterative solve via H-matrix. I think that'll be the most useful approach. Then you can play with that and piece together how that's working and I can walk through it with you.

Would this apply to the smoothing matrices too? Do they even exist in the H-matrix formulation?

The H-matrix would only apply to the TDE (and possibly Okada) sub-blocks of the full inversion matrix. The smoothing portion is already sparse right? Even if it's currently being stored as a dense matrix, the underlying data is fundamentally sparse. There's no need to sparsify it.
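
To make that concrete: the iterative solvers only ever touch the operator through forward and transpose products, so the TDE sub-block can stay dense for now and become an H-matrix later without changing anything else. A sketch with made-up names, not celeri's actual layout:

import numpy as np
import scipy.sparse.linalg

def make_composite_operator(tde_mv, tde_rmv, n_tde_rows, sparse_rows):
    # tde_mv / tde_rmv: forward and transpose products for the TDE row block
    # (dense today, H-matrix later); sparse_rows: the stacked sparse rows
    # (smoothing, slip rate constraints, ...).
    shape = (n_tde_rows + sparse_rows.shape[0], sparse_rows.shape[1])

    def matvec(x):
        return np.concatenate([tde_mv(x), sparse_rows @ x])

    def rmatvec(y):
        return tde_rmv(y[:n_tde_rows]) + sparse_rows.T @ y[n_tde_rows:]

    return scipy.sparse.linalg.LinearOperator(shape, matvec=matvec, rmatvec=rmatvec)

Both lsqr and lsmr accept a LinearOperator like this directly, since they only call matvec and rmatvec.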

brendanjmeade commented 2 years ago

Exactly. You can just multiply the design matrix and the data vector by the weights before running the iterative solve. The wikipedia page on weighted least squares is actually decent: https://en.wikipedia.org/wiki/Weighted_least_squares

I have to try it out. The original linear system is:

XB = y

then premultiplying both sides by diag(w), applied row by row, gives:

X'B = y'

Then I can just substitute this into the iterative solver:

estimation.state_vector_weighted_sparse_lsqr = scipy.sparse.linalg.lsqr(X', y')
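
As a runnable version of that premultiplication (toy data and hypothetical names; note the square roots, per the weighted least squares equations quoted above):

import numpy as np
import scipy.sparse
import scipy.sparse.linalg

rng = np.random.default_rng(0)
X = scipy.sparse.random(200, 50, density=0.1, format="csr", random_state=0)
y = rng.standard_normal(200)
w = rng.uniform(0.5, 2.0, size=200)  # diagonal weights, e.g. 1/sigma^2 per datum

# Scale each row of X and each entry of y by sqrt(w); ordinary least squares
# on the rescaled system reproduces the weighted solution.
sqrt_w = np.sqrt(w)
X_weighted = scipy.sparse.diags(sqrt_w) @ X
y_weighted = sqrt_w * y
state_vector = scipy.sparse.linalg.lsqr(X_weighted, y_weighted)[0]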

I can put this together! I think the best approach here would be for you to create a problem/notebook that is already solving a real block model problem iteratively and then I'll go in and modify the notebook to do the iterative solve via H-matrix. I think that'll be the most useful approach. Then you can play with that and piece together how that's working and I can walk through it with you.

I agree that this will be the most useful approach. I'll integrate the decomposition for the TDEs in a real block model notebook. My goal (see below) will be to assemble all the matrices we need as sparse, along with this.

The H-matrix would only apply to the TDE (and possibly Okada) sub-blocks of the full inversion matrix. The smoothing portion is already sparse right? Even if it's currently being stored as a dense matrix, the underlying data is fundamentally sparse. There's no need to sparsify it.

Okada is a good idea too! It's so small it might not be worth it. We'll see! Smoothing is sparse. Slip rate constraints are sparse. Strain rates are sparse. I've got some work to do to assemble this. Would be world changing if this can work!

tbenthompson commented 2 years ago

This all sounds great!

brendanjmeade commented 2 years ago

@tbenthompson I think I've made progress in terms of getting things assembled for the H-matrix block model problem. See celeri_hmatrix.ipynb in commit: https://github.com/brendanjmeade/celeri/commit/4322ebe420e9635b3ec01f216afaf146545d8be2. The bottom few cells detail how the composite block model operator is structured, with nicely named sparse versions of all the relevant matrices. I've also moved over the H-matrix tree construction that you had built previously. I'm very uncertain about whether there should be additional H-matrix construction steps, or if it's all about the solve at this point. Thoughts and suggestions are most welcome!

tbenthompson commented 2 years ago

Great! Sorry for the slow response. I just took a look at the notebook and it looks like it's in good shape for setting up an iterative solver. I think I should take over now, and I'll have time this week to work on this. It seems like the plan should be:

  1. Confirm that the sparse solver without H-matrix is reproducing the dense solver solutions to at least a few digits of precision.
  2. Isolate the TDE block and confirm that the H-matrix matrix-vector product is reproducing the dense matrix-vector product (see the sketch after this list).
  3. Integrate the two components that are now independently working correctly: a) sparse solver and b) H-matrix vector product
  4. Confirm that the final results are giving the same solution as the dense solver.
  5. Benchmark and see how fast this is and identify if there's room to optimize.
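
For step 2, the check can be a self-contained toy (a smooth kernel on well-separated point clouds stands in for a far-field TDE block here; names and sizes are made up):

import numpy as np

rng = np.random.default_rng(0)
src = rng.uniform(0.0, 1.0, size=(300, 2))
obs = rng.uniform(5.0, 6.0, size=(200, 2))  # well separated from the sources
dense = 1.0 / np.linalg.norm(obs[:, None, :] - src[None, :, :], axis=2)

# Low-rank factors standing in for the H-matrix far-field representation.
U, S, Vt = np.linalg.svd(dense, full_matrices=False)
rank = max(int(np.sum(S > 1e-8 * S[0])), 1)
US, Vt = U[:, :rank] * S[:rank], Vt[:rank]

# Compare the compressed and dense matrix-vector products on a random vector.
x = rng.standard_normal(300)
rel_err = np.linalg.norm(US @ (Vt @ x) - dense @ x) / np.linalg.norm(dense @ x)
print(rank, rel_err)  # small rank and an error near the truncation tolerance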

Then, I think the final step is for you to take back over and

  1. Find parts that are confusing where it would be useful for me to clarify or write comments or whatever.
  2. Re-run the same code on a larger model (part or all of the global model?) and see what the performance looks like.

Does this sound reasonable?

brendanjmeade commented 2 years ago

I'm happy to hear that I was able to get the basic pieces in reasonable shape following your guidance!

  1. Confirm that the sparse solver without H-matrix is reproducing the dense solver solutions to at least a few digits of precision.

This seems critical. In the notebook, you can see that I tried this with LSMR and didn't get the same result. The block motion part looks fine, but the TDEs barely move at all. I'm sort of thinking there's some preconditioning that should happen here to improve the condition number? This happens in the code cell immediately below the markdown cell that reads: # Try iterative solve on full system cast to sparse - Runs and block motions look fine but TDEs are way off.

  5. Benchmark and see how fast this is and identify if there's room to optimize.

I'm excited for the H-matrix just for the memory savings with larger problems. I think as we scale up to the Japan model (see below) we'll be able to get a better sense of how it scales with additional TDE meshes and more triangles.

Then, I think the final step is for you to take back over and

Find parts that are confusing where it would be useful for me to clarify or write comments or whatever.

100%!

Re-run the same code on a larger model (part or all of the global model?) and see what the performance looks like. Does this sound reasonable?

Totally agree. I think the second model is likely to be a Japan model where we have 3 subduction zones. That's a very well known model that generalizes the problem a little bit and will allow us to practice with multiple TDE meshes before jumping to the global model.