ceres-solver / ceres-solver

A large scale non-linear optimization library
http://ceres-solver.org/

Add support for GPU based sparse cholesky solvers. #759

Open sandwichmaker opened 2 years ago

sandwichmaker commented 2 years ago

cc: @joydeep-b

kvoronin commented 3 months ago

Hi @sandwichmaker and everyone!

As part of this effort, would you be interested in integrating cuDSS, a GPU-accelerated sparse direct solver? It is in early access but already shows promising results compared to the CPU solvers.

As far as I can see, ceres has a number of options for external solver packages, and I think cuDSS could be one more in that list.

What do you think?

sandwichmaker commented 3 months ago

@kvoronin I am happy to accept a patch adding support for cudss to ceres solver. Two things need to be true though.

  1. Performance numbers that demonstrate good performance on easily available GPUs. Previously, CuSolverSp did not perform well enough, and its API did not allow us to cache the symbolic factorization.
  2. I would like cuDSS to be available in a stable release of CUDA (that is, out of preview); otherwise keeping the build and CI working well becomes a pain.
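
To illustrate why caching the symbolic factorization matters: in a nonlinear least squares solve the sparsity pattern of the linear system typically stays the same from one iteration to the next while the values change, so the pattern analysis only needs to run once. The following is a minimal, self-contained sketch of that idea in plain Python. It is a toy, not the Ceres, CuSolverSp, or cuDSS API; the class name `CachedSparseCholesky` and its methods are hypothetical.

```python
import math


class CachedSparseCholesky:
    """Toy sparse Cholesky that reuses the symbolic analysis across solves."""

    def __init__(self):
        self._cache = {}        # sparsity pattern -> fill pattern of L
        self.symbolic_runs = 0  # how often the symbolic analysis actually ran

    def _symbolic(self, n, entries):
        # below[j] holds the rows i > j where column j of L is nonzero.
        below = [set() for _ in range(n)]
        for i, j in entries:
            if i > j:
                below[j].add(i)
        for j in range(n):
            if below[j]:
                parent = min(below[j])  # elimination-tree parent of column j
                below[parent] |= below[j] - {parent}
        self.symbolic_runs += 1
        return [sorted(rows) for rows in below]

    def factor(self, n, lower):
        """lower: {(i, j): value} with i >= j, for a symmetric positive definite matrix."""
        key = (n, frozenset(lower))  # the pattern only, values are ignored
        if key not in self._cache:
            self._cache[key] = self._symbolic(n, lower)
        below = self._cache[key]

        # Numeric phase: right-looking Cholesky restricted to the cached pattern.
        L = {(j, j): lower[(j, j)] for j in range(n)}
        for j in range(n):
            for i in below[j]:
                L[(i, j)] = lower.get((i, j), 0.0)
        for j in range(n):
            d = math.sqrt(L[(j, j)])
            L[(j, j)] = d
            for i in below[j]:
                L[(i, j)] /= d
            for a, k in enumerate(below[j]):
                for i in below[j][a:]:
                    L[(i, k)] -= L[(i, j)] * L[(k, j)]
        return L
```

Calling `factor` twice with matrices that share a pattern but differ in values runs the symbolic phase only once. A solver API that exposes analysis and numeric factorization as separate steps allows this kind of reuse.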

Would you be willing to prototype and benchmark the performance?

kvoronin commented 2 months ago

@sandwichmaker thanks for the reply!

Yes, we're willing to work on a patch. The requirements you listed above are very reasonable. (One thing to note: cuDSS might not become part of the CUDA Toolkit, because in one possible future cuDSS becomes a distributed library, and the CUDA Toolkit does not include distributed libraries as of now. We definitely plan to get to a stable 1.0 release in any case.)

We'll work on a prototype and, once it is in good shape, come back to you with questions about the right way to benchmark performance (I assume it will be ceres + cuDSS vs. ceres + default/other options on ceres-specific use cases).

Meanwhile, would you be interested in learning more about cuDSS, or in telling us more about ceres in a meeting? We have read the available information and find ceres an attractive target for integration, but we are curious to learn more: where ceres is mostly used, what the most important use cases for direct solvers within ceres are (say, who the target audience would be if we promote cuDSS as an option in ceres), what the general perspective on GPU acceleration of ceres is, and so on.

If so, we can schedule a meeting to talk about these things offline if you tell us what's a good email address for you.

sandwichmaker commented 2 months ago

I would be happy to talk about these issues. You can email me at sandwichmaker@gmail.com to schedule some time to converse about it.

S-o-T commented 2 weeks ago

I was wondering whether I should spend some of my holiday time integrating cuDSS into ceres-solver (with the downstream goal of speeding up colmap's BA). @kvoronin @sandwichmaker could you please share whether any encouraging benchmark results have been achieved, and whether integration work is in progress or can be expected upstream in the foreseeable future?

sandwichmaker commented 2 weeks ago

@S-o-T that would be fantastic. I do not have any benchmark numbers myself, but @kvoronin and others at nvidia should have some.

kvoronin commented 1 week ago

Hi @S-o-T and @sandwichmaker! We have a PoC of an integration patch, but for now it remains internal. The intention to make it public is as strong as ever, though.

In terms of when, it depends on what counts as the "foreseeable future". I hope we will be able to open an MR within a month, but there are some formalities to satisfy first, and it will be mostly for review purposes: the final integration, per my understanding of @sandwichmaker's words, should only happen when cuDSS 1.0.0 appears (which is in the works but definitely will not appear within a month).

In terms of performance, we have internal data showing ceres + cuDSS to be considerably (~3x) faster than ceres + SuiteSparse for normal Cholesky (especially in the actual linear solve), but for the Schur algorithm there are unexpected slowdowns in the parts outside the solver (e.g., Schur elimination) which make the speedup low. We will look into the details; I believe this can be fixed. And of course, we will need to do more extensive testing before making any confident claims about performance.
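
For readers unfamiliar with the term, Schur elimination reduces a block system with "camera" and "point" unknowns to a smaller system in the camera unknowns only, which is then handed to the linear solver; the point unknowns are recovered by back-substitution. The sketch below shows the arithmetic on a tiny dense example with two scalar camera unknowns and a diagonal point block. It is an illustration under those assumptions, not Ceres's block-sparse implementation, and `schur_solve` is a hypothetical name.

```python
def schur_solve(B, E, C_diag, g_c, g_p):
    """Solve [[B, E], [E^T, diag(C_diag)]] [x_c; x_p] = [g_c; g_p].

    B is 2x2 (camera block), E is 2 x n_p (coupling), C_diag has n_p entries
    (point block, diagonal so it is trivial to invert).
    """
    n_p = len(C_diag)

    # Reduced camera system: S = B - E C^{-1} E^T, r = g_c - E C^{-1} g_p.
    S = [[B[i][j] - sum(E[i][k] * E[j][k] / C_diag[k] for k in range(n_p))
          for j in range(2)] for i in range(2)]
    r = [g_c[i] - sum(E[i][k] * g_p[k] / C_diag[k] for k in range(n_p))
         for i in range(2)]

    # Solve the 2x2 reduced system; a real solver would factor S instead.
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    x_c = [(r[0] * S[1][1] - S[0][1] * r[1]) / det,
           (S[0][0] * r[1] - S[1][0] * r[0]) / det]

    # Back-substitute for the point unknowns: x_p = C^{-1} (g_p - E^T x_c).
    x_p = [(g_p[k] - sum(E[i][k] * x_c[i] for i in range(2))) / C_diag[k]
           for k in range(n_p)]
    return x_c, x_p
```

Forming `S` and `r` is the work done outside the linear solver, so the time spent there limits how much a faster factorization can improve the total.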

S-o-T commented 1 week ago

That is great!

I hurried a bit and put together such an integration; you can inspect the PR here: https://ceres-solver-review.googlesource.com/c/ceres-solver/+/25800

I used bundle_adjuster to benchmark the performance of the sparse Cholesky backends and got the following results. Overall, they seem to be on par with the numbers that you reported. bal_cudss_suitesparse_eigensparse.pdf

sandwichmaker commented 1 week ago

Thanks @kvoronin and @S-o-T. If this patch becomes the impetus for us to improve the performance of Schur elimination, I would be delighted. @kvoronin would you be willing to take a first look at @S-o-T's change to see if it lines up with how you would do the integration? You are more familiar with cuDSS than I am and have a better idea of canonical usage patterns.

I am happy to look at it too, and if it makes sense I am happy to work with @S-o-T to get it checked in, and then work to improve its performance on the pre-release version of cuDSS. I have no idea when the next stable release of Ceres Solver will be, so I am happy to explore cuDSS and improve its integration and performance in ceres in the interim.

kvoronin commented 1 week ago

Yes, @sandwichmaker, I will compare our local patch with what @S-o-T is suggesting and see if there is anything to change in or add to the patch. Hopefully we will end up with the best of both versions. I'll post an update here once I have reviewed the PR from this perspective.