firedrakeproject / firedrake

Firedrake is an automated system for the portable solution of partial differential equations using the finite element method (FEM)
https://firedrakeproject.org

How to measure geometric multigrid construction overhead #1976

Closed · jiangzhongshi closed this 3 years ago

jiangzhongshi commented 3 years ago

Dear Firedrakers,

I am working on a geometric coarsening hierarchy algorithm and found your library to be a great place to test the performance of my approach. I really like the Pythonic interface and would like to use Firedrake to iterate on my development, but I am a little worried about how much overhead it incurs and whether it would bias my estimate of the real performance. I have a few questions, and it would be great if I could get some insight, within the context of geometric multigrid.

  1. How should I measure the overhead of Firedrake (for example in the geometric multigrid, V-cycle Poisson tutorial), and what is the typical overhead in these problems (compared to a pure C implementation with PETSc)? I have managed to use the PETSc logging to get a summary; is there a way to filter out some of the events (like the par_loop_* ones)? (See the sketch after this list.)
  2. From the trace of the program, it seems that there are frequent callbacks into Python from PETSc; how does this affect performance?
  3. This might be a long shot, but is there a way to get a compiler-like behavior, in order to estimate the true performance?
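For reference, a minimal sketch of how the PETSc log summary can be split up from Firedrake, assuming a standard Firedrake installation with petsc4py; the mesh sizes, function space, and solver options are illustrative, not taken from the tutorial. Wrapping the solve in a named PETSc.Log.Stage makes the PETSc.Log.view() (-log_view) summary report the solve's events separately from the setup and assembly events (such as the par_loop_* ones):

```python
# Hedged sketch: per-stage PETSc logging from Firedrake.  The problem setup
# below is purely illustrative; the Log.Stage mechanism is the point.
from firedrake import *
from firedrake.petsc import PETSc

PETSc.Log.begin()  # start PETSc event logging

hierarchy = MeshHierarchy(UnitSquareMesh(8, 8), 3)  # geometric hierarchy
V = FunctionSpace(hierarchy[-1], "CG", 1)           # finest level
u, v = TrialFunction(V), TestFunction(V)
a = dot(grad(u), grad(v)) * dx
L = Constant(1.0) * v * dx
uh = Function(V)

stage = PETSc.Log.Stage("V-cycle solve")
stage.push()
solve(a == L, uh, bcs=DirichletBC(V, 0, "on_boundary"),
      solver_parameters={"ksp_type": "cg", "pc_type": "mg"})
stage.pop()

PETSc.Log.view()  # per-stage summary: solve events vs everything else
```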

I understand some of my questions are a bit too broad, but any suggestions will be very appreciated!

Zhongshi

wence- commented 3 years ago

@connorjward has been working on some of this performance stuff and may be able to comment. One thing he has added (nearly available) is a flamegraph view of PETSc logging data. This will make it possible to see how the time breaks down inside the outermost solve.

In terms of performance in general: you can run your Firedrake program with a sampling profiler like py-spy to see how it behaves. There is some overhead from being in Python, but it is mostly an affine (roughly fixed) cost, since we try to make sure that all the heavy work is devolved to compiled code. So if your problem is quite large, then most of the time will be spent in compute rather than in cross-calling.
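As a concrete, purely illustrative sketch of that workflow (not the poster's script): the snippet below times the same geometric-multigrid solve at a few refinement levels, so the fixed Python overhead can be seen shrinking as a fraction of the total; the py-spy invocation in the comment uses py-spy's standard record flags.

```python
# Illustrative sketch: as the problem grows, the roughly fixed Python
# overhead becomes a smaller fraction of the runtime.  To see where the
# Python time goes, run this file under py-spy, e.g.
#   py-spy record --native -o profile.svg -- python this_script.py
import time
from firedrake import *

for refinements in (2, 4, 6):
    hierarchy = MeshHierarchy(UnitSquareMesh(8, 8), refinements)
    V = FunctionSpace(hierarchy[-1], "CG", 1)
    u, v = TrialFunction(V), TestFunction(V)
    a = dot(grad(u), grad(v)) * dx
    L = Constant(1.0) * v * dx
    uh = Function(V)
    start = time.perf_counter()
    solve(a == L, uh, bcs=DirichletBC(V, 0, "on_boundary"),
          solver_parameters={"ksp_type": "cg", "pc_type": "mg"})
    print(f"{V.dim()} DoFs: {time.perf_counter() - start:.3f} s")
```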

Thanks,

jiangzhongshi commented 3 years ago

Dear wence-,

Thanks so much for the quick response and suggestions.

I have tried py-spy (without the native option, since it fails to merge frames). From the profile of a simple V-cycle example I have (I can provide the script if it helps), it seems that fine_node_to_coarse_node_map is taking a significant portion of the time (31%). Is this expected (in the sense that it would be significantly reduced in a C-based implementation), or is it an artifact of me not being able to enable the native option in py-spy?

Zhongshi

wence- commented 3 years ago

it seems fine_node_to_coarse_node_map is taking a significant portion of the time

This is (or should be) a one-time setup cost, incurred once per multigrid solver instance. It does some array manipulation, but most of it happens in C anyway. If you run your solver in a loop, do you see this proportion of time falling?
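A sketch of that experiment, with an assumed Poisson setup rather than the poster's actual script: build the multigrid solver once and call it repeatedly, timing each solve. The first solve should carry the setup cost (including fine_node_to_coarse_node_map); subsequent solves should be much cheaper.

```python
# Illustrative sketch: reuse one multigrid solver so the one-time setup
# (transfer maps, coarse operators, ...) is amortised over many solves.
import time
from firedrake import *

hierarchy = MeshHierarchy(UnitSquareMesh(8, 8), 4)
V = FunctionSpace(hierarchy[-1], "CG", 1)
u, v = TrialFunction(V), TestFunction(V)
f = Function(V).assign(1.0)
a = dot(grad(u), grad(v)) * dx
L = f * v * dx
uh = Function(V)

problem = LinearVariationalProblem(a, L, uh,
                                   bcs=[DirichletBC(V, 0, "on_boundary")])
solver = LinearVariationalSolver(problem,
                                 solver_parameters={"ksp_type": "cg",
                                                    "pc_type": "mg"})

for i in range(5):
    uh.assign(0)
    start = time.perf_counter()
    solver.solve()
    print(f"solve {i}: {time.perf_counter() - start:.3f} s")
# Expect solve 0 to be noticeably slower than the rest: it includes the
# multigrid setup, which the later solves reuse.
```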

connorjward commented 3 years ago

Apologies for the delay in responding to this. @wence has already done a really good job of explaining most of what I would say.

At present I have found py-spy to be the best tool for this sort of thing. I do recommend using the --native flag if you can, though, because it gives a lot more insight into what is happening inside PETSc. Without it enabled you will see some large blank areas in the flame graph.

Measuring the Python overhead, including in the callbacks, can be difficult because it is very problem-dependent. For problems with a large number of DoFs these overheads should be minimal. If you find that you are spending large amounts of time executing Python instead of C, then there are likely ways to speed it up (e.g. by making sure that you are not creating a brand-new solver object every timestep), as in the sketch below.
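To make the last point concrete, here is a sketch (with an assumed implicit heat-equation step and illustrative solver options, not code from this thread) of the recommended pattern: create the solver once outside the timestepping loop and reuse it, rather than calling solve() afresh every step.

```python
# Hedged sketch of "don't create a brand-new solver object every timestep".
from firedrake import *

mesh = UnitSquareMesh(64, 64)
V = FunctionSpace(mesh, "CG", 1)
u, v = TrialFunction(V), TestFunction(V)
u_n = Function(V)        # solution at the previous timestep
u_np1 = Function(V)      # solution at the new timestep
dt = Constant(0.01)

# Backward-Euler step for the heat equation (illustrative physics only).
a = (u * v + dt * dot(grad(u), grad(v))) * dx
L = u_n * v * dx
bcs = [DirichletBC(V, 0, "on_boundary")]

# Preferred: build the solver once, outside the loop; the PETSc objects and
# generated code are then set up a single time and reused.
problem = LinearVariationalProblem(a, L, u_np1, bcs=bcs)
solver = LinearVariationalSolver(problem,
                                 solver_parameters={"ksp_type": "cg",
                                                    "pc_type": "gamg"})
for step in range(100):
    solver.solve()       # RHS is reassembled, but the solver is reused
    u_n.assign(u_np1)

# Avoid: calling solve(a == L, u_np1, bcs=bcs, ...) inside the loop, which
# builds a brand-new solver (and pays its setup cost) at every timestep.
```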

jiangzhongshi commented 3 years ago

Dear Lawrence and Connor,

Thanks again for the suggestion!

If you run your solver in a loop, do you see this proportion of time falling?

It seems that, in terms of wall-clock time, subsequent runs are indeed much faster than the first run. The cost seems to be associated with the solver instance; what are the possible ways to amortize it? For example, can the cache be re-used when changing to a different set of boundary conditions, or as the same preconditioner for different physics?

I do recommend using the --native flag

Yes, the profiling seems to be helpful, but I always encounter Failed to merge native and python frames. It seems that this is a known issue on the py-spy side. In the context of Firedrake, do you have some way to circumvent it, for example by passing some flag to the PETSc compilation?

Best, Zhongshi

connorjward commented 3 years ago

Yes, the profiling seems to be helpful, but I always encounter Failed to merge native and python frames

Sorry, I haven't encountered that before and I don't know a fix. Hopefully we should have an alternative tool available soon, pending some merges into PETSc.

jiangzhongshi commented 3 years ago

Noted, thanks so much for the help!