StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Improve test coverage in Realm #1222

Open apryakhin opened 2 years ago

apryakhin commented 2 years ago

Filing an issue to discuss ways of improving test coverage in Realm. This is an initial post and I will fill it in with more details shortly.

apryakhin commented 2 years ago

The main objective for us is to keep Realm's codebase at the highest possible quality. We need to determine whether it would be beneficial to increase test coverage and to introduce tools that make it easy for the community to write GOOD tests.

Realm already has a number of comprehensive tests that cover the existing features and code paths. However, not everything is tested, and many code paths are exercised only indirectly. This makes it harder to reason about the capabilities of a specific feature, class, function, or line of code. It is also not trivial to extend the existing tests when new code is added or existing code is changed.

apryakhin commented 2 years ago

@lightsighter please let me know what you think. The proposal is pretty high level, but we can add any level of detail here if need be.

lightsighter commented 2 years ago

We need to determine whether adopting an existing C++ unit testing framework would improve our development workflow, e.g. make it faster, easier, and safer.

I'm not completely against this as long as whoever is deciding to implement it is going to support it for the entire Legion/Realm community. I personally think we should start small and just add tests to our existing CI infrastructure. At some point maybe it merits having a more formal test infrastructure, but in our current state I think it will be better just to start writing tests and letting the CI check them. We have a test.py script at the top level of the repo that can run various subsets of tests in the CI. You can also use it to manually run subsets of tests (e.g. all the Realm tests). Until we have some scalability or performance issue with test.py, let's just keep adding to it so we spend our time on writing tests rather than building new infrastructure where our current infrastructure is "good enough". If you'd like to make an argument as to why the current infrastructure is not "good enough", then by all means go ahead and do so.

We need to determine what has already been covered and what still needs to be covered, e.g. to quantify this work.

I think we have some code coverage results of Realm tests already. I will try to dig those up. If we're going to have a metric, then code coverage is the obvious one that comes to my mind. I would be quite satisfied if every line of code were run by at least one test (probably unrealistic since there are a lot of timing-dependent branches in Realm, but if we could get into the 90% range then that would be very good).

We need to determine the future format of tests, e.g. granularity, etc. The proposal would be to cover every existing function in each C++ class and source file.

I think this is a bit unrealistic, especially with a system as big as Realm. I think it would be better to start off by writing a unit test for each class and method in the user-facing API. It will be more directed and more likely to shake out bugs that Realm clients will see. If we finish that and then want to go do the same thing for all internal classes and methods, we can do that, but it's best to start with what the user interacts with first.
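To make the granularity concrete, here is a minimal sketch of what a standalone test for one user-facing class (UserEvent, picked arbitrarily) could look like, in the same spirit as the existing Realm test programs; the task ID, structure, and choice of class are illustrative only, not a prescribed layout:

```cpp
// Minimal sketch: exercise one user-facing Realm class (UserEvent) from a
// top-level task; a failed assertion aborts the process so the CI notices.
#include <cassert>
#include "realm.h"

using namespace Realm;

enum {
  TOP_LEVEL_TASK = Processor::TASK_ID_FIRST_AVAILABLE + 0,
};

static void top_level_task(const void *args, size_t arglen,
                           const void *userdata, size_t userlen, Processor p)
{
  // A user event starts untriggered, and triggering it is observable.
  UserEvent ev = UserEvent::create_user_event();
  assert(!ev.has_triggered());
  ev.trigger();
  ev.wait();
  assert(ev.has_triggered());
}

int main(int argc, char **argv)
{
  Runtime rt;
  rt.init(&argc, &argv);

  rt.register_task(TOP_LEVEL_TASK, top_level_task);

  // Run the checks on the first CPU processor reported by the machine model.
  Processor p = Machine::ProcessorQuery(Machine::get_machine())
                    .only_kind(Processor::LOC_PROC)
                    .first();
  Event done = rt.collective_spawn(p, TOP_LEVEL_TASK, 0, 0);

  // Request shutdown once the test task finishes and propagate its status.
  rt.shutdown(done);
  return rt.wait_for_shutdown();
}
```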

elliottslaughter commented 2 years ago

I agree with all of @lightsighter's points, but just want to add a couple things:

eddy16112 commented 2 years ago

Besides functionality tests, I think it is also good to have performance tests, probably once every week. We sometimes see the performance of some applications drop between different Legion releases; having performance tests could help us catch performance bugs earlier.

elliottslaughter commented 2 years ago

We actually have performance tests. The problems there are all infrastructure related:

eddy16112 commented 2 years ago

We actually have performance tests. The problems there are all infrastructure related:

  • Where do you run the tests? CI runners are very non-ideal for this. But running this on "real" machines is expensive (both in machine and human time). My last attempt to set this up on Titan was a disaster; the jobs kept dying, and I had to keep manually checking to make sure they stayed up. There is a constant operational cost here, so unless someone is going to commit to doing this, we're not going to be able to keep it available long-term.

How about sapling? We could pick a weekend night to run performance tests without interfering with other people's work. My only concern is that the old IB (InfiniBand) on sapling may hide some performance issues, especially for overlapping communication with computation.

  • How do you visualize the results? We used to do this at https://legion.stanford.edu/perf-frontend/perf_chart.html (in fact, that link is still up), but our visualization is very non-scalable (as you can see if you load that link). I'm not aware of any prebuilt interfaces we can use here, which is why I built the above, but it needs more work to actually be useful.

I think we do not need to visualize the results all the time. We can record the "normal" performance for each test, and if the performance of any test changes significantly, e.g. by 15% or more, we get a notification and can then manually check the results with the visualization tool (a rough sketch of such a check is at the end of this comment).

  • How do you detect and notify on the results? Without notifications, you need to manually check the website, which we never did, so ultimately none of the performance monitoring infrastructure ever mattered. I think you'd need to solve this to make it practically useful.

The Open MPI group uses an automated testing tool named mtt for nightly regression testing. The tool sends results to files or SQL databases. I am not sure whether it has any kind of performance monitoring mechanism, but I usually receive emails when tests fail. I can take a look at the tool.
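For what it's worth, the 15% check mentioned above is cheap to implement; here is a rough sketch (the function names, numbers, and where the baseline would be stored are all hypothetical, this is just to illustrate the idea):

```cpp
// Rough illustration of the threshold check described above: flag a test when
// its measured performance deviates from the recorded baseline by 15% or more.
// Function and variable names here are hypothetical.
#include <cmath>
#include <cstdio>

// Returns true if 'current' differs from 'baseline' by at least 'tolerance'
// (0.15 == 15%); the caller would then send a notification so someone can
// inspect the run with the visualization tool.
static bool needs_attention(double baseline, double current,
                            double tolerance = 0.15)
{
  if (baseline <= 0.0)
    return true; // no valid baseline recorded yet
  return std::fabs(current - baseline) / baseline >= tolerance;
}

int main()
{
  // Example: recorded baseline of 100.0 units/s, new run measured at 82.0
  const double baseline = 100.0, current = 82.0;
  if (needs_attention(baseline, current))
    std::printf("performance changed by 15%% or more: %.1f -> %.1f\n",
                baseline, current);
  return 0;
}
```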

lightsighter commented 2 years ago

Here are some code coverage results for Realm as it exists today.

First, the existing Realm and Legion C++ tests, run multi-node with networking and GPUs:

http://sapling.stanford.edu/~mebauer/coverage/coverage.realm.html

Then, the Regent test suite, run multi-node but without GPUs:

http://sapling.stanford.edu/~mebauer/coverage/regent.html

elliottslaughter commented 2 years ago

@lightsighter Is that automated? Should we automate it?

@eddy16112 Sapling would catch some regressions but not others. Recently it seems like most of our bugs/performance issues only appear at large-ish node counts, though there are some occasional code generation bugs or ones related to static optimization (e.g., we've had problems in the past with the static branches for logging levels not working properly).

I don't think watching for large changes only would be sufficient. It would catch large changes, yes, but not smaller changes over time. This is the reason why I want the graph-over-time aspect: it will hopefully make it obvious whether we're on a gradual downward slope, or whether the behavior we're observing is just noise. In the past, we saw a lot of noise and it made it hard to automate anything.

One other thing I'll add is operational complexity. I don't really want our solution to this problem to be, "let's run a database on Sapling". We've had data loss on Sapling multiple times. We literally lost a previous version of the performance tool for this reason. The performance tool that I proposed was specifically designed to use Github as the data store to avoid any operational complexity due to needing to maintain state on our own hardware. That works (and I think pretty decently), but right now the client isn't fast enough to keep up with the data. But if we replace it, I think it will be a step backwards if we do so by installing a persistent data store on a machine somewhere.

lightsighter commented 2 years ago

Is that automated? Should we automate it?

It's not automated right now. It's a bit of a pain to get gcov working, especially with Terra. If we wanted to automate it into the build system that would be good, but it's probably not worth the effort of automating it into the CI.

apryakhin commented 2 years ago

@lightsighter Could you please provide some pointers on how to generate/update the coverage graphs above?

lightsighter commented 2 years ago

The simplest thing to do is just to have gcc dump coverage files for Realm, then run the test suite and post-process the results with gcovr:

https://gcovr.com/en/stable/getting-started.html#getting-started

Mainly you just need to add the --coverage flag when compiling Realm. That should be enough to get started with just the Realm and Legion C++ test suites. Adding the Regent test suite was a bit harder; I don't remember all the steps I had to take to do that, but I remember it wasn't pleasant.