Mbed-TLS / mbedtls-test

MbedTLS testing
Apache License 2.0
4 stars 23 forks source link

Don't run Valgrind in pull request jobs #81

Open gilles-peskine-arm opened 1 year ago

gilles-peskine-arm commented 1 year ago

Valgrind takes a lot of CPU time. This is a problem on the CI in terms of timeouts and cost.

Asan and Msan mostly find the same things that Valgrind finds, but we've occasionally had bugs that Valgrind found but neither Asan nor Msan. So Valgrind has some usefulness, but not a lot.

Proposal up for discussion: run Valgrind only in nightly jobs, not in pull request jobs.

tom-daubney-arm commented 1 year ago

Do we have enough data to have an estimate of how frequently Valgrind finds something that Asan and Msan have missed?

gilles-peskine-arm commented 1 year ago

Unfortunately, no, we don't have data. CI logs take a lot of room so we purge them pretty quickly. This may appear in the Git history or in pull request comments, but finding relevant things would be difficult. Also, if a developer catches something locally, that wouldn't be reflected anywhere.

I do remember that a few years ago, we wondered about removing Valgrind testing completely, but decided against it because it had found something that no other tool found. That was about memory management. For constant-trace, I don't remember a bug that Valgrind found and Msan didn't, but we don't have a lot of constant-trace-tested code so we don't really have much experience.

tom-daubney-arm commented 1 year ago

Thanks for the information. So for me personally I think the threshold frequency of Valgrind-only bugs vs bugs found by all of the tools would have to be quite high (say more than 10%, arbitrary I know) to justify having Valgrind in every CI run, given the effect that running Valgrind has had on the CI. I think relegating Valgrind to run only on the nightly is an acceptable trade-off and therefore I support removing Valgrind from PR jobs.

mpg commented 1 year ago

The only case I remember is this bug in MSan where it has no problem with one of the argument points to memcpy() being derived from an uninitialized value, which I ran into when working on constant-flow Lucky 13 counter-measure (we also abuse MSan as a constant-flow checker the same way we do with valgrind), and is apparently unlikely to be fixed soon.

For what it's worth, it was not an actual bug that valgrind found and MSan missed, it was when I first wrote the tests before we had a constant-flow implementation, so I was expecting the tests to fail, and to my surprise one of them passed with MSan.

Anyway, I think it's fair to say that valgrind finding something MSan+ASan doesn't has been happening with frequency clearly less than once a year over the last few years.

daverodgman commented 1 year ago

IMO, the marginal gain from running Valgrind against every PR is extremely small (assuming we retain it in the nightlies). On the other hand, we have a pressing need to improve CI performance. So I am in favour of removing from the PR jobs.