Open daljit46 opened 8 months ago
I found an implementation of the natural logarithm of the gamma function in the Ruby language repository. After substituting std::lgamma with that implementation, the race condition seems to disappear.
The other alternative is to use Boost, but that may be overkill for our needs.
Yikes. Okay. I was not aware of the std::lgamma
non-thread-safety. Having said that, I'd have thought that the code would be okay: if the surrounding code was doing as intended, it should have precluded this. So hopefully the issue can be rectified there.
The transformation from t-statistic / F-statistic to Z-statistic is computationally expensive. So I use pre-generation of a lookup table, on which linear interpolation is then performed. In the TestFixedHomoscedastic
case here, there's only one such lookup table, since the degrees of freedom is always the same. However, because there are other more complex cases (different DoF for different elements being tested), I don't immediately generate this table up-front; instead, I generate a table whenever an element is tested that has a DoF for which a table has not yet been generated.
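The lookup step itself is just a uniform-grid linear interpolation. A minimal sketch of that idea (hypothetical function, not the MRtrix3 code; the caller is assumed to keep x within the sampled range):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: read a pre-generated table with linear interpolation.
// `x0` is the abscissa of table[0] and `dx` the (uniform) sample spacing;
// the caller must ensure x lies within the sampled range.
double lookup(const std::vector<double> &table, double x, double x0, double dx) {
  const double pos = (x - x0) / dx;  // fractional index into the table
  const std::size_t i = static_cast<std::size_t>(pos);
  const double frac = pos - static_cast<double>(i);
  return table[i] * (1.0 - frac) + table[i + 1] * frac;
}
```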
What I suspect it doesn't like is the fact that I'm trying to avoid mutexing every single time I read from one of those tables. What I do is (code):
So is the problem that the release of the mutex lock is insufficient for the newly generated data to become thread-synchronized? There's probably some other more appropriate design pattern. Been too long since I played with threading primitives. But since these transformations are done for every element (eg. voxel, fixel) for every permutation, I would like for it to be possible to read from shared data without having to thread lock every single time.
This is still only for the fixed homoscedastic case. If std::lgamma() isn't thread-safe, then over and above whether the current code is doing as intended, I shouldn't be permitting two different threads to be generating two different tables at the same time.
It's also currently possible to disable use of the lookup tables by commenting this line. If std::lgamma() is not thread safe on Linux, then it shouldn't be permissible to compile with that line uncommented. Unless we now include a non-std lgamma() implementation.
I shouldn't be permitting two different threads to be generating two different tables at the same time.
Yes, effectively we need to provide a global mutex (or equivalent logic) that protects calls to lgamma in the entire app (which I think is what the Mac implementation is doing anyway).
This simple piece of code compiled with TSAN reports a race condition:
```cpp
#include <cmath>
#include <iostream>
#include <thread>

int main() {
  std::thread t1([] {
    for (int i = 0; i < 100000000; ++i)
      std::lgamma(i);
  });
  std::thread t2([] {
    for (int i = 0; i < 100000000; ++i)
      std::lgamma(i);
  });
  t1.join();
  t2.join();
  std::cout << "Done" << std::endl;
}
```
So I think we need to find an alternative to lgamma, or just ensure that no two threads are calling it simultaneously.
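The bluntest form of the latter would be a wrapper that serialises every call behind a single global mutex (a sketch for illustration, not a recommendation; `lgamma_locked` is a hypothetical name):

```cpp
#include <cmath>
#include <mutex>

// Thread-safe wrapper: serialises all calls to std::lgamma behind one mutex,
// protecting the shared signgam state at the cost of contention.
double lgamma_locked(double x) {
  static std::mutex m;
  std::lock_guard<std::mutex> lock(m);
  return std::lgamma(x);
}
```

This is correct but forces every worker thread through the same lock, which is exactly the cost the lookup tables were designed to avoid.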
I wrote a small microbenchmark here to get a vague idea of how that implementation in the Ruby repo might perform, and it appears reasonably fast compared to lgamma. So using that may be an option (I'm not sure what its accuracy guarantees are, but since it's by the Ruby guys it may be reliable enough).
The code around the t->Z and F->Z conversions is complex precisely because of numerical accuracy problems. Indeed I added a unit test for the erfinv() function for that reason. So if changing one of the underlying functions, I'd definitely need to verify closely.
My concern is that addressing having two threads generating tables at the same time may not fully resolve the issue: in the use case you pasted, there should only need to be one table generated, yet a problem is still reported. So presumably two threads are generating the same table at the same time despite the current checks.
Perhaps we should avoid std::lgamma() as a first pass, given that I'd like for it to be possible to run without use of the lookup tables; then, if there are still detected threading problems, I can look into the shared data synchronization.
I wrote a small microbenchmark here
Your benchmark runs the log-abs-gamma against log-gamma only, across positive values only. It's been too long since I did this work to recall whether compatibility with negative values is required here; but at least the former will have a code branch that the latter doesn't, which could bias the benchmark.
RE code structure:
I'm way out of practice here. I want threads to be able to generate these as needed, but for each unique table to be generated only once, by whichever thread needs it first, and for other threads to then synchronise with the shared data and be able to make use of it. And only one thread should be generating a table at any given time (unlike the intent of the current code, which tries to allow different threads to generate different tables at the same time). Importantly, data reads need to be cheap and non-locking wherever possible.
So I think I need something like:
?
Looking at the log gamma function and where it's invoked from, the only way that the negative sign bit could be set would be if either the DoF or the rank were to be between 2 and 4.
It might not necessarily be the Linux / Mac distinction: if MRTRIX_HAVE_EIGEN_UNSUPPORTED_SPECIAL_FUNCTIONS is set, then MR::Math::betaincreg(double, double, double) won't be called; an internal Eigen function is used instead.
On that point, Eigen itself has an lgamma() function, and that appears to be in the main Eigen namespace, not Unsupported. So that might be the better solution.
If we were to grab Eigen code in cmake as per #2584, we could hard-code to grab a version that includes unsupported functions, in which case the std::lgamma() calls would disappear.
I misread the definition of lgamma: what it does is compute the natural log of the absolute value of the gamma function. So the benchmark above is useless and silly. Here's a better benchmark, which tests values between -4 and 4 (skipping -3, -2, -1, as $\Gamma(z)$ is undefined there). The alternative implementation seems to be slightly slower, but not by much (plus it's reentrant, so can be safely used from multiple threads). Of course this small test is not really indicative of actual performance and far from conclusive. Only proper testing within our codebase will tell us about any performance differences.
To evaluate precision, I wrote a very simple (and naive) program to see how the value of the alternative implementation differs from std::lgamma here, and found that we can expect a difference in values on the order of $10^{-11}$. Of course, this is by no means conclusive, but it seems promising.
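Not the linked program, but a naive comparison of this kind can be sketched as follows; here lgamma_r merely stands in for an alternative thread-safe implementation, since the actual candidate isn't reproduced in this thread:

```cpp
#include <algorithm>
#include <cmath>

extern "C" double lgamma_r(double, int *);  // POSIX reentrant variant

// Naive precision probe: maximum absolute difference between std::lgamma
// and a candidate replacement over a grid of positive values.
double max_lgamma_diff() {
  double max_diff = 0.0;
  for (double x = 0.1; x < 100.0; x += 0.1) {
    int sign = 0;
    max_diff = std::max(max_diff, std::fabs(std::lgamma(x) - lgamma_r(x, &sign)));
  }
  return max_diff;
}
```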
It might not necessarily be the Linux / Mac distinction: if MRTRIX_HAVE_EIGEN_UNSUPPORTED_SPECIAL_FUNCTIONS is set, then MR::Math::betaincreg(double, double, double) won't be called, an internal Eigen function is used instead.
In my testing, this was disabled. Furthermore, I cannot reproduce the race condition even with the simple two-threaded program I posted above on MacOS, but I can on Linux.
If we were to grab Eigen code in cmake as per https://github.com/MRtrix3/mrtrix3/issues/2584, we could hard-code to grab a version that includes unsupported functions, in which case the std::lgamma() calls would disappear.
That's good to hear! If Eigen has an independent implementation then using it would also be my preferred solution. I just hope that internally it doesn't call the OS' implementation of lgamma. Will test later.
In the way that std::lgamma() is used inside MR::Math::betaincreg(), it's based on the degrees of freedom in the model and the rank of the hypothesis, both of which are strictly positive (and DoF could be up to, hypothetically, say, 10,000).

There is some concern that the source of the code (website; repository) may be unaware of the distinction, and be using std::lgamma() erroneously, not realising that it yields the log of the absolute value...
OK, so testing with MRTRIX_HAVE_EIGEN_UNSUPPORTED_SPECIAL_FUNCTIONS, I can see that the data race is no longer there. In the implementation of Eigen::lgamma I can see that they are using lgamma_r, which is the re-entrant version of lgamma. If this is not present, they still call lgamma internally.

Looking at their code, lgamma_r should be available on Linux with glibc >= 2.19, but not on Apple platforms. However, on macOS the function seems to be implemented using a global mutex (e.g. see here), thus thread-safe but inherently slower.
Should we just stick with the Eigen implementation and effectively keep using lgamma on macOS? The other alternative is to use a different implementation of lgamma (other than the Ruby repo, I also found an implementation in scipy).

We could also do manual synchronisation ourselves, but I'm not sure the effort would be worth it.
The bigger question is, in the case of having MRTRIX_HAVE_EIGEN_UNSUPPORTED_SPECIAL_FUNCTIONS, whether Eigen::betainc() actually makes use of any form of lgamma(). If the implementation of that function doesn't rely on lgamma() in any way, and we can set up the compilation environment to guarantee its presence, then lgamma() becomes a red herring, since there would no longer be any dependence on it.
whether Eigen::betainc() actually makes use of any form of lgamma().
This seems to be the case (e.g. see here).
OK, so there's not likely to be any clean solution without downsides here. Options:

- Force use of Eigen::betainc() via automatic build dependency (currently only used if available on system).
- Change to using lgamma_r(): it is missing only on a handful of platforms. The page you linked makes no mention of lgamma_r(), so maybe those posters were just unaware of it. If it were not for the mutex locking on Mac, this would be my preference. Can you confirm for sure that lgamma_r() is absent on Mac? Because if it's there, we could, as a quick fix, change to using that function within MR::Math::betainc(), with a vision of possibly removing that function later.
- Provide our own implementation of the log of the absolute value of the gamma function to replace std::lgamma().
- Flag the threading race condition to be ignored by the sanitiser (we never read from the sign bit, so don't care if there is a race in what writes to it, or when).
Notably the Eigen code seems to be using lgamma() in a similar way. So if the signedness of the gamma function is relevant, Eigen has made the same error. Which probably makes it more likely that it's not a problem.
I think I've been tying myself in knots again. It's not about whether the logarithm of the gamma function is negative, but whether the gamma function itself is negative. Here the inputs (DoF / rank) are exclusively positive, in which case the gamma function is always positive. So in our use case, what's written to the sign bit is not only never read from, but should also be 1 each and every time.
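That claim is easy to sanity-check against the reentrant variant, which reports the sign of $\Gamma(x)$ through its second argument (a sketch assuming lgamma_r is available on the platform; `gamma_sign` is a hypothetical helper):

```cpp
#include <cmath>

// lgamma_r is POSIX, not standard C++, so declare it explicitly here
// (assumption: the platform provides it, e.g. glibc >= 2.19 or macOS libc).
extern "C" double lgamma_r(double, int *);

// Returns the sign of Gamma(x) as reported through lgamma_r's second argument.
int gamma_sign(double x) {
  int sign = 0;
  lgamma_r(x, &sign);
  return sign;
}
```

For any strictly positive argument the reported sign is +1; only negative arguments (where the gamma function alternates sign between the poles) can set it to -1.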
So I think the solution here is actually to tell thread sanitiser to ignore this race condition. It's inconsequential, and we don't want to incur mutex locking on Mac.
Also, the regularised incomplete beta function is defined only for a, b > 0.
@daljit46 I might need your help with cmake here.
What I think I want to do is:
[ ] Have MR::Math::betaincreg() throw an exception not only on x being outside of [0.0, 1.0], but also on a or b not being positive
[ ] If MRTRIX_HAVE_EIGEN_UNSUPPORTED_SPECIAL_FUNCTIONS is not defined (I'm currently not seeing where / how this is being set):
    [ ] Determine whether std::lgamma_r() is present, and set an environment variable accordingly
    [ ] In MR::Math::betaincreg(), #ifdef to use std::lgamma_r() rather than std::lgamma() if available
[ ] Either:
Previously in configure we'd specify our own test commands to attempt to compile and run, and set envvars accordingly. With cmake it looks like find_package() and target_compile_definitions() are doing a lot of the heavy lifting, so I'm not immediately seeing a template that I can modify to evaluate whether this function is available.
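For what it's worth, cmake does ship a module for exactly this kind of compile probe: CheckCXXSourceCompiles. A hedged sketch (the cache variable name MRTRIX_HAVE_LGAMMA_R and the target name mrtrix-core are hypothetical placeholders):

```cmake
include(CheckCXXSourceCompiles)

# Probe for the POSIX reentrant variant of lgamma().
check_cxx_source_compiles("
  extern \"C\" double lgamma_r(double, int *);
  int main() { int sign = 0; lgamma_r(5.5, &sign); return 0; }
" MRTRIX_HAVE_LGAMMA_R)

if(MRTRIX_HAVE_LGAMMA_R)
  # 'mrtrix-core' is a placeholder for whichever target compiles betainc.cpp.
  target_compile_definitions(mrtrix-core PRIVATE MRTRIX_HAVE_LGAMMA_R)
endif()
```

The definition can then gate the #ifdef between std::lgamma() and lgamma_r() in the source.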
So I think the solution here is actually to tell thread sanitiser to ignore this race condition. It's inconsequential, and we don't want to incur mutex locking on Mac.
My opinion is that we should try to avoid this "solution". The problem is that in C++ a data race is undefined behaviour, and there are virtually zero guarantees that there will be no fatal consequences. Thus a C++ program is allowed to crash, or worse, perform a nonsensical transformation (or even set fire to your device) without contradicting the ISO C++ specification. Even if we know that a specific implementation of the data race on a certain platform behaves as expected, there is no guarantee that this will hold with future versions of the compiler and the platform.
Additionally, I see that lgamma_r is available in libc on Mac OSX (at least since 10.5). So we could explicitly use it like this:
```cpp
#include <iostream>

extern "C" double lgamma_r(double, int *);

int main() {
  int sign;
  std::cout << lgamma_r(5.5, &sign) << std::endl;
  return 0;
}
```
OK, I've had a quick look into this, and came across much the same posts as you already have. The main thing I was looking into was whether a data race that involves writes only, with no intended or actual reading of the racing variable, was likely to be a problem. As far as I can tell, all the examples of undefined behaviour I've seen involve getting the wrong value of the variable - which may indeed have nasty side effects - but only if the code actually ever needs to read the variable. If all the code does is write to the variable, it really doesn't matter how it reorders operations - at least I can't see how any of the examples I've come across would apply.
There is however one interesting example that involves re-ordering operations and speculative early loading - and that seems to apply to precisely the use case that @Lestropie was talking about in this previous post, with this pseudo-code:
- Check for the presence of the required table
- If it doesn't exist:
- Acquire a mutex lock
- Check again for its presence (chance another thread may have created it between first check and our acquiring the lock)
- If it still doesn't exist, generate it
- Make use of table
This looks similar to the example in section 2.1 Double checks for lazy initialization of Hans-J. Boehm's How to miscompile programs with “benign” data races paper. Maybe that's all we need to fix, assuming the generation of the table is then guaranteed to be single-threaded...?
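The standard remedy for that double-checked pattern in C++ is to publish the pointer through an atomic with release/acquire ordering, so the fast-path check is well-defined and lock-free. A sketch with hypothetical names (single table for simplicity; the real code would hold one per DoF):

```cpp
#include <atomic>
#include <mutex>
#include <vector>

using Table = std::vector<double>;  // stand-in for the real lookup table

std::atomic<const Table *> g_table{nullptr};
std::mutex g_table_mutex;

const Table &get_table() {
  // Fast path: acquire-load. If non-null, the release-store below guarantees
  // the table's contents are fully visible to this thread -- no lock needed.
  const Table *t = g_table.load(std::memory_order_acquire);
  if (!t) {
    std::lock_guard<std::mutex> lock(g_table_mutex);
    t = g_table.load(std::memory_order_relaxed);  // re-check under the lock
    if (!t) {
      // Deliberately leaked: the table lives for the whole process.
      auto *fresh = new Table(16, 1.0);  // placeholder for the real generation
      g_table.store(fresh, std::memory_order_release);  // publish when complete
      t = fresh;
    }
  }
  return *t;
}
```

The acquire/release pair is what the plain-pointer version of the pattern lacks: without it, a compiler or CPU may let the speculative early load observe the pointer before the table contents are written.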
The lookup table generation may be a more difficult fix than just the use of std::lgamma() vs. lgamma_r().
The theory there is that:
Therefore there may be different t->Z / F->Z transforms required.
What the current code is attempting to achieve is: for any given rank / degrees of freedom, the lookup table should be generated just once, and then made available for all threads to read from. The current design is however seemingly Bad.
Options:
Thread sanitizer has reported another data race in our codebase. Here is the output:

The issue seems to be triggered in the function std::lgamma in core/math/betainc.cpp:65. It's not fully clear to me whether this is a problem with our code or not (addressing #2795 would make the code much more readable and easier to debug). However, it seems likely that this is an issue in the function itself, as apparently std::lgamma is not guaranteed to be thread safe (e.g. see here). From cppreference:

It's worth noting that I can only reproduce this on Linux, while on macOS there is no race condition. This helps corroborate the idea that the issue lies in the specific implementation of lgamma on Linux.