Issue w. boolean images & multi-threading

Lestropie commented 10 years ago

Kerstin originally reported this one, but I've finally been able to reliably reproduce the error and therefore bug hunt.

So the initial problem was that the SF mask output from dwi2response was including some voxels outside the brain. After a lot of modification and a lot of digging, I've eventually traced it all the way back to lib/image/voxel.h:130: Changing from a threaded_copy() to a copy() fixes the problem.

It makes sense: with boolean image data you're reading the existing byte, making a modification, then writing back the byte; if two threads are accessing the same byte at a similar time, strange things can happen.

Reading should be OK; it's purely a problem of writing. So ideally we want to forbid multi-threading if the destination voxel type has a value_type of boolean. But this needs to be as all-encompassing as we can make it; hence why I'm looking for suggestions.

Lestropie commented 10 years ago

Also:

Knock knock! Race condition! Who's there?

jdtournier commented 10 years ago

Crap. I knew this would turn out to be an issue sooner or later... I was hoping since I hadn't come across any issues so far that it would be OK in practice, but Murphy has obviously decreed otherwise.

I assume if this is reproducible, that the axis with shortest stride is second on the list of axes to loop over? That's the only way I can see this issue being reproducible - anything else would be too sporadic.

Anyway, I had been thinking about ways to deal with this, and the write-back changes we made a while back will turn out to be useful to deal with this. We'll need to specialise the Image::Buffer class to use a uint8_t in the backbend, with the final write-back doing the conversion to bitwise in a single thread. Not too sure how to handle the read-only case though, it might be that we have to preload to uint8_t every time to avoid making things ridiculously complicated...

Lestropie commented 10 years ago

Pfft, you should know better than to entrust our fate to our good old mate Murphy :-P

By "reproducible", I mean that if I output an image at every iteration, one out of every few images has one or more sporadic voxels, and the locations always change. So it's still a bugger to test.

Nothing exotic about the data: DWI is [2 3 4 1], mask image comes out as [1 2 3]. I also noticed that the problem almost (but not entirely) disappears when running in debug mode, which is presumably just the slower execution.

Don't think specialising Image::Buffer would be adequate: problem's still going to occur for scratches and preloads. I was thinking more along the lines of handling it at the multi-threading stage: whenever constructing e.g. a ThreadedLoop, detect that the output voxel type has a value_type of boolean, and revert to a standard loop in that instance. It's unlikely that any threaded loop that's writing to a bitwise image is going to be CPU-intensive, so speed shouldn't be an issue. And anything that's not using an in-built templated threading mechanism is then the coder's problem, and should probably be dealt with using a non-threaded sink class.

jdtournier commented 10 years ago

Pfft, you should know better than to entrust our fate to our good old mate Murphy :-P

Thanks for the coding lesson. I'll be sure to bear that in mind in future.

By "reproducible", I mean that if I output an image at every iteration, one out of every few images has one or more sporadic voxels, and the locations always change. So it's still a bugger to test.

I'd assumed as much - race conditions are rarely so kind as to be reproducible. But the fact that the issue was present in some form in repeated runs is what I hadn't come across before.

Nothing exotic about the data: DWI is [2 3 4 1], mask image comes out as [1 2 3]. I also noticed that the problem almost (but not entirely) disappears when running in debug mode, which is presumably just the slower execution.

So the strides of the mask are [1 2 3], but what's the loop doing? Presumably it's looping over some other input image as well, in which case it should use that image's strides to decide what order to use to go over the axes. My guess is the inner-most axis in the loop (the one being executed within the thread) is not the mask's axis of smallest stride. If it were (as it typically would be), the chances of two threads actually writing to the same byte at the same time are very small. The chances of concurrent write would be increased a lot if all threads were writing to similar voxel offsets within the buffer at the same time - which would happen if they were looping over a non-contiguous axis in the mask, since their base offsets (i.e. the start offset of each thread's loop) would then be adjacent.

Don't think specialising Image::Buffer would be adequate: problem's still going to occur for scratches and preloads. I was thinking more along the lines of handling it at the multi-threading stage: whenever constructing e.g. a ThreadedLoop, detect that the output voxel type has a value_type of boolean, and revert to a standard loop in that instance. It's unlikely that any threaded loop that's writing to a bitwise image is going to be CPU-intensive, so speed shouldn't be an issue. And anything that's not using an in-built templated threading mechanism is then the coder's problem, and should probably be dealt with using a non-threaded sink class.

I can't see how we can avoid specialising all of these in some way. The ThreadedLoop has no knowledge of which images are used for input and which for output. I don't see how we can change that without some major ugly hack. Preloads are less of a problem since they're normally used read-only - although they can be used read-write (I think), I don't know of any application that does this (and it might be worth disabling altogether anyway - I can't remember if I ever got round to implementing the write-back for this case...). And more generally, there may very well be legitimate cases were a binary image is written to without using a ThreadedLoop (though to be fair I can't think of any...).

Thankfully, I think we can do this without too much pain. I'd be a matter of writing a couple of private functions that would wrap the allocation and access (and maybe other bits I can't think of right now - on the train), and provide appropriate specialisations for bool. We can then use these throughout for the buffer classes proper, without needing to explicitly specialise each of them (which would be a major headache, I'll grant you).

I'll give it a crack when I get to work...

[ Edited since my email client's automatic line wrapping doesn't play well with GitHub's Markdown... ]

jdtournier commented 10 years ago

By the way, could you tell me what command you were running to reproduce this? And maybe send me the data you were using for testing...? Thanks!

Lestropie commented 10 years ago

The loop is literally the write() function in Image::Voxel. So it's creating an Image::Buffer based on the local image (which is an Image::BufferScratch), and doing a straight threaded_copy(). So the input and output images both have [1 2 3], and I know that the scratch buffer contains uncorrupted data. Although I get where you're coming from RE: axes within the loop, I don't see how these particular data would do anything unusual in that regard.

The ThreadedLoop class may not know the voxel types at construction, but it should once one of the run() functions is called...? Although it wouldn't know input/output... :-/

We might simply be better off targeting only those applications where it's plausible that the output image may be boolean, and doing an explicit branch before any threading is done? There shouldn't be too many cases...

Debugging using dwi2response. If you uncomment cmd/dwi2response.cpp:336, that'll give you some more image outputs per unit time and therefore be more likely to spot an error. Run with -info to see actual voxel count per iteration to compare against mrinfo -mask (this isn't a purely direct comparison between the input and output images, but if you code up a voxel count calculation right before the image is output, it gives the same result). Image data in private message.

jdtournier commented 10 years ago

Everyone, please checkout the multithreaded-binary-image-writes branch (pull request #99), re-run ./configure && ./build, and let me know whether it worked. This would introduce the whole C++11 standard to MRtrix, there is a strong chance something will fall over... Please do this especially if you run MRtrix on a non-Linux system - I need to make sure any issues are ironed out before merging this to master.

jdtournier commented 10 years ago

OK, I guess we need to get this merged now. Unless there are major objections, I'll merge the C++11 branch to master in a couple of days. Note that I've made a few changes to how things are done, so it might be worth you having a look through the commits to figure out what happened... There's bound to be bugs, but things seem to work so far. I guess we won't know till we shove it down people's throats...

Last minute feedback welcome!

Lestropie commented 10 years ago

Haven't been able to get this to compile on my Windows machine.

ERROR: [CC] src/dwi/tractography/tracking/write_kernel.o

g++ -c -std=c++11 -DMRTRIX_WINDOWS -mms-bitfields -march=native -DMRTRIX_WORD64 -Wall -Wno-unused-function -Wno-unused-parameter -O2 -DNDEBUG -Isrc -Icmd -Ilib -Icmd -Ic:/gsl-1.15/include -DHAVE_INLINE src/dwi/tractography/tracking/write_kernel.cpp -o src/dwi/tractography/tracking/write_kernel.o

failed with output

In file included from lib/image/threaded_copy.h:26:0,
                 from lib/image/voxel.h:32,
                 from src/dwi/tractography/roi.h:32,
                 from src/dwi/tractography/properties.h:28,
                 from src/dwi/tractography/file_base.h:34,
                 from src/dwi/tractography/file.h:35,
                 from src/dwi/tractography/tracking/write_kernel.h:33,
                 from src/dwi/tractography/tracking/write_kernel.cpp:24:
lib/image/threaded_loop.h: In instantiation of 'void MR::Image::ThreadedLoop::run(Functor&, VoxelType& ...) [with Functor = MR::Image::{anonymous}::__copy_func; VoxelType = {MR::Image::Voxel<MR::Image::Buffer<float> >, MR::Image::Voxel<MR::Image::BufferPreload<float> >}; typename std::enable_if<(sizeof (VoxelType ...) != 0), int>::type <anonymous> = 0]':
lib/image/threaded_loop.h:416:13:   required from 'void MR::Image::ThreadedLoop::run(Functor&&, VoxelType&& ...) [with Functor = MR::Image::{anonymous}::__copy_func; VoxelType = {MR::Image::Voxel<MR::Image::Buffer<float> >&, MR::Image::Voxel<MR::Image::BufferPreload<float> >&}]'
lib/image/threaded_copy.h:97:9:   required from 'void MR::Image::threaded_copy_with_progress_message(const string&, InputVoxelType&, OutputVoxelType&, size_t, size_t, size_t) [with InputVoxelType = MR::Image::Voxel<MR::Image::Buffer<float> >; OutputVoxelType = MR::Image::Voxel<MR::Image::BufferPreload<float> >; std::string = std::basic_string<char>; size_t = long long unsigned int]'
lib/image/buffer_preload.h:150:121:   required from 'void MR::Image::BufferPreload<ValueType>::do_load(VoxType&) [with VoxType = MR::Image::Voxel<MR::Image::BufferPreload<float> >; ValueType = float]'
lib/image/buffer_preload.h:123:28:   required from 'void MR::Image::BufferPreload<ValueType>::init(const List&) [with ValueType = float; MR::Image::Stride::List = std::vector<long long int>]'
lib/image/buffer_preload.h:49:34:   required from 'MR::Image::BufferPreload<ValueType>::BufferPreload(const string&, const List&) [with ValueType = float; std::string = std::basic_string<char>; MR::Image::Stride::List = std::vector<long long int>]'
src/dwi/tractography/tracking/shared.h:91:28:   required from here
lib/image/threaded_loop.h:411:13: error: call of overloaded 'run_outer(MR::Image::{anonymous}::__RunFunctor<MR::Image::{anonymous}::__copy_func, MR::Image::Voxel<MR::Image::Buffer<float> >, MR::Image::Voxel<MR::Image::BufferPreload<float> > >&, const char [11])' is ambiguous
             run_outer (loop_thread, "run thread");
             ^
lib/image/threaded_loop.h:411:13: note: candidates are:
lib/image/threaded_loop.h:359:16: note: void MR::Image::ThreadedLoop::run_outer(Functor&, const string&) [with Functor = MR::Image::{anonymous}::__RunFunctor<MR::Image::{anonymous}::__copy_func, MR::Image::Voxel<MR::Image::Buffer<float> >, MR::Image::Voxel<MR::Image::BufferPreload<float> > >; std::string = std::basic_string<char>]
           void run_outer (Functor& functor, const std::string& thread_label = "unknown") 
                ^
lib/image/threaded_loop.h:373:16: note: void MR::Image::ThreadedLoop::run_outer(Functor&&, const string&) [with Functor = MR::Image::{anonymous}::__RunFunctor<MR::Image::{anonymous}::__copy_func, MR::Image::Voxel<MR::Image::Buffer<float> >, MR::Image::Voxel<MR::Image::BufferPreload<float> > >&; std::string = std::basic_string<char>]
           void run_outer (Functor&& functor, const std::string& thread_label = "unknown") {
                ^

Tried playing around with it, but without experience with variadic templates & rvalue references I'm afraid it's beyond me for now.

Otherwise am eager to get this going; have a couple of projects that are going to require atomic operations & possibly thread support. Could just branch from C++11 I guess though...

jdtournier commented 10 years ago

Sorry, maybe I should have made this more obvious: the multithreaded_binary_writes branch is gone, merged into the C++11 branch. The changes I was making were starting to have nothing to do with the name of the branch, so I more or less renamed it to C++11... Might be a good idea to run git fetch --all --prune to remove stale branches from your repo...

So yes, please fork from C++11, and if that doesn't work, let me know.

Lestropie commented 10 years ago

Yeah I managed to follow that. That error is as of today on C++11 branch, GCC 4.8.2. Compiles fine on my Arch laptop though. I might try to find a newer MinGW+QT download, see if a compiler update gets it.

jdtournier commented 10 years ago

OK, good. Sorry, didn't get the windows bit - even though it was on the first line of your post...

Sounds like a compiler issue if it works on a more recent version. My experience with GCC is it gets stricter over time, not more permissive, so I tend to get failures on more recent versions rather than the older ones. That said, I'll try to run it with the strict ANSI option, see if I can get it to fail on Arch too.

arnaudbore commented 10 years ago

OK, simple enough fix.

Just one small remaining issue on Windows:

STDERR: [CC] src\dwi/tractography/seeding/dynamic.o
In file included from src/dwi/tractography/seeding/dynamic.h:46:0,
                 from src\dwi/tractography/seeding/dynamic.cpp:24:
src/dwi/tractography/tracking/write_kernel.h: In destructor 'MR::DWI::Tractography::Tracking::WriteKernel::~WriteKernel()':
src/dwi/tractography/tracking/write_kernel.h:73:112: warning: unknown conversion type character 'z' in format [-Wformat=]
               fprintf (stderr, "\r%8zu generated, %8zu selected    [100%%]\n", writer.total_count, writer.count);
                                                                                                                ^
src/dwi/tractography/tracking/write_kernel.h:73:112: warning: unknown conversion type character 'z' in format [-Wformat=]
src/dwi/tractography/tracking/write_kernel.h:73:112: warning: too many arguments for format [-Wformat-extra-args]

jdtournier commented 10 years ago

OK, no idea how that happened, I thought I was logged in as myself... That last comment was from me - jdtournier, Arnaud has nothing to do with this, don't blame him...

Lestropie commented 10 years ago

Huh... Thought I'd already fixed that one. Meh, 1bb94b78 at least removes the warning, will bug out at 4 billion streamlines though. (Don't scoff, I've gone to 2...)

jdtournier commented 10 years ago

Does this help? http://stackoverflow.com/questions/10222899/how-to-print-64-bit-integer-in-gcc-4-4-1

jdtournier commented 10 years ago

Or this one: http://stackoverflow.com/questions/2844/how-do-you-printf-an-unsigned-long-long-int

Lestropie commented 10 years ago

Ah. For some reason I thought long long wasn't fully standardized. I'll see if that compiles happily on both platforms.

jdtournier commented 10 years ago

OK, turns out the fix I did to get the Image::ThreadedLoop to compile on Windows screws things up at runtime in certain cases. Back to the drawing board...

MRtrix3 / mrtrix3

Issue w. boolean images & multi-threading #98