ANTsX / ANTs

Advanced Normalization Tools (ANTs)
Apache License 2.0
1.21k stars 381 forks source link

Limited Use of Multithreading during SyN Registration (b-spline syn) #1017

Closed NitramNimod closed 2 years ago

NitramNimod commented 4 years ago

Hi! When using b-spline syn during ANTs registration only one cpu core is working, while normal SyN behaves in the way it should behave, when properly setting and using the ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS variable. I can see in the resource monitor a short cpu work load spike right at the beginning of each convergence step, but 99% of the time it's single core processing. This behavior is reproducible on a Mac running El Capitan (10.11.6) or an Ubuntu 18.04.4 LTS (GNU/Linux 5.3.0-53-generic x86_64). The Mac is running the latest ANTs version (2.3.3.dev168-g29bdf) compiled this morning, on Ubuntu slightly older (2.3.1.dev120-g2f09e) compiled in March. I've found an older similar thread which is already closed because the problem was solved by compiling against a newer ITK, which here obviously doesn't work. [https://github.com/ANTsX/ANTs/issues/604] As the b-spline syn takes a lot more time a multi-threading would be highly appreciated! :)

Thanks and best regards, Martin

ntustison commented 4 years ago

All multi-threading is handled by ITK. The SyN and B-spline SyN classes are derived from the same parent class with the only difference being the use of B-spline smoothing and that B-spline smoothing class has been there for ~10 years. Did you try compiling a fresh checkout directly on your machine?

NitramNimod commented 4 years ago

Yes, I was using the small installation script that is offered on the ANTs website. I suppose, the git command there will checkout the latest working version. For the Mac a few hours ago… There were no errors during compilation, everything else is working normal. How to solve this? Or what can I do to help?

ntustison commented 4 years ago

It's really hard to say what's going on because, as I said, multi-threading is handled by the ITK foundation. You might want to ask over at the ITK discussion forum.

cookpa commented 4 years ago

I've not come across similar problems since that old thread. Happy to test again with a reproducible example. But my knowledge of the underlying ITK behavior is limited, so we'd probably have to ask over there.

NitramNimod commented 4 years ago

Example in the sense of "data" or "parameters" for an antsRegistration-call?

cookpa commented 4 years ago

A reproducible example would include data and code.

NitramNimod commented 4 years ago

Ok, I've assembled a simple example, I just used the famous "anatomy of an antsRegistration call" call and changed the SyN stage to BSplineSyN:

antsRegistration --dimensionality 3 --float 0 --verbose 1 \ --output [T1w_to_T2Flair,T1w_to_T2_Flair_Warped.nii.gz] \ --interpolation Linear \ --winsorize-image-intensities [0.005,0.995] \ --use-histogram-matching 0 \ --initial-moving-transform [sub-01_T1w_brain.nii.gz,sub-01_FLAIR_brain.nii.gz,1] \ --transform Rigid[0.1] \ --metric MI[sub-01_T1w_brain.nii.gz,sub-01_FLAIR_brain.nii.gz,1,32,Regular,0.25] \ --convergence [1000x500x250x100,1e-6,10] \ --shrink-factors 8x4x2x1 \ --smoothing-sigmas 3x2x1x0vox \ --transform Affine[0.1] \ --metric MI[sub-01_T1w_brain.nii.gz,sub-01_FLAIR_brain.nii.gz,1,32,Regular,0.25] \ --convergence [1000x500x250x100,1e-6,10] \ --shrink-factors 8x4x2x1 \ --smoothing-sigmas 3x2x1x0vox \ --transform BSplineSyN[0.1,1.5,0,3] \ --metric CC[sub-01_T1w_brain.nii.gz,sub-01_FLAIR_brain.nii.gz,1,4] \ --convergence [100x70x50x20,1e-6,10] \ --shrink-factors 8x4x2x1 \ --smoothing-sigmas 3x2x1x0vox

As can be seen I'm trying to coregister a T1-weighted image with a T2 Flair image. For the linear stages the cpu workload always went to the allotted load, but during BSplineSyN it goes to one core on average, spiking in between to 2 or 3 cores. Here's the data, stripped for skull/face and meta data. sub-01_T1w_brain.nii.gz sub-01_FLAIR_brain.nii.gz

Thank you very much, best regards, Martin

ntustison commented 4 years ago

Thanks for providing this. I just ran it on my machine using multithreading and the number of threads stayed the same through all 3 stages. How are you monitoring thread use?

ntustison commented 4 years ago

However, I just noticed that it's running unexpectedly slower than typical and I noticed that you're using a spline distance of 1.5mm for human brains. Where did you get this parameter choice?

gdevenyi commented 4 years ago

I can also confirm I'm seeing single-thread processing during the BSpline Stage: image

gdevenyi commented 4 years ago

And regular SyN uses all cores: image

ntustison commented 4 years ago

Well, this is one possible difference that might point towards an explanation. As I mentioned, all the threading is handled by ITK, so if this is indeed the culprit, there's really nothing we can do on the ANTs side.

cookpa commented 4 years ago

Looks like many threads are being created, but they can't all run in parallel for some reason.

ntustison commented 4 years ago

I take that back. If that were the case, then N4 should see a similar throttling and I haven't noticed anything but should probably check. @gdevenyi --- if you get a chance, can you just verify N4 isn't seeing something similar? And, then, can you run the following on just some random displacement field and see what the thread usage is like?

# Gaussian smoothing
$ SmoothDisplacementField 3 $displacementField $output 3.0 
# Bspline smoothing
$ SmoothDisplacementField 3 $displacementField $output 1x1x1 4
cookpa commented 4 years ago

I adapted the script slightly. I found a similar result, multiple threads are spawned but only one is active. The CPU utilization goes up as I increase the spline spacing. I also turned on --float 1 to see if memory has an impact, but I don't think that made much difference.

export ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=4

if [[ ! -f T1w_to_T2_Flair_0GenericAffine.mat ]]; then
  $HOME/bin/ants-2.3.4/antsRegistration --dimensionality 3 --float 0 --verbose 1 \
    --output [T1w_to_T2_Flair_,T1w_to_T2_Flair_Warped.nii.gz] \
    --interpolation Linear \
    --winsorize-image-intensities [0.005,0.995] \
    --use-histogram-matching 0 \
    --initial-moving-transform [sub-01_T1w_brain.nii.gz,sub-01_FLAIR_brain.nii.gz,1] \
    --transform Rigid[0.1] \
    --metric MI[sub-01_T1w_brain.nii.gz,sub-01_FLAIR_brain.nii.gz,1,32,Regular,0.25] \
    --convergence [1000x500x250x100,1e-6,10] \
    --shrink-factors 8x4x2x1 \
    --smoothing-sigmas 3x2x1x0vox \
    --transform Affine[0.1] \
    --metric MI[sub-01_T1w_brain.nii.gz,sub-01_FLAIR_brain.nii.gz,1,32,Regular,0.25] \
    --convergence [1000x500x250x100,1e-6,10] \
    --shrink-factors 8x4x2x1 \
    --smoothing-sigmas 3x2x1x0vox
fi

$HOME/bin/ants-2.3.4/antsRegistration --dimensionality 3 --float 1 --verbose 1 \
  --output [T1w_to_T2_Flair_Spline,T1w_to_T2_Flair_SplineWarped.nii.gz] \
  --initial-moving-transform T1w_to_T2_Flair_0GenericAffine.mat \
  --transform BSplineSyN[0.1,32,0,3] \
  --metric CC[sub-01_T1w_brain.nii.gz,sub-01_FLAIR_brain.nii.gz,1,4] \
  --convergence [100x70x50x20,1e-6,10] \
  --shrink-factors 8x4x2x1 \
  --smoothing-sigmas 3x2x1x0vox
ntustison commented 4 years ago

Just so people have an idea of the pieces involved:

I wrote much of the B-spline code in ITK except for the interpolator and original FFD-style image registration. All of the higher-level B-spline applications (e.g., N4, B-spline smoothing, B-splinev4/BsplineSyN image registration) use this filter as the underlying fitting/smoothing workhorse. I'm pretty sure that I remember correctly that this is the only filter that has explicit multithreading code. With SyN vs. B-spline SyN, although there are some differences in point-set handling, the only differences for the displacement field-only case is the actual smoothing. SyN uses Gaussian smoothing and BSplineSyN calls this analogous function which calls this filter which simply uses the B-spline workhorse mentioned above. As I mentioned above, N4 uses the same multi-threaded workhorse so if that's working properly, it's really puzzling to me that B-spline SyN wouldn't also work.

gdevenyi commented 4 years ago

SmoothDisplacementField 3 $displacementField $output 3.0 SmoothDisplacementField 3 T1w_to_T2_Flair_1Warp.nii.gz smoothed.nii.gz 1x1x1 4

Both seem to alternate between all cores loaded and a single thread loaded.

The second does more than once, but I think that's because of the "4" levels, whereas the other call only has one.

ntustison commented 4 years ago

I've tried to reproduce for a different number of threads. Attached are two screenshots during the affine stage and bsplinesyn stage where I've specified 4 threads and it looks as expected. So, @cookpa , since you use a Mac as well, I'm guessing you see "n" threads activated in the Threads column of the Activity Monitor but only see a single bar active on your CPU usage monitor, correct? ![Uploading affine.png…](Affine stage) ![Uploading bsplinesyn.png…](B-spline stage)

gdevenyi commented 4 years ago

Looks like you hit the comment button before the picture uploads completed :)

cookpa commented 4 years ago

I'm remoting to my mac, using ps -cM [pid]

USER    PID   TT   %CPU STAT PRI     STIME     UTIME COMMAND
pcook   871 s000  100.0 R    31T   0:01.94   5:04.04 antsRegistration
        871         0.0 S    31T   0:00.37   0:08.00 
        871         0.0 S    31T   0:00.37   0:08.10 
        871         0.0 S    31T   0:00.38   0:08.09 
        871         0.0 S    31T   0:00.37   0:08.03 

There's multiple threads but only one is running, and top shows 100% CPU usage.

ntustison commented 4 years ago

![Uploading affine.png…](Affine stage)

bsplinesyn bsplinesyn
ntustison commented 4 years ago

@cookpa @gdevenyi Thanks.

gdevenyi commented 4 years ago

Limiting the threading with ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=4 does impact affine and SyN, but BSplineSyN is still one thread.

ntustison commented 4 years ago

@gdevenyi I don't understand what you're saying. The second image is of B-spline SyN. Actually both images are of B-spline SyN (my mistake).

gdevenyi commented 4 years ago

**When I change ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=4 on my system, affine and SyN are properly adjusted, BSpineSyN is still single threaded.

ntustison commented 4 years ago

@gdevenyi Okay, I wasn't disputing that. I was simply pointing out that I can't reproduce on my machine.

gdevenyi commented 4 years ago

Okay, so I'm wondering if this is a platform/build system detail here.

I'm on Linux 18.04, with gcc-10.1. Are you using gcc or clang-as-gcc on OSX?

I've also tested ITK_GLOBAL_DEFAULT_THREADER=Platform and ITK_GLOBAL_DEFAULT_THREADER=Pool with the same results. I'll have to rebuild with TBB to see if its different, however TBB with ANTs still segfaults randomly so I don't bother hacking it in right now.

cookpa commented 4 years ago

@ntustison using the script I posted I see CPU usage similar to yours. With the spacing of 1.5mm as in the original example, it hovers around 100%.

ntustison commented 4 years ago

@gdevenyi

(base) [ntustison@Infinite-Resignation Thu Jun 11 11:13:06] $ gcc --version Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/4.2.1 Apple clang version 11.0.0 (clang-1100.0.33.8) Target: x86_64-apple-darwin19.5.0 Thread model: posix InstalledDir: /Library/Developer/CommandLineTools/usr/bin

Okay, @cookpa , interesting. So with four threads it doesn't drop to a single CPU but you do see a disparity

Screen Shot 2020-06-11 at 11 16 44 AM
gdevenyi commented 4 years ago

Will try a build with clang-11 here

ntustison commented 4 years ago

Thanks @gdevenyi

gdevenyi commented 4 years ago

A clang-11 build on 18.04 shows the same symptoms.

ntustison commented 4 years ago

Too bad. But thanks @gdevenyi

chrisadamsonmcri commented 2 years ago

Ok I think I have found the slow piece of code. I added a bunch of timing statements to profile and found that in itkBSplineSyNImageRegistrationMethod.hxx the DisplacementField -> BSpline fitting was particularly slow.

Here is one iteration of a CC image matching

    ComputeUpdateField   ComputeMetricGradientField() Resampling 2:0
    ComputeUpdateField   ComputeMetricGradientField() this->m_Metric->Initialize():0
    ComputeUpdateField   ComputeMetricGradientField() GetValueAndDerivative():200
    ComputeUpdateField   ComputeMetricGradientField() gradientField bit:3
    ComputeUpdateField   ComputeMetricGradientField() 1:246
    ComputeUpdateField   fixedImageMasks[0]: 0
BSplineSmoother
num control points[0]: 26
num control points[1]: 34
num control points[2]: 26
Spline order:3
    ComputeUpdateField   bspliner->Update(): 5127
    ComputeUpdateField   BSplineSmoothDisplacementField() 2: 5128

The timings are in ms. So the Displacement field to BSpline fitting is taking 5 seconds and I did notice it was pegged on 1 CPU. The code lines responsible are in ITKv5/Modules/Registration/RegistrationMethodsv4/include/itkBSplineSyNImageRegistrationMethod.hxx at around line 453 in:

template <typename TFixedImage,
          typename TMovingImage,
          typename TOutputTransform,
          typename TVirtualImage,
          typename TPointSet>
typename BSplineSyNImageRegistrationMethod<TFixedImage, TMovingImage, TOutputTransform, TVirtualImage, TPointSet>::
  DisplacementFieldPointer
  BSplineSyNImageRegistrationMethod<TFixedImage, TMovingImage, TOutputTransform, TVirtualImage, TPointSet>::
    BSplineSmoothDisplacementField(const DisplacementFieldType * field,
                                   const ArrayType &             numberOfControlPoints,
                                   const WeightedMaskImageType * mask,
                                   const BSplinePointSetType *   gradientPointSet)

Here:

  bspliner->SetNumberOfControlPoints(numberOfControlPoints);
  bspliner->SetSplineOrder(this->m_FixedToMiddleTransform->GetSplineOrder());
  bspliner->SetNumberOfFittingLevels(1);
  bspliner->SetEnforceStationaryBoundary(true);
  bspliner->SetEstimateInverse(false);
  bspliner->Update();
ntustison commented 2 years ago

SmoothDisplacementField wraps this functionality. Does it exhibit the same behavior (i.e., slow, single-threaded)? Be sure to specify a n-D vector (e.g., 10x10x10 for a 3-D displacement field) for the variance_or_mesh_size_base_level option or it'll smooth with a Gaussian kernel.

cookpa commented 2 years ago

@chrisadamsonmcri what data and spline spacing are you using to observe the slowdown?

If I recall these experiments correctly, the appropriate number of threads were created but they did not execute in parallel as the spline resolution increased. Once the spline spacing got sufficiently small, the CPU usage approached 100% - making it appear to be single-threaded, but really all the threads were there, they were just running one at a time.

chrisadamsonmcri commented 2 years ago

Here is more information about the call I used

antsRegistration -v -d 3 -u 1 --verbose 1 --float 1 --collapse-output-transforms 1 \
    --initial-moving-transform [${OUTPUTPREFIX}_alberts_to_native_affine_reg0GenericAffine.mat,0] \
    --transform BSplineSyN[0.25,5,0,3] \
    --metric CC[ ${OUTPUTPREFIX}_t2w_restore_brain_dilated.nii.gz,$TEMPLATEDIR/template-40_brain_dilated.nii.gz,0.2,2 ] \
    --convergence [ 50x0,1e-6,100 ] --shrink-factors 2x1 --smoothing-sigmas 2x1vox --masks ${OUTPUTPREFIX}_brain_mask_dilated.nii.gz] \
    --output ${OUTPUTPREFIX}_alberts_to_native_skullstrip_reg

The moving image,$TEMPLATEDIR/template-40_brain_dilated.nii.gz is

dim1        117
dim2        159
dim3        126
dim4        1
pixdim1     0.859375
pixdim2     0.859375
pixdim3     0.859375

The fixed image ${OUTPUTPREFIX}_t2w_restore_brain_dilated.nii.gz is

dim1        182
dim2        243
dim3        177
dim4        1
pixdim1     0.629961
pixdim2     0.629961
pixdim3     0.629961

I can't send the specific images used for this test but I do get consistent behaviour across different images. Hope this helps.

chrisadamsonmcri commented 2 years ago

@ntustison I ran SmoothDisplacementField on a warp that was previously computed with the same size as the fixed image in the previous message

dim1        182
dim2        243
dim3        177
dim4        1
pixdim1     0.629961
pixdim2     0.629961
pixdim3     0.629961
pixdim4     0.000000

I used the following call:

SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1Warp.nii.gz tmp.nii.gz 26x34x26
Elapsed time: 30.6202
RMSE = 0.4206
  rmse[0] = 0.262028
  rmse[1] = 0.241287
  rmse[2] = 0.223518

I took 26x34x26 from the number of control points in my testing post from earlier. Observing top there was a brief period at the start with >500% CPU utilisation but it dropped to 100% for the majority of execution.

Using fewer control points was much faster and remained more multithreaded:

SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1Warp.nii.gz tmp.nii.gz 10x10x10
Elapsed time: 6.84312
RMSE = 0.743995
  rmse[0] = 0.413912
  rmse[1] = 0.436904
  rmse[2] = 0.43606
ntustison commented 2 years ago

Okay, I'd like to focus on the multi-threading (elapsed time is correlated with the number of control points). What do you mean it "remained more multi-threaded"? The number of threads isn't explicitly coded to change during the execution of the filter.

chrisadamsonmcri commented 2 years ago

A bit more information on what I mean by that. I ran top while running the smoother.

At the very start of the smoothing step all threads are active, for about 0.1 seconds:

  15012 addo      20   0 2538072 567748  12368 S  28.0   0.9   0:00.85 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15004 addo      20   0 2538072 567748  12368 S  27.6   0.9   0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15005 addo      20   0 2538072 567748  12368 S  27.6   0.9   0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15006 addo      20   0 2538072 567748  12368 S  27.6   0.9   0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15007 addo      20   0 2538072 567748  12368 S  27.6   0.9   0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15008 addo      20   0 2538072 567748  12368 S  27.6   0.9   0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15009 addo      20   0 2538072 567748  12368 S  27.6   0.9   0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15010 addo      20   0 2538072 567748  12368 S  27.6   0.9   0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15015 addo      20   0 2538072 567748  12368 S  27.6   0.9   0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15017 addo      20   0 2538072 567748  12368 S  27.6   0.9   0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15019 addo      20   0 2538072 567748  12368 S  27.6   0.9   0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15020 addo      20   0 2538072 567748  12368 S  27.6   0.9   0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15021 addo      20   0 2538072 567748  12368 S  27.6   0.9   0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15000 addo      20   0 2538072 567748  12368 S  27.3   0.9   0:00.83 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15011 addo      20   0 2538072 567748  12368 S  27.3   0.9   0:00.83 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15016 addo      20   0 2538072 567748  12368 S  27.3   0.9   0:00.83 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15018 addo      20   0 2538072 567748  12368 S  27.3   0.9   0:00.83 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15003 addo      20   0 2538072 567748  12368 S  27.0   0.9   0:00.82 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15013 addo      20   0 2538072 567748  12368 S  27.0   0.9   0:00.82 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15014 addo      20   0 2538072 567748  12368 S  20.7   0.9   0:00.63 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1Wa

But after that there is only one active thread

 15249 addo      20   0 2538072 568056  12676 R  99.7   0.9   0:07.34 SmoothDisplacem...

It isnt the case that multiple threads are active and each one taking up very little CPU. Perhaps there is a loop that is running in serial? I'll try to put some timing information in the filter itself, will report back.

ntustison commented 2 years ago

Perhaps there is a loop that is running in serial?

Nothing that should take significant time. The displacement field B-spline smoothing filter is basically a wrapper for the more generalized B-spline approximation filter. A for loop in the former iterates through the input field and collects the data into a point-set data structure which calls the latter for fitting and sampling of the B-spline object. In the latter, both the approximation part and reconstruction part should be explicitly multi-threaded.

chrisadamsonmcri commented 2 years ago

It is this loop that is slow ITKv5/Modules/Filtering/ImageGrid/include/itkBSplineScatteredDataPointSetToImageFilter.hxx in BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::UpdatePointSet() at around line 990:

  while (ItIn != this->m_InputPointData->End())
  {

    PointType point;
    point.Fill(0.0);

    input->GetPoint(ItIn.Index(), &point);

    for (unsigned int i = 0; i < ImageDimension; ++i)
    {
      U[i] = static_cast<RealType>(totalNumberOfSpans[i]) * static_cast<RealType>(point[i] - this->m_Origin[i]) /
             (static_cast<RealType>(this->m_Size[i] - 1) * this->m_Spacing[i]);

      if (itk::Math::abs(U[i] - static_cast<RealType>(totalNumberOfSpans[i])) <= epsilon[i])
      {
        U[i] = static_cast<RealType>(totalNumberOfSpans[i]) - epsilon[i];
      }
      if (U[i] < NumericTraits<RealType>::ZeroValue() && itk::Math::abs(U[i]) <= epsilon[i])
      {
        U[i] = NumericTraits<RealType>::ZeroValue();
      }

      if (U[i] < NumericTraits<RealType>::ZeroValue() || U[i] >= static_cast<RealType>(totalNumberOfSpans[i]))
      {
        itkExceptionMacro("The collapse point component "
                          << U[i] << " is outside the corresponding parametric domain of [0, " << totalNumberOfSpans[i]
                          << ").");
      }
    }

    for (int i = ImageDimension - 1; i >= 0; i--)
    {
      if (Math::NotExactlyEquals(U[i], currentU[i]))
      {
        for (int j = i; j >= 0; j--)
        {
          this->CollapsePhiLattice(collapsedPhiLattices[j + 1], collapsedPhiLattices[j], U[j], j);
          currentU[j] = U[j];
        }
        break;
      }
    }

    this->m_OutputPointData->CastToSTLContainer()[ItIn.Index()] = collapsedPhiLattices[0]->GetPixel(startPhiIndex);
    ++ItIn;

  }

This runs in serial. I ran this command

~/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1Warp.nii.gz tmp.nii.gz 26x34x26
Smoothing starting
DisplacementFieldToBSplineImageFilter<TInputImage, TInputPointSet, TOutputImage>::GenerateData()
bsplinePhysicalDomainField: 0
if (this->m_EnforceStationaryBoundary && !this->m_UseInputFieldToDefineTheBSplineDomain): 33
if (inputField) 565
if (inputPointSet) 0
bspliner setup: 0
void BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData()
BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData(); output->Allocate(): 0
BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData(); this->AfterThreadedGenerateData() 954
this->UpdatePointSet(); for (unsigned int i = 0; i < ImageDimension; ++i) 0
this->m_InputPointData->Size() 8064504
this->UpdatePointSet(); while (ItIn != this->m_InputPointData->End()) 28935ms

The output is all the timings I added in. The final 28935ms is the total elapsed time of the loop.

ntustison commented 2 years ago

This is interesting. I have a pretty good guess as to what's happening (although not the "why"). But first can you do me a favor and send me the screen dump from the following (with this image):

N4BiasFieldCorrection -d 3 -i mprage.nii -v 1

using ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=4?

ntustison commented 2 years ago

As a follow-up, I'll explain what I think is happening here. In this section of code, the B-spline object is being evaluated at each input parametric point. This is a relatively computational intense process as one has to find the weighted average based on the local neighborhood control point values and corresponding B-spline basis functions evaluated at that parametric point. For this situation involving 3-D B-spline objects and third order B-splines, evaluation at each point involves the sum of 4 4 4 = 64 control point * basis function products. (Cf. for the corresponding function I wrote in ITK). One can imagine that doing this at each voxel in an image can be quite time consuming.

However, for computational purposes in the BSplineScatteredDataApproximation filter, we took advantage of the fact that the B-spline parametric domain (i.e., ITK image) is a rectilinear grid and iterating along the same row or column maintains constant parametric values in all but one of the parametric directions such that now evaluation requires only the weighted average 4 control point * basis function products with some minor overhead when we change slices or rows during our iterating. That overhead is what the CollapsePhiLattice function is doing. This check is what is used to determine if we've gone to the next rows/columns or slices necessitating recalculation of the basis functions. If this check is erroneously met at every single point, then evaluation will reduce to the naive (i.e., more intense) computation.

I'd like to see what running N4 produces given that the same issue would apply.

chrisadamsonmcri commented 2 years ago

Will do. Is there any code that you would like timed or debug information to be printed?

ntustison commented 2 years ago

Yeah, just let me know the elapsed time given at the end of the screen dump from the command posted above.

chrisadamsonmcri commented 2 years ago

Elapsed time: 16.0856 CPU utilisation was about 350 consistently.

cookpa commented 2 years ago

I also ran the N4 command with the above parameters, and it uses about 350% CPU with 4 threads on my Intel Mac (2.3 GHz Quad-Core Intel Core i7).

Elapsed time: 28.4026
N4BiasFieldCorrection -d 3 -i mprage_hippmapp3r.nii.gz -v 1  100.36s user 0.73s system 353% cpu 28.623 total

If I reduce the spline distance to 5mm, as used in the registration call from @chrisadamsonmcri above, the CPU utilization drops slightly, but isn't bad

Elapsed time: 16.6224
N4BiasFieldCorrection -d 3 -b [ 5 ] -c [ 100,0 ] -i mprage_hippmapp3r.nii.gz   54.78s user 0.69s system 328% cpu 16.862 total

But if I drop the spline resolution to 1.5mm (as in the OP's command at the top of this issue), the utilization is reduced more substantially.

Elapsed time: 45.3176
N4BiasFieldCorrection -d 3 -b [ 1.5 ] -c [ 100,0 ] -i mprage_hippmapp3r.nii.gz  86.11s user 2.91s system 195% cpu 45.566 total
chrisadamsonmcri commented 2 years ago

Ok so I think I get what the optimisation is @ntustison

    for (int i = ImageDimension - 1; i >= 0; i--)
    {
      if (Math::NotExactlyEquals(U[i], currentU[i]))
      {
        for (int j = i; j >= 0; j--)
        {
          this->CollapsePhiLattice(collapsedPhiLattices[j + 1], collapsedPhiLattices[j], U[j], j);
          currentU[j] = U[j];
        }
        break;
      }
    }

In a lattice organisation of control points most of the CollapsePhiLattice evaluations will be skipped since when you move one row or one column 2 of the 3 U[j]'s will be unmodified so you skip it. I counted the times that happened and you get this:

~/dev/ANTs/ANTs-build-with-timings-for-bsplinesyn/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1Warp.nii.gz tmp.nii.gz 26x34x26
DisplacementFieldToBSplineImageFilter<TInputImage, TInputPointSet, TOutputImage>::GenerateData()
bsplinePhysicalDomainField: 0
if (this->m_EnforceStationaryBoundary && !this->m_UseInputFieldToDefineTheBSplineDomain): 31
if (inputField) 553
if (inputPointSet) 0
bspliner setup: 0
void BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData()
BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData(); output->Allocate(): 0
BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData(); this->AfterThreadedGenerateData() 949
this->UpdatePointSet(); for (unsigned int i = 0; i < ImageDimension; ++i) 0
this->m_InputPointData->Size(): 8064504
CollapsePhiLatticeEvaluations: 8064504
NotEqualEvaluations: 13125704
this->UpdatePointSet(); while (ItIn != this->m_InputPointData->End()) 28531
BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData(); this->UpdatePointSet(); 28531
BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData();   if (this->m_DoMultilevel) 0
BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData(); this->m_GenerateOutputImage: 1
BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData(); this->SetPhiLatticeParametricDomainParameters(); 30
void BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData() end 
bspliner->Update(); 29559
Elapsed time: 30.1587
RMSE = 0.4206
  rmse[0] = 0.262028
  rmse[1] = 0.241287
  rmse[2] = 0.223518

NotEqualEvaluations is the count of skipped evaluations which are indeed more than the number of points and indicative that the optimisation is being activated. However, it seems without multithreading performing that many evaluations is slow.

ntustison commented 2 years ago

Hey, thanks for the additional information. Let me mentally process this info a bit more before I take a deeper dive. Right now, one obvious question might be why UpdatePointSet() is not multi-threaded. I wrote this code over 10 years ago and I'm having difficulty remembering why but given all the multi-threading I wrote for the surrounding code, I'm guessing there's a reason. I had a procedure today so I'll postpone any code writing in the short term---previous attempts at sedated code writing were not very successful.