All multi-threading is handled by ITK. The SyN and B-spline SyN classes are derived from the same parent class with the only difference being the use of B-spline smoothing and that B-spline smoothing class has been there for ~10 years. Did you try compiling a fresh checkout directly on your machine?
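For reference, a fresh build is roughly the usual SuperBuild sequence (a sketch only; exact paths and cmake flags may differ from what the install script does):
git clone https://github.com/ANTsX/ANTs.git
mkdir ANTs-build && cd ANTs-build
cmake ../ANTs        # SuperBuild: fetches and builds the pinned ITK, then ANTs
make -j 4
cd ANTS-build && make install    # set -DCMAKE_INSTALL_PREFIX=... above to control where the binaries go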
Yes, I was using the small installation script that is offered on the ANTs website; I suppose the git command there checks out the latest working version. For the Mac, that was a few hours ago. There were no errors during compilation, and everything else is working normally. How can this be solved? Or what can I do to help?
It's really hard to say what's going on because, as I said, multi-threading is handled by the ITK foundation. You might want to ask over at the ITK discussion forum.
I've not come across similar problems since that old thread. Happy to test again with a reproducible example. But my knowledge of the underlying ITK behavior is limited, so we'd probably have to ask over there.
Example in the sense of "data" or "parameters" for an antsRegistration-call?
A reproducible example would include data and code.
Ok, I've assembled a simple example: I took the well-known "anatomy of an antsRegistration call" example and changed the SyN stage to BSplineSyN:
antsRegistration --dimensionality 3 --float 0 --verbose 1 \
--output [T1w_to_T2Flair,T1w_to_T2_Flair_Warped.nii.gz] \
--interpolation Linear \
--winsorize-image-intensities [0.005,0.995] \
--use-histogram-matching 0 \
--initial-moving-transform [sub-01_T1w_brain.nii.gz,sub-01_FLAIR_brain.nii.gz,1] \
--transform Rigid[0.1] \
--metric MI[sub-01_T1w_brain.nii.gz,sub-01_FLAIR_brain.nii.gz,1,32,Regular,0.25] \
--convergence [1000x500x250x100,1e-6,10] \
--shrink-factors 8x4x2x1 \
--smoothing-sigmas 3x2x1x0vox \
--transform Affine[0.1] \
--metric MI[sub-01_T1w_brain.nii.gz,sub-01_FLAIR_brain.nii.gz,1,32,Regular,0.25] \
--convergence [1000x500x250x100,1e-6,10] \
--shrink-factors 8x4x2x1 \
--smoothing-sigmas 3x2x1x0vox \
--transform BSplineSyN[0.1,1.5,0,3] \
--metric CC[sub-01_T1w_brain.nii.gz,sub-01_FLAIR_brain.nii.gz,1,4] \
--convergence [100x70x50x20,1e-6,10] \
--shrink-factors 8x4x2x1 \
--smoothing-sigmas 3x2x1x0vox
As can be seen, I'm trying to coregister a T1-weighted image with a T2 FLAIR image. For the linear stages the CPU load always reached the allotted number of cores, but during BSplineSyN it averages one core, with occasional spikes to 2 or 3 cores. Here's the data, stripped of skull/face and metadata: sub-01_T1w_brain.nii.gz sub-01_FLAIR_brain.nii.gz
Thank you very much, best regards, Martin
Thanks for providing this. I just ran it on my machine using multithreading and the number of threads stayed the same through all 3 stages. How are you monitoring thread use?
However, I just noticed that it's running noticeably slower than typical, and that you're using a spline distance of 1.5mm for human brains. Where did you get this parameter choice?
I can also confirm I'm seeing single-threaded processing during the B-spline stage:
And regular SyN uses all cores:
Well, this is one possible difference that might point towards an explanation. As I mentioned, all the threading is handled by ITK, so if this is indeed the culprit, there's really nothing we can do on the ANTs side.
Looks like many threads are being created, but they can't all run in parallel for some reason.
I take that back. If that were the case, then N4 should see a similar throttling and I haven't noticed anything but should probably check. @gdevenyi --- if you get a chance, can you just verify N4 isn't seeing something similar? And, then, can you run the following on just some random displacement field and see what the thread usage is like?
# Gaussian smoothing
$ SmoothDisplacementField 3 $displacementField $output 3.0
# Bspline smoothing
$ SmoothDisplacementField 3 $displacementField $output 1x1x1 4
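Something along these lines would do for the comparison (a minimal sketch; assumes $displacementField points at an existing 3-D warp and $output is a scratch file):
export ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=4
echo "Gaussian smoothing:"
time SmoothDisplacementField 3 $displacementField $output 3.0
echo "B-spline smoothing:"
time SmoothDisplacementField 3 $displacementField $output 1x1x1 4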
I adapted the script slightly and found a similar result: multiple threads are spawned but only one is active. The CPU utilization goes up as I increase the spline spacing. I also turned on --float 1 to see if memory has an impact, but I don't think that made much difference.
export ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=4
if [[ ! -f T1w_to_T2_Flair_0GenericAffine.mat ]]; then
$HOME/bin/ants-2.3.4/antsRegistration --dimensionality 3 --float 0 --verbose 1 \
--output [T1w_to_T2_Flair_,T1w_to_T2_Flair_Warped.nii.gz] \
--interpolation Linear \
--winsorize-image-intensities [0.005,0.995] \
--use-histogram-matching 0 \
--initial-moving-transform [sub-01_T1w_brain.nii.gz,sub-01_FLAIR_brain.nii.gz,1] \
--transform Rigid[0.1] \
--metric MI[sub-01_T1w_brain.nii.gz,sub-01_FLAIR_brain.nii.gz,1,32,Regular,0.25] \
--convergence [1000x500x250x100,1e-6,10] \
--shrink-factors 8x4x2x1 \
--smoothing-sigmas 3x2x1x0vox \
--transform Affine[0.1] \
--metric MI[sub-01_T1w_brain.nii.gz,sub-01_FLAIR_brain.nii.gz,1,32,Regular,0.25] \
--convergence [1000x500x250x100,1e-6,10] \
--shrink-factors 8x4x2x1 \
--smoothing-sigmas 3x2x1x0vox
fi
$HOME/bin/ants-2.3.4/antsRegistration --dimensionality 3 --float 1 --verbose 1 \
--output [T1w_to_T2_Flair_Spline,T1w_to_T2_Flair_SplineWarped.nii.gz] \
--initial-moving-transform T1w_to_T2_Flair_0GenericAffine.mat \
--transform BSplineSyN[0.1,32,0,3] \
--metric CC[sub-01_T1w_brain.nii.gz,sub-01_FLAIR_brain.nii.gz,1,4] \
--convergence [100x70x50x20,1e-6,10] \
--shrink-factors 8x4x2x1 \
--smoothing-sigmas 3x2x1x0vox
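As a rough cross-check of parallel efficiency (not part of the run above), the deformable stage can also be timed on its own; with 4 threads, the user/real ratio reported by time should approach 4 if the stage is truly multi-threaded. A sketch, assuming the affine .mat from the first block exists; the output prefix and the short single-level schedule are just placeholders:
export ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=4
time $HOME/bin/ants-2.3.4/antsRegistration --dimensionality 3 --float 1 --verbose 1 \
--output [bsplinesyn_efficiency_,bsplinesyn_efficiency_Warped.nii.gz] \
--initial-moving-transform T1w_to_T2_Flair_0GenericAffine.mat \
--transform BSplineSyN[0.1,32,0,3] \
--metric CC[sub-01_T1w_brain.nii.gz,sub-01_FLAIR_brain.nii.gz,1,4] \
--convergence [20,1e-6,10] \
--shrink-factors 1 \
--smoothing-sigmas 0vox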
Just so people have an idea of the pieces involved:
I wrote much of the B-spline code in ITK, except for the interpolator and the original FFD-style image registration. All of the higher-level B-spline applications (e.g., N4, B-spline smoothing, the v4 B-spline/BSplineSyN image registration) use this filter as the underlying fitting/smoothing workhorse. I'm pretty sure I remember correctly that this is the only filter that has explicit multithreading code. With SyN vs. B-spline SyN, although there are some differences in point-set handling, the only difference for the displacement-field-only case is the actual smoothing: SyN uses Gaussian smoothing, and BSplineSyN calls the analogous function, which calls this filter, which simply uses the B-spline workhorse mentioned above. As I mentioned above, N4 uses the same multi-threaded workhorse, so if that's working properly, it's really puzzling to me that B-spline SyN wouldn't also work.
SmoothDisplacementField 3 $displacementField $output 3.0
SmoothDisplacementField 3 T1w_to_T2_Flair_1Warp.nii.gz smoothed.nii.gz 1x1x1 4
Both seem to alternate between all cores loaded and a single thread loaded. The second one does so more than once, but I think that's because of the "4" levels, whereas the other call only has one.
I've tried to reproduce this with a different number of threads. Attached are two screenshots taken during the affine and BSplineSyN stages with 4 threads specified, and it looks as expected. So, @cookpa, since you use a Mac as well, I'm guessing you see "n" threads activated in the Threads column of the Activity Monitor but only a single bar active on your CPU usage monitor, correct? (Screenshot uploads: affine stage, B-spline stage; the images did not finish uploading.)
Looks like you hit the comment button before the picture uploads completed :)
I'm remoting into my Mac, using ps -cM [pid]:
USER PID TT %CPU STAT PRI STIME UTIME COMMAND
pcook 871 s000 100.0 R 31T 0:01.94 5:04.04 antsRegistration
871 0.0 S 31T 0:00.37 0:08.00
871 0.0 S 31T 0:00.37 0:08.10
871 0.0 S 31T 0:00.38 0:08.09
871 0.0 S 31T 0:00.37 0:08.03
There's multiple threads but only one is running, and top shows 100% CPU usage.
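(For the Linux side, a rough equivalent per-thread view; this assumes antsRegistration is the only matching process:)
top -H -p $(pgrep -n antsRegistration)   # -H shows one row per thread
ps -eLf | grep antsRegistration          # the LWP column lists the individual threads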
(Screenshot: affine stage.)
@cookpa @gdevenyi Thanks.
Limiting the threading with ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=4 does impact affine and SyN, but BSplineSyN is still one thread.
@gdevenyi I don't understand what you're saying. The second image is of B-spline SyN. Actually both images are of B-spline SyN (my mistake).
When I set ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=4 on my system, affine and SyN are properly adjusted, but BSplineSyN is still single-threaded.
@gdevenyi Okay, I wasn't disputing that. I was simply pointing out that I can't reproduce on my machine.
Okay, so I'm wondering if this is a platform/build system detail here.
I'm on Ubuntu 18.04, with gcc-10.1. Are you using gcc or clang-as-gcc on OSX?
I've also tested ITK_GLOBAL_DEFAULT_THREADER=Platform and ITK_GLOBAL_DEFAULT_THREADER=Pool with the same results. I'll have to rebuild with TBB to see if it's different; however, TBB with ANTs still segfaults randomly, so I don't bother hacking it in right now.
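For the record, the backend comparison was along these lines (a sketch; $displacementField and $output are placeholders, and SmoothDisplacementField prints its own elapsed time):
export ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=4
for threader in Platform Pool; do
  echo "== ITK_GLOBAL_DEFAULT_THREADER=${threader} =="
  env ITK_GLOBAL_DEFAULT_THREADER=${threader} \
    SmoothDisplacementField 3 $displacementField $output 1x1x1 4
done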
@ntustison using the script I posted I see CPU usage similar to yours. With the spacing of 1.5mm as in the original example, it hovers around 100%.
@gdevenyi
(base) [ntustison@Infinite-Resignation Thu Jun 11 11:13:06] $ gcc --version
Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/4.2.1
Apple clang version 11.0.0 (clang-1100.0.33.8)
Target: x86_64-apple-darwin19.5.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
Okay, @cookpa, interesting. So with four threads it doesn't drop to a single CPU, but you do see a disparity.
Will try a build with clang-11 here
Thanks @gdevenyi
A clang-11 build on 18.04 shows the same symptoms.
Too bad. But thanks @gdevenyi
Ok, I think I have found the slow piece of code. I added a bunch of timing statements to profile it and found that, in itkBSplineSyNImageRegistrationMethod.hxx, the displacement field to B-spline fitting was particularly slow.
Here is one iteration of CC image matching:
ComputeUpdateField ComputeMetricGradientField() Resampling 2:0
ComputeUpdateField ComputeMetricGradientField() this->m_Metric->Initialize():0
ComputeUpdateField ComputeMetricGradientField() GetValueAndDerivative():200
ComputeUpdateField ComputeMetricGradientField() gradientField bit:3
ComputeUpdateField ComputeMetricGradientField() 1:246
ComputeUpdateField fixedImageMasks[0]: 0
BSplineSmoother
num control points[0]: 26
num control points[1]: 34
num control points[2]: 26
Spline order:3
ComputeUpdateField bspliner->Update(): 5127
ComputeUpdateField BSplineSmoothDisplacementField() 2: 5128
The timings are in ms.
So the Displacement field to BSpline fitting is taking 5 seconds and I did notice it was pegged on 1 CPU. The code lines responsible are in ITKv5/Modules/Registration/RegistrationMethodsv4/include/itkBSplineSyNImageRegistrationMethod.hxx
at around line 453 in:
template <typename TFixedImage,
typename TMovingImage,
typename TOutputTransform,
typename TVirtualImage,
typename TPointSet>
typename BSplineSyNImageRegistrationMethod<TFixedImage, TMovingImage, TOutputTransform, TVirtualImage, TPointSet>::
DisplacementFieldPointer
BSplineSyNImageRegistrationMethod<TFixedImage, TMovingImage, TOutputTransform, TVirtualImage, TPointSet>::
BSplineSmoothDisplacementField(const DisplacementFieldType * field,
const ArrayType & numberOfControlPoints,
const WeightedMaskImageType * mask,
const BSplinePointSetType * gradientPointSet)
Here:
bspliner->SetNumberOfControlPoints(numberOfControlPoints);
bspliner->SetSplineOrder(this->m_FixedToMiddleTransform->GetSplineOrder());
bspliner->SetNumberOfFittingLevels(1);
bspliner->SetEnforceStationaryBoundary(true);
bspliner->SetEstimateInverse(false);
bspliner->Update();
SmoothDisplacementField wraps this functionality. Does it exhibit the same behavior (i.e., slow, single-threaded)? Be sure to specify an n-D vector (e.g., 10x10x10 for a 3-D displacement field) for the variance_or_mesh_size_base_level option, or it'll smooth with a Gaussian kernel.
@chrisadamsonmcri what data and spline spacing are you using to observe the slowdown?
If I recall these experiments correctly, the appropriate number of threads was created, but they did not execute in parallel as the spline resolution increased. Once the spline spacing got sufficiently small, the CPU usage approached 100%, making it appear single-threaded; in reality all the threads were there, they were just running one at a time.
Here is more information about the call I used
antsRegistration -v -d 3 -u 1 --verbose 1 --float 1 --collapse-output-transforms 1 \
--initial-moving-transform [${OUTPUTPREFIX}_alberts_to_native_affine_reg0GenericAffine.mat,0] \
--transform BSplineSyN[0.25,5,0,3] \
--metric CC[ ${OUTPUTPREFIX}_t2w_restore_brain_dilated.nii.gz,$TEMPLATEDIR/template-40_brain_dilated.nii.gz,0.2,2 ] \
--convergence [ 50x0,1e-6,100 ] --shrink-factors 2x1 --smoothing-sigmas 2x1vox --masks [${OUTPUTPREFIX}_brain_mask_dilated.nii.gz] \
--output ${OUTPUTPREFIX}_alberts_to_native_skullstrip_reg
The moving image, $TEMPLATEDIR/template-40_brain_dilated.nii.gz, is:
dim1 117
dim2 159
dim3 126
dim4 1
pixdim1 0.859375
pixdim2 0.859375
pixdim3 0.859375
The fixed image, ${OUTPUTPREFIX}_t2w_restore_brain_dilated.nii.gz, is:
dim1 182
dim2 243
dim3 177
dim4 1
pixdim1 0.629961
pixdim2 0.629961
pixdim3 0.629961
I can't send the specific images used for this test but I do get consistent behaviour across different images. Hope this helps.
@ntustison I ran SmoothDisplacementField on a previously computed warp with the same size as the fixed image in the previous message:
dim1 182
dim2 243
dim3 177
dim4 1
pixdim1 0.629961
pixdim2 0.629961
pixdim3 0.629961
pixdim4 0.000000
I used the following call:
SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1Warp.nii.gz tmp.nii.gz 26x34x26
Elapsed time: 30.6202
RMSE = 0.4206
rmse[0] = 0.262028
rmse[1] = 0.241287
rmse[2] = 0.223518
I took 26x34x26 from the number of control points in my earlier timing post. Observing top, there was a brief period at the start with >500% CPU utilisation, but it dropped to 100% for the majority of the execution.
Using fewer control points was much faster and remained more multithreaded:
SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1Warp.nii.gz tmp.nii.gz 10x10x10
Elapsed time: 6.84312
RMSE = 0.743995
rmse[0] = 0.413912
rmse[1] = 0.436904
rmse[2] = 0.43606
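To make the trend explicit, the mesh size can be swept in a small loop (a sketch; same warp as above, tmp.nii.gz is just a scratch output):
# elapsed time and CPU utilisation vs. number of control points
for mesh in 5x5x5 10x10x10 26x34x26; do
  echo "== mesh ${mesh} =="
  SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1Warp.nii.gz tmp.nii.gz ${mesh}
done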
Okay, I'd like to focus on the multi-threading (elapsed time is correlated with the number of control points). What do you mean it "remained more multi-threaded"? The number of threads isn't explicitly coded to change during the execution of the filter.
A bit more information on what I mean by that. I ran top while running the smoother. At the very start of the smoothing step all threads are active, for about 0.1 seconds:
15012 addo 20 0 2538072 567748 12368 S 28.0 0.9 0:00.85 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+
15004 addo 20 0 2538072 567748 12368 S 27.6 0.9 0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+
15005 addo 20 0 2538072 567748 12368 S 27.6 0.9 0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+
15006 addo 20 0 2538072 567748 12368 S 27.6 0.9 0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+
15007 addo 20 0 2538072 567748 12368 S 27.6 0.9 0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+
15008 addo 20 0 2538072 567748 12368 S 27.6 0.9 0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+
15009 addo 20 0 2538072 567748 12368 S 27.6 0.9 0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+
15010 addo 20 0 2538072 567748 12368 S 27.6 0.9 0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+
15015 addo 20 0 2538072 567748 12368 S 27.6 0.9 0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+
15017 addo 20 0 2538072 567748 12368 S 27.6 0.9 0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+
15019 addo 20 0 2538072 567748 12368 S 27.6 0.9 0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+
15020 addo 20 0 2538072 567748 12368 S 27.6 0.9 0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+
15021 addo 20 0 2538072 567748 12368 S 27.6 0.9 0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+
15000 addo 20 0 2538072 567748 12368 S 27.3 0.9 0:00.83 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+
15011 addo 20 0 2538072 567748 12368 S 27.3 0.9 0:00.83 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+
15016 addo 20 0 2538072 567748 12368 S 27.3 0.9 0:00.83 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+
15018 addo 20 0 2538072 567748 12368 S 27.3 0.9 0:00.83 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+
15003 addo 20 0 2538072 567748 12368 S 27.0 0.9 0:00.82 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+
15013 addo 20 0 2538072 567748 12368 S 27.0 0.9 0:00.82 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+
15014 addo 20 0 2538072 567748 12368 S 20.7 0.9 0:00.63 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1Wa
But after that there is only one active thread
15249 addo 20 0 2538072 568056 12676 R 99.7 0.9 0:07.34 SmoothDisplacem...
It isn't the case that multiple threads are active with each one taking up very little CPU. Perhaps there is a loop that is running in serial? I'll try to put some timing information in the filter itself and will report back.
Perhaps there is a loop that is running in serial?
Nothing that should take significant time. The displacement field B-spline smoothing filter is basically a wrapper for the more generalized B-spline approximation filter. A for loop in the former iterates through the input field and collects the data into a point-set data structure, which is then handed to the latter for fitting and sampling of the B-spline object. In the latter, both the approximation part and the reconstruction part should be explicitly multi-threaded.
The slow loop is in ITKv5/Modules/Filtering/ImageGrid/include/itkBSplineScatteredDataPointSetToImageFilter.hxx, in BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::UpdatePointSet(), at around line 990:
while (ItIn != this->m_InputPointData->End())
{
PointType point;
point.Fill(0.0);
input->GetPoint(ItIn.Index(), &point);
for (unsigned int i = 0; i < ImageDimension; ++i)
{
U[i] = static_cast<RealType>(totalNumberOfSpans[i]) * static_cast<RealType>(point[i] - this->m_Origin[i]) /
(static_cast<RealType>(this->m_Size[i] - 1) * this->m_Spacing[i]);
if (itk::Math::abs(U[i] - static_cast<RealType>(totalNumberOfSpans[i])) <= epsilon[i])
{
U[i] = static_cast<RealType>(totalNumberOfSpans[i]) - epsilon[i];
}
if (U[i] < NumericTraits<RealType>::ZeroValue() && itk::Math::abs(U[i]) <= epsilon[i])
{
U[i] = NumericTraits<RealType>::ZeroValue();
}
if (U[i] < NumericTraits<RealType>::ZeroValue() || U[i] >= static_cast<RealType>(totalNumberOfSpans[i]))
{
itkExceptionMacro("The collapse point component "
<< U[i] << " is outside the corresponding parametric domain of [0, " << totalNumberOfSpans[i]
<< ").");
}
}
for (int i = ImageDimension - 1; i >= 0; i--)
{
if (Math::NotExactlyEquals(U[i], currentU[i]))
{
for (int j = i; j >= 0; j--)
{
this->CollapsePhiLattice(collapsedPhiLattices[j + 1], collapsedPhiLattices[j], U[j], j);
currentU[j] = U[j];
}
break;
}
}
this->m_OutputPointData->CastToSTLContainer()[ItIn.Index()] = collapsedPhiLattices[0]->GetPixel(startPhiIndex);
++ItIn;
}
This loop runs in serial. I ran this command:
~/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1Warp.nii.gz tmp.nii.gz 26x34x26
Smoothing starting
DisplacementFieldToBSplineImageFilter<TInputImage, TInputPointSet, TOutputImage>::GenerateData()
bsplinePhysicalDomainField: 0
if (this->m_EnforceStationaryBoundary && !this->m_UseInputFieldToDefineTheBSplineDomain): 33
if (inputField) 565
if (inputPointSet) 0
bspliner setup: 0
void BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData()
BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData(); output->Allocate(): 0
BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData(); this->AfterThreadedGenerateData() 954
this->UpdatePointSet(); for (unsigned int i = 0; i < ImageDimension; ++i) 0
this->m_InputPointData->Size() 8064504
this->UpdatePointSet(); while (ItIn != this->m_InputPointData->End()) 28935ms
The output is all the timings I added in. The final 28935ms is the total elapsed time of the loop.
This is interesting. I have a pretty good guess as to what's happening (although not the "why"). But first can you do me a favor and send me the screen dump from the following (with this image):
N4BiasFieldCorrection -d 3 -i mprage.nii -v 1
using ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=4?
As a follow-up, I'll explain what I think is happening here. In this section of code, the B-spline object is being evaluated at each input parametric point. This is a relatively computationally intense process, as one has to compute the weighted average of the local neighborhood control point values and the corresponding B-spline basis functions evaluated at that parametric point. For this situation involving 3-D B-spline objects and third-order B-splines, evaluation at each point involves the sum of 4 x 4 x 4 = 64 control point * basis function products (cf. the corresponding function I wrote in ITK). One can imagine that doing this at each voxel in an image can be quite time consuming.
However, for computational purposes in the BSplineScatteredDataApproximation filter, we took advantage of the fact that the B-spline parametric domain (i.e., the ITK image) is a rectilinear grid, and iterating along the same row or column keeps the parametric values constant in all but one of the parametric directions, so that evaluation then requires only the weighted average of 4 control point * basis function products, with some minor overhead when we change slices or rows during the iteration. That overhead is what the CollapsePhiLattice function handles, and the NotExactlyEquals check is used to determine whether we've moved to the next row/column or slice, necessitating recalculation of the basis functions. If this check is erroneously met at every single point, then evaluation reduces to the naive (i.e., more intense) computation.
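In symbols (a sketch of the idea rather than the exact ITK variable names), the naive cubic 3-D evaluation at a parametric point u = (u_1, u_2, u_3) is
\[
v(u_1,u_2,u_3) \;=\; \sum_{i=0}^{3}\sum_{j=0}^{3}\sum_{k=0}^{3} B_i(u_1)\,B_j(u_2)\,B_k(u_3)\,P_{ijk} \qquad \text{(64 products)}
\]
whereas if only u_1 has changed since the previous voxel, the lattice already collapsed along the other two directions can be reused and the evaluation reduces to
\[
v(u_1,u_2,u_3) \;=\; \sum_{i=0}^{3} B_i(u_1)\,\tilde{P}_i \qquad \text{(4 products, with } \tilde{P} \text{ the collapsed lattice)}.
\]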
I'd like to see what running N4 produces given that the same issue would apply.
Will do. Is there any code that you would like timed, or any particular debug information printed?
Yeah, just let me know the elapsed time given at the end of the screen dump from the command posted above.
Elapsed time: 16.0856. CPU utilisation was consistently about 350%.
I also ran the N4 command with the above parameters, and it uses about 350% CPU with 4 threads on my Intel Mac (2.3 GHz Quad-Core Intel Core i7).
Elapsed time: 28.4026
N4BiasFieldCorrection -d 3 -i mprage_hippmapp3r.nii.gz -v 1 100.36s user 0.73s system 353% cpu 28.623 total
If I reduce the spline distance to 5mm, as used in the registration call from @chrisadamsonmcri above, the CPU utilization drops slightly, but isn't bad:
Elapsed time: 16.6224
N4BiasFieldCorrection -d 3 -b [ 5 ] -c [ 100,0 ] -i mprage_hippmapp3r.nii.gz 54.78s user 0.69s system 328% cpu 16.862 total
But if I drop the spline distance to 1.5mm (as in the OP's command at the top of this issue), the utilization is reduced more substantially:
Elapsed time: 45.3176
N4BiasFieldCorrection -d 3 -b [ 1.5 ] -c [ 100,0 ] -i mprage_hippmapp3r.nii.gz 86.11s user 2.91s system 195% cpu 45.566 total
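For completeness, the same comparison can be scripted over spline distances to watch the CPU percentage fall off (a sketch; the output file names are placeholders):
export ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=4
for dist in 5 1.5; do
  echo "== spline distance ${dist}mm =="
  time N4BiasFieldCorrection -d 3 -b [ ${dist} ] -c [ 100,0 ] -i mprage_hippmapp3r.nii.gz -o n4_b${dist}.nii.gz
done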
Ok, so I think I get what the optimisation is, @ntustison:
for (int i = ImageDimension - 1; i >= 0; i--)
{
if (Math::NotExactlyEquals(U[i], currentU[i]))
{
for (int j = i; j >= 0; j--)
{
this->CollapsePhiLattice(collapsedPhiLattices[j + 1], collapsedPhiLattices[j], U[j], j);
currentU[j] = U[j];
}
break;
}
}
In a lattice organisation of control points, most of the CollapsePhiLattice evaluations will be skipped, since when you move along one row or one column, 2 of the 3 U[j]'s are unmodified and those dimensions are skipped. I counted the times that happened and got this:
~/dev/ANTs/ANTs-build-with-timings-for-bsplinesyn/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1Warp.nii.gz tmp.nii.gz 26x34x26
DisplacementFieldToBSplineImageFilter<TInputImage, TInputPointSet, TOutputImage>::GenerateData()
bsplinePhysicalDomainField: 0
if (this->m_EnforceStationaryBoundary && !this->m_UseInputFieldToDefineTheBSplineDomain): 31
if (inputField) 553
if (inputPointSet) 0
bspliner setup: 0
void BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData()
BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData(); output->Allocate(): 0
BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData(); this->AfterThreadedGenerateData() 949
this->UpdatePointSet(); for (unsigned int i = 0; i < ImageDimension; ++i) 0
this->m_InputPointData->Size(): 8064504
CollapsePhiLatticeEvaluations: 8064504
NotEqualEvaluations: 13125704
this->UpdatePointSet(); while (ItIn != this->m_InputPointData->End()) 28531
BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData(); this->UpdatePointSet(); 28531
BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData(); if (this->m_DoMultilevel) 0
BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData(); this->m_GenerateOutputImage: 1
BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData(); this->SetPhiLatticeParametricDomainParameters(); 30
void BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData() end
bspliner->Update(); 29559
Elapsed time: 30.1587
RMSE = 0.4206
rmse[0] = 0.262028
rmse[1] = 0.241287
rmse[2] = 0.223518
NotEqualEvaluations is the count of skipped evaluations, which is indeed larger than the number of points and indicates that the optimisation is being activated. However, it seems that, without multithreading, performing that many evaluations is still slow.
Hey, thanks for the additional information. Let me mentally process this info a bit more before I take a deeper dive. Right now, one obvious question might be why UpdatePointSet() is not multi-threaded. I wrote this code over 10 years ago and I'm having difficulty remembering why, but given all the multi-threading I wrote for the surrounding code, I'm guessing there's a reason. I had a procedure today, so I'll postpone any code writing in the short term; previous attempts at sedated code writing were not very successful.
Hi! When using B-spline SyN during ANTs registration only one CPU core is working, while normal SyN behaves as it should when the ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS variable is set and used properly. I can see a short CPU load spike in the resource monitor right at the beginning of each convergence step, but 99% of the time it's single-core processing. This behavior is reproducible on a Mac running El Capitan (10.11.6) and on Ubuntu 18.04.4 LTS (GNU/Linux 5.3.0-53-generic x86_64). The Mac is running the latest ANTs version (2.3.3.dev168-g29bdf), compiled this morning; the Ubuntu machine a slightly older one (2.3.1.dev120-g2f09e), compiled in March. I've found an older, similar thread which is already closed because the problem was solved by compiling against a newer ITK, which obviously doesn't help here: [https://github.com/ANTsX/ANTs/issues/604] As B-spline SyN takes a lot more time, multi-threading would be highly appreciated! :)
Thanks and best regards, Martin