ANTsX / ANTsPy

A fast medical imaging analysis library in Python with algorithms for registration, segmentation, and more.
https://antspyx.readthedocs.io
Apache License 2.0
628 stars 161 forks source link

Speed difference between binaries and python wrappers #703

Closed benoitberanger closed 2 weeks ago

benoitberanger commented 2 weeks ago

Describe the bug There is a massive difference of computation time between Terminal CLI and the Python wrappers for N4BiasFieldCorrection and DenoiseImage

From the terminal : N4BiasFieldCorrection : 22s DenoiseImage : 76s

From Python: N4BiasFieldCorrection : 33s DenoiseImage : 602s

To reproduce I use a classic 3DT2 in 0.8mm iso:

************************************************
Image name:          "in.nii"
************************************************
  Dimensions:        256 x 320 x 320
  Voxel size:        0.8 x 0.8 x 0.8
  Data strides:      [ 1 2 3 ]
  Format:            NIfTI-1.1
  Data type:         signed 16 bit integer (little endian)
  Intensity scaling: offset = 0, multiplier = 1
  Transform:                    1           0           0      -106.6
                                0           1           0       -93.8
                                0           0           1        -136
  comments:          TE=4.2e+02;Time=164035.610;phase=1

From the terminal :

cd /tmp/test_antspy
N4BiasFieldCorrection -i in.nii -o n4_in.nii -v 1
DenoiseImage -i n4_in.nii -o dn_n4_in.nii -v 1 -n Rician

From Python :

import ants

img = ants.image_read('/tmp/test_antspy/in.nii')
img = ants.n4_bias_field_correction(img, verbose=True)
img = ants.denoise_image(img, v=1)
ants.image_write(img, '/tmp/test_antspy/out.nii')

Expected behavior Since I built from source ANTs and ANTsPy, I would expect roughly the same computation. x1.5 N4BiasFieldCorrection is unexpected but ok, however x8 for DenoiseImage is very weird.

ANTsPy installation (please complete the following information):

Additional context When running both tests, I can see in htop that all 16 CPUs are running at 100%, with both Terminal CLI and Python wrappers. So it's not an obvious multi-threading problem.

cookpa commented 2 weeks ago

For DenoiseImage, it's a difference in defaults, the search radius is 2 in the CLI program and 3 in ANTsPy. If I call ants.denoise_image(img, r=2, v=1), the difference in performance is similar to that for N4.

@ntustison @stnava shall we harmonize defaults, which one to adopt? I'll go with faster (2) unless you have a preference to make antspy the standard.

ntustison commented 2 weeks ago

Yeah, I'm all for harmonizing the defaults and I'd go with what's in the original ANTs DenoiseImage.

gdevenyi commented 2 weeks ago

For denoiseimage in particular, I would hope you could normalize with the OG implementation minc_anlm from minc-toolkit-v2

cookpa commented 2 weeks ago

Is there a usage with defaults you could paste here?

ntustison commented 2 weeks ago

DenoiseImage was ported from Jose's original Matlab code. Minc code was not referenced at all.

gdevenyi commented 2 weeks ago

Well minc_anlm was written by/with Jose, given the citation has L Collins of the MNI as one of the senior authors 👍🏻

$ minc_anlm
This program implements adaptative non-local denoising algorithm published in
Jose V. Manjon, Pierrick Coupe, Luis Marti-Bonmati, D. Louis Collins, Montserrat Robles "Adaptive non-local means denoising of MR images with spatially varying noise levels" Journal of Magnetic Resonance Imaging Volume 31, Issue 1, pages 192–203, January 2010
 DOI: 10.1002/jmri.22003
cookpa commented 2 weeks ago

I profiled n4_bias_correction with the line_profiler / kernprof, the library function execution accounts for 99.4% of the execution time. So there's not a lot of work happening at the wrapper level.

ntustison commented 2 weeks ago

Yes, I realize that. But you were referring to a specific implementation in the context of defaults and that's why I clarified that it was Jose's original Matlab code.

ntustison commented 2 weeks ago

Just a bit more historical context---I happened to be invited by an MNI-adjacent friend for a get-together during MICCAI 2013 in Nagoya, Japan. Fortunately, I sat right across the table from Jose and, after discussing common interests (such our enjoyment of Luis Miguel), he realized I was "one of the ANTs guys" and he asked me if I would like to put his denoising algorithm in ANTs. I said sure and he pointed me to his Matlab code which I eventually ported to ITK-style. After the FreeSurfer folk began using the implementation in their pipeline a couple years ago, I asked Jose about the possibility of making it an ITK module and he was all for it.

cookpa commented 2 weeks ago

I think there might also be optimization differences in the ITK / ANTs compilation.

It seems CMAKE_BUILD_TYPE=Release is only set on Mac?

https://github.com/ANTsX/ANTsPy/blob/8911caac28030281c7405201b3ddf7feec457987/scripts/build_itk.sh#L4-L10

Also the ITK compilation has -O2 where an ANTs superbuild has -O3

https://github.com/ANTsX/ANTsPy/blob/8911caac28030281c7405201b3ddf7feec457987/scripts/build_itk.sh#L20-L21

benoitberanger commented 2 weeks ago

For DenoiseImage, it's a difference in defaults, the search radius is 2 in the CLI program and 3 in ANTsPy. If I call ants.denoise_image(img, r=2, v=1), the difference in performance is similar to that for N4.

Correct, I did not notice the difference with r default parameter.

Here is what I have now :

DenoiseImage r=2 r=3
CLI 76s 179s
Python 248s 602s
cookpa commented 2 weeks ago

Thanks for testing, @benoitberanger

Would you mind trying out #705 ? If you have the Github CLI, you can do

gh pr checkout 705

It appears to close the gap on my Mac.

benoitberanger commented 2 weeks ago

Using 550400a043773ea9c97bb9b2e73caf935d0f3f98 in #705 here is the new computation times :

DenoiseImage r=2
CLI 76s
Python 73s

So it went from 248s to 73s !

cookpa commented 2 weeks ago

Wow! Thanks for reporting this