Closed gdevenyi closed 6 years ago
Perhaps you should do some investigation. There are a number of factors which could potentially be the cause, e.g.,
Please let us know what you find.
Answers:
Since search radius is different, test was re-run, marginal increase in speed, but still massively slow in comparison.
MINC-based script timings/resource usage:
Command being timed: "./test.sh 20170313_100217T1ws002a1001.nii.gz 20170313_100217T1ws002a1001.convert.denoise.nii"
User time (seconds): 218.33
System time (seconds): 2.24
Percent of CPU this job got: 83%
Elapsed (wall clock) time (h:mm:ss or m:ss): 4:24.61
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1655544
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 2
Minor (reclaiming a frame) page faults: 461868
Voluntary context switches: 2803
Involuntary context switches: 34796
Swaps: 0
File system inputs: 648
File system outputs: 326072
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
DenoiseImage
Command being timed: "DenoiseImage -d 3 -r 2 -i 20170313_100217T1ws002a1001.nii.gz -n Rician -o denoise.nii.gz --verbose"
User time (seconds): 23607.77
System time (seconds): 12.36
Percent of CPU this job got: 607%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:04:45
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 986676
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 247920
Voluntary context switches: 4252
Involuntary context switches: 3585308
Swaps: 0
File system inputs: 0
File system outputs: 171128
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
There were a couple issues but I don't know if that accounts for what you're seeing. If not, perhaps it's some ITK overhead or something particular to your system.
minc_anlm uses patch preselection based on the patch mean and variation, see: https://github.com/BIC-MNI/EZminc/blob/master/mincnlm/minc_anlm.cpp#L604 That speeds things up.
You can also check the speed on reimplementaton using ITK called itk_minc_nonlocal_filter
- it doesn't use preselection by default, source code is here: https://github.com/vfonov/patch_morphology/tree/master/src (it's included in minc-toolkit-v2)
Thanks @vfonov. I believe we have the same check. Perhaps the parameters differ causing variation in pass-through.
Perfect. I hadn't seen that ITK reimplementation.
it was an experiment.
I concur with @gdevenyi findings. I have build ANTs yesterday, tag v2.2.0, which is before the issues solving @ntustison mentioned, in release mode with ITK and VTK fetched by CMAKE.
By curiosity, I launched the denoising script on my 2mm isotropic, 64 b=1000 directions Brain DWI (which is very small compared to what I want to denoise in the future, 1mm iso, 100 directions...). After 20 hours of computation time on my 8 last generation i7 cores, I reached a 50% progression with this command:
DenoiseImage -i dwi.nii.gz -d 4 -o dwi_ants_denoised.nii.gz -n Rician -x dwi_brain_mask.nii.gz -v 1
Running for 4-dimensional images.
*********10*********20*********30*********40*********50
I thought it would be nice to let you know. I can't use your script for now, unfortunately, but I'll stay in tune for further advancements! If those issues get resolve, I could try again building the master. Thanks!
Thanks. I'll take a look. My guess is that it has something to do with ITK overhead.
Hello,
I can't claim that I completely understand the code, but looking at https://github.com/stnava/ANTs/blob/master/Utilities/itkAdaptiveNonLocalMeansDenoisingImageFilter.hxx#L363 makes me think that you implemented Blockwise algorithm from Jose with maximum block overlap (making it much slower then my implemnetation), i.e http://www.irisa.fr/vista/Papers/2008_TMI_Coupe.pdf with block stride of 1.
Also, in case of 4D images, it looks like filter is treating 4th dimension in the same way as spatial dimensions, which might be reasonable for fMRI scans, but probably not for DTI scans, where volumes are acquired with different gradient directions in arbitrary order. The time of execution here is ~ S P V B. where S is volume of search area, P - volume of patch and V - volume of the input file, B - is the block volume in the 4D case it would take ~ s^4 p^4 ( x y z t ) * p^4 time. where s- search radius and p - patch radius.
On another note, i set the itk nb of processing threads environment variable to 4 (half my nb of virtual cores) as suggested on the installation instructions.
Also, there would be no problem sharing my dataset for tests on Dropbox if you need!
Interesting @vfonov! Perhaps DWI (not DTI) scans should be treated as 3D data, using the given dimensionality parameter.
For DWI you may also want to look at a slicer module that I have as a stand alone package:
https://github.com/BRAINSia/JALMMSE
Joint anisotropic LMMSE for DWI denoising
Hans
--
From: "notifications@github.com" notifications@github.com Reply-To: stnava/ANTs reply@reply.github.com Date: Thursday, August 3, 2017 at 10:30 AM To: "nipype@noreply.github.com" ANTs@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: Re: [stnava/ANTs] DenoiseImage is unreasonably slow (#454)
Interesting @vfonov! Perhaps DWI (not DTI) scans should be treated as 3D data, using the given dimensionality parameter.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.
Yes, I meant DWI.
Pierrick also made extension for DWI: https://sites.google.com/site/pierrickcoupe/softwares/denoising-for-medical-imaging/dwi-denoising
I can't claim that I completely understand the code, but looking at https://github.com/stnava/ANTs/blob/master/Utilities/itkAdaptiveNonLocalMeansDenoisingImageFilter.hxx#L363 makes me think that you implemented Blockwise algorithm from Jose with maximum block overlap (making it much slower then my implemnetation), i.e http://www.irisa.fr/vista/Papers/2008_TMI_Coupe.pdf with block stride of 1.
@vfonov Ah, thanks. That sounds familiar and why I asked @gdevenyi about subsampling. I think I saw that non-unit stride in Jose's code and implemented the shrink factor option as a hack. Can you show me in the minc code where that's done so I can investigate further? Thanks.
So, I didn't impelement block mode in my ITK code. In the non-itk version it is done in https://github.com/BIC-MNI/EZminc/blob/master/mincnlm/minc_anlm.cpp#L566 - note that indexes are incremented by 2 - that implements hard-coded block stride of 2 (i.e speedup of 8 for denoising).
Yep, that's what I was looking for. Thanks.
@gdevenyi --- What factor speed-up do you see between the two implementations?
@ntustison my bad, I didn't catch that there was a hard-coded subsampling in the code
MINC Based, including convert to-from: Elapsed (wall clock) time (h:mm:ss or m:ss): 0:4:24.61 DenosieImage: Elapsed (wall clock) time (h:mm:ss or m:ss): 1:04:45
$ time minc_anlm mc_fonv7706.2007-06-14.Z-50_nt1g.mnc.gz test_anlm.mnc Calculating means... Image maximum:709 Calculating variances... Launching: 1 thread(s)... Done...
real 0m36.928s user 0m36.768s sys 0m0.080s
4 Threads: $ time itk_minc_nonlocal_filter --preselect --anlm --flat mc_fonv7706.2007-06-14.Z-50_nt1g.mnc.gz test_itk_anlm.mnc --clob 0% ...............................................................................................................................................................................................................................10% ...................20% ...................30% ...................40% ...................50% .....................60% ...................70% ...................80% ...................90% ....................100% real 0m51.185s user 3m0.368s sys 0m0.264s
$ time DenoiseImage -n gaussian -d 3 -i mc_fonv7706.2007-06-14.Z-50_nt1g.mnc.gz -o test_denoise.mnc
real 4m27.495s user 14m46.420s sys 0m0.652s
Thanks all. I've made some changes and will commit when I get back from my run.
Hi, I just compiled the updated version of DenoiseImage and I am experiencing a segmentation fault, but only with some images. For instance, the program runs just fine the examples mentioned in the Insight Journal paper, but it fails on the CT scan from the 3DSlicer sample data collection (CT-chest)
Here is an excerpt from a run with gdb where the problem is said to occur at line 297:
Running for 3-dimensional images.
[New Thread 0x2aaab4b21700 (LWP 25480)]
[Thread 0x2aaab4b21700 (LWP 25480) exited]
[New Thread 0x2aaab4b21700 (LWP 25481)]
[Thread 0x2aaab4b21700 (LWP 25481) exited]
[New Thread 0x2aaab4b21700 (LWP 25482)]
[Thread 0x2aaab4b21700 (LWP 25482) exited]
[New Thread 0x2aaab4b21700 (LWP 25483)]
[Thread 0x2aaab4b21700 (LWP 25483) exited]
[New Thread 0x2aaab4b21700 (LWP 25484)]
*********10*********20*******
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x2aaab4b21700 (LWP 25484)]
itk::AdaptiveNonLocalMeansDenoisingImageFilter<itk::Image<float, 3u>, itk::Image<float, 3u>, itk::Image<unsigned char, 3u> >::ThreadedGenerateData (this=<optimized out>, region=..., threadId=<optimized out>)
at .../Utilities/itkAdaptiveNonLocalMeansDenoisingImageFilter.hxx:297
297 this->m_MeanImage->GetPixel( centerNeighborhoodPatchIndex );
Thanks
What's the exact command line call? Also, what is $ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS
?
Command line
DenoiseImage -v 1 -d 3 -o test.mha -i CT-chest.nrrd
Tried with ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS
equal to 2, 4, 6 and seg-faulted.
Tried again with ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=1
and completed successfully!
Okay, check it now. https://github.com/stnava/ANTs/commit/15163229396982118cb0d5a6ade08bfc46d03996
Fixed! Thanks
Resolved.
Running:
Seems to take an excessive amount of time.
In the meantime I can do:
While DenoiseImage makes about 5% progress, sharing the same CPUs.
All the MINC commands are single threaded to top it off.
I think there may be a very significant bottleneck somewhere.
Square difference map between DenoiseImage and minc_anlm: