Deep-MI / FastSurfer

PyTorch implementation of FastSurferCNN
Apache License 2.0

[Discussion] Run time discrepancy #16

Closed tashrifbillah closed 4 years ago

tashrifbillah commented 4 years ago

According to https://github.com/Deep-MI/FastSurfer#overview, (i) should take ~1 min while (ii) should take ~1 hour. On the other hand, they respectively took ~3 mins and ~3 hours for me.

So what machine/GPU did you use? My interest would be to bring the runtime down to the stipulated ones if they are at all possible.

FYI, I used this command:

time docker run --gpus all -v FastSurfer:/root/FastSurfer -v $(pwd):/root/data -v /home/tb571/freesurfer:/root/fs_license/ --rm fastsurfer:gpu --t1 /root/data/003_T1w_resampled.nii.gz --sd /root/FastSurfer/ --sid 003 --parallel --batch 4 --fs_license /root/fs_license/license.txt

And my data resolution:

[root@pnl-z840-2 fs_test]# fslinfo 003_T1w_resampled.nii.gz
data_type      FLOAT32
dim1           256
dim2           256
dim3           44
dim4           1
datatype       16
pixdim1        1.000000
pixdim2        1.000000
pixdim3        4.000000
pixdim4        1.000000
cal_max        0.0000
cal_min        0.0000
file_type      NIFTI-1+

NinjMenon commented 4 years ago

On a related note, when evaluating FastSurfer for large-scale processing using CPUs, I'm getting times of ~1.5 hours for the segmentation alone on a single CPU. If this is expected, is there a way to speed up the segmentation piece while using only the CPU, since the --parallel and --threads options apply only to the surface reconstruction pipeline?

Command:

./run_fastsurfer.sh --t1 /scratch/Niranjana/fastsurfer/inputs/212814/001.mgz --sid 212814_seg_only --sd /scratch/Niranjana/fastsurfer/outputs/ --batch 1 --no_cuda --seg /scratch/Niranjana/fastsurfer/outputs/212814_seg_only/aparc+aseg.mgz --seg_only

T1 data is 1 mm isotropic and 256 x 256 x 170 dims.

Much thanks for developing this awesome tool! In my group, a significant amount of time and resources go into manual edits of traditional FreeSurfer outputs, and we're excited to evaluate how FastSurfer can reduce our time burden :)

LeHenschel commented 4 years ago

Hey,

we use an NVIDIA Titan Xp GPU with 12 GB RAM and an Intel Xeon Gold 6154 @ 3 GHz, as specified in the paper. Maybe you can check the log file to see which step takes up so much time. We get around 10 s per network and approx. 15 s for the view aggregation on the GPU. On the CPU the segmentation takes 10-15 min. Hope this helps.

Best, Leonie

tashrifbillah commented 4 years ago

Did I get it right that you are talking about (i) FastSurferCNN step only?

tashrifbillah commented 4 years ago

Hi @NinjMenon ,

> a significant amount of time and resources go into manual edits of traditional FreeSurfer outputs

What outputs do you refer to?

> we're excited to evaluate how FastSurfer can reduce our time burden

I am also looking into this package's potential as a full FreeSurfer alternative for volumetric analysis (within ~1 minute) and surface-based thickness analysis (within only around 1 h of run time), as stipulated in their documentation. Can you let us know your findings?

NinjMenon commented 4 years ago

@tashrifbillah We often get segmentation errors due to atrophy (undercapture), overlapping pial surfaces, and disconnected areas that require several rounds of manual edits. Sometimes there are issues that we just cannot fix, and we exclude that scan from downstream analyses.

Since we have limited access to GPUs, I was wondering if there is a way to improve the runtime on CPUs with some type of multithreading.

NinjMenon commented 4 years ago

@LeHenschel : It is taking me ~ 1.5 hours on a single CPU. Here is the abridged log-file:

repos/FastSurfer$ grep -e "seconds" /fs0/Niranjana/T1/fastsurfer/outputs/209883/scripts/deep-seg.log
Axial View Tested in 2089.9220 seconds
Coronal View Tested in 1797.4549 seconds
Sagittal View Tested in 2069.5206 seconds
View Aggregation finished in 23.0727 seconds

It looks like each step is taking ~30-35 minutes.

Abridged specs:

repos/FastSurfer$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
Stepping:              4
CPU MHz:               2902.465
CPU max MHz:           3100.0000
CPU min MHz:           1200.0000
BogoMIPS:              5200.02
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              15360K

Do you have any ideas to improve the run time on CPU?

tashrifbillah commented 4 years ago

@NinjMenon, I thought you said FastSurfer has the potential to reduce your manual-edit effort. By the way, I also noticed the long run time of (i) FastSurferCNN on CPU. If that is indeed the case, I don't think we have any improvement over FreeSurfer on the CPU front. Still, I am interested to explore how the outputs of the shorter GPU run compare with those of FreeSurfer. If you at all find yourself doing that comparison, feel free to let me know.

NinjMenon commented 4 years ago

@tashrifbillah Yes, that's what we're trying to evaluate: whether using FastSurfer either completely eliminates the need for manual edits, or reduces the processing time enough that, even if we still need manual edits, the overall time required to complete one scan is reduced.

I'll try with the GPU method, if the CPU method does not have a way of being sped up.

m-reuter commented 4 years ago

Hi, CPU processing can be sped up by using many cores. E.g. on our servers it grabs all 72 cores and FastSurferCNN (3 segmentation networks and view aggregation) finishes in around 20-23 mins. When using only 6 cores it takes 1h. It is really better to use a GPU for this (1min).

tashrifbillah commented 4 years ago

Hi @m-reuter , any comment on step (ii) run time?

m-reuter commented 4 years ago

That can be found in our paper https://www.sciencedirect.com/science/article/pii/S1053811920304985 Table 3, around 100 min sequential w/o spherical registration, and 54 min when parallelizing hemis and using 4 threads.

Regarding the GPU for step (i) (CNNs + view aggregation): Leonie (thanks!) just tested on our RTX 2080 (8 GB GPU RAM, an affordable 500-600 Euro GPU) and it takes 40 seconds even with small batch sizes.

NinjMenon commented 4 years ago

> Hi, CPU processing can be sped up by using many cores. E.g. on our servers it grabs all 72 cores and FastSurferCNN (3 segmentation networks and view aggregation) finishes in around 20-23 mins. When using only 6 cores it takes 1h. It is really better to use a GPU for this (1min).

@m-reuter Ah that makes more sense. I was using only 1 core. Is there a flag to set the number of cores it can use?

I did some benchmarking on our HPC and got these runtimes:

Step                                            Time (HH:MM:SS)
seg only  - 1 GPU                               00:02:15
surf only - 1 GPU, 4 threads, parallel hemis    02:53:15
seg only  - 1 CPU                               01:30:06
surf only - 1 CPU, 4 threads, parallel hemis    02:14:34

The CPU is an Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz and the GPU is an RTX 2080 Ti.

I'm really curious about the run time differences for the surface pipeline between CPU and GPU - shouldn't the GPU version be much faster?

tashrifbillah commented 4 years ago

> surface pipeline between CPU and GPU - shouldn't the GPU version be much faster?

FreeSurfer shouldn't use any GPU; it should be CPU-only, so the run times for both should be the same. I guess they differ in your profile because of the load on the HPC when you profiled.

m-reuter commented 4 years ago

Yes, step 2 (recon-surf) does not use the GPU. Make sure nothing else is running on the same resources when you benchmark.

tashrifbillah commented 4 years ago

Question for you @NinjMenon: in your initial command you used --batch 1. Does the profile above use the same batch size? If so, increasing it should speed up inference on CPU.

As for your other comment:

> I was using only 1 core. Is there a flag to set the number of cores it can use?

I don't think you can tell how many CPUs CUDA used for step (i). I believe it by default uses all your CPUs. If I am right, then it is what it is. Unless you can get access to a GPU, there may not be any way to bring down the run time.

@m-reuter feel free to correct me if I am wrong.

NinjMenon commented 4 years ago

> Question for you @NinjMenon: in your initial command you used --batch 1. Does the profile above use the same batch size? If so, increasing it should speed up inference on CPU.

Yep I am running the same command - I will try that out. Thanks!

> As for your other comment:
>
> > I was using only 1 core. Is there a flag to set the number of cores it can use?
>
> I don't think you can tell how many CPUs CUDA used for step (i). I believe it by default uses all your CPUs. If I am right, then it is what it is. Unless you can get access to a GPU, there may not be any way to bring down the run time.
>
> @m-reuter feel free to correct me if I am wrong.

I thought @m-reuter was talking about running the segmentation on CPU alone and that it is faster when using more cores. Did I get that wrong?

tashrifbillah commented 4 years ago

In @m-reuter 's comment:

> CPU processing can be sped up by using many cores. E.g. on our servers it grabs all 72 cores

I think he meant get access to a machine with more CPUs :D

NinjMenon commented 4 years ago

Okay now I am confused - just trying to dumb this down for myself:

72 cores = 20-23 minutes
6 cores = 1 hour

And if it is grabbing all available cores on my gateway, which is definitely more than 6 right now: I get 1.5 hours. Am I missing something here? Thanks for helping me understand!

m-reuter commented 4 years ago

Could be you have a slower CPU; could be you are sharing the system with others; could be there is a scheduling system that gives you only one core even though more are free ... For fast CNN-based segmentation, I recommend getting a GPU. Also, if necessary, to cut time on CPU you could think about running the three networks in parallel on different hardware (may require some coding).

LeHenschel commented 4 years ago

Hey @NinjMenon,

this might also have to do with the PyTorch version you are using. If I interpret your initial command correctly, you are using a local install. For CPU inference you should use the CPU-optimized PyTorch build. In addition, you can set the number of cores to use via the OMP_NUM_THREADS environment variable. By default (i.e. if it is not globally set for you), the CPU PyTorch build should use all available cores.

See here for more information (this is for TensorFlow, but the same applies to PyTorch): https://software.intel.com/content/www/us/en/develop/articles/maximize-tensorflow-performance-on-cpu-considerations-and-recommendations-for-inference.html
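As a minimal sketch of this suggestion: export the variable in the shell before launching the segmentation, since PyTorch's OpenMP backend reads it at startup (the commented run_fastsurfer.sh line just mirrors the command pattern already used in this thread; the thread count of 8 is an illustrative choice, not a recommendation from the authors):

```shell
# Cap the number of OpenMP threads PyTorch may use for CPU inference.
export OMP_NUM_THREADS=8
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
# Then launch the segmentation as before, e.g.:
# ./run_fastsurfer.sh --t1 001.mgz --sid subj --sd ./out --no_cuda --seg_only
```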

Best, Leonie

NinjMenon commented 4 years ago

@m-reuter Thanks! I appreciate the answers. I will run some tests with 6-24 CPUs and see if there is a benefit. We do have a GPU, but our access to it is limited. We have a significantly higher proportion of CPUs and better access to them, which is why we wanted to evaluate the CPU version.

@LeHenschel that is super helpful thanks! I was using the local install until I got a singularity version working.

Thanks for all your help guys!

tashrifbillah commented 4 years ago

Hi all, I have made an interesting discovery with the following data:

data_type      INT16
dim1           176
dim2           256
dim3           256
dim4           1
datatype       4
pixdim1        1.000000
pixdim2        1.000000
pixdim3        1.000000
pixdim4        2.530000
cal_max        0.0000
cal_min        0.0000
file_type      NIFTI-1+

The whole FastSurfer pipeline completed in less than an hour, as stipulated in your documentation. Can anyone explain this speed compared to the previous data I posted? The only obvious difference is that my new data is of higher resolution: 176x256x256 at 1 mm isotropic, compared to 256x256x44 with 4 mm slices before.

I guess the question is twofold:

  1. To what space does FreeSurfer warp the given data?
  2. Does the 2nd step ((ii) recon-surf) take more time for low-resolution data, or for a resolution significantly different from the one FreeSurfer warps to?

m-reuter commented 4 years ago

FreeSurfer and FastSurfer are designed to work with inputs around 1 mm isotropic resolution. That means if you input anything with voxel sizes thicker than 1.5 mm, you are operating outside the supported range. Your 44-slice image probably has large voxel sizes, which will produce lots of problems in the pipeline. These result in much longer run times, e.g. in the topology fixer. Anything can happen with low-quality input.

LeHenschel commented 4 years ago

Regarding your first question: FastSurferCNN was trained with conformed images (UCHAR, 256x256x256, 1 mm voxels, and standard slice orientation). These specifications are checked in the eval.py script, and the image is automatically conformed if it does not comply. This is basically also what FreeSurfer does (equivalent to orig.mgz).

tashrifbillah commented 4 years ago

Does conformed mean warped?

LeHenschel commented 4 years ago

Well, conformed means that we basically apply the same transformation as FreeSurfer's mri_convert -c, which turns image intensity values into UCHAR, reslices images to a standard position, pads slices up to the standard 256x256x256 format, and enforces 1 mm isotropic voxel sizes. This includes some interpolation when the image is mapped/resliced to the new voxel space (RAS orientation).
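To make the checks described above concrete, here is a rough sketch of deciding whether a volume already matches the conformed format; the function name and the exact criteria are illustrative assumptions, not the actual eval.py code:

```python
import numpy as np

def needs_conforming(shape, zooms, dtype):
    """Return True if a volume does not match the conformed format
    (256x256x256, 1 mm isotropic voxels, UCHAR intensities)."""
    return not (
        tuple(shape) == (256, 256, 256)
        and np.allclose(zooms, (1.0, 1.0, 1.0))
        and np.dtype(dtype) == np.uint8
    )

# The 4 mm-thick scan from earlier in this thread would be conformed:
print(needs_conforming((256, 256, 44), (1.0, 1.0, 4.0), np.float32))   # True
# An already-conformed volume would pass through unchanged:
print(needs_conforming((256, 256, 256), (1.0, 1.0, 1.0), np.uint8))   # False
```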

tashrifbillah commented 4 years ago

that means that if you input anything with voxel sizes thicker than 1.5mm you are operating outside of the supported region.

Okay, then @m-reuter 's comment does not apply, does it?

LeHenschel commented 4 years ago

No, it still applies because your original input data is quite different from the default 1 mm voxel sizes.

tashrifbillah commented 4 years ago

How so? Shouldn't all FreeSurfer steps happen after mri_convert -c? Or are you saying that step explains the additional time?

NinjMenon commented 4 years ago

> FreeSurfer and FastSurfer are designed to work with inputs around 1 mm isotropic resolution. That means if you input anything with voxel sizes thicker than 1.5 mm, you are operating outside the supported range. Your 44-slice image probably has large voxel sizes, which will produce lots of problems in the pipeline. These result in much longer run times, e.g. in the topology fixer. Anything can happen with low-quality input.

Well, all my T1s are 256x256x170 and 1 mm isotropic. The surface piece of the pipeline still takes a little over 2 hours on a reserved machine (nothing else running) with 20 threads, parallel hemis.

tashrifbillah commented 4 years ago

Thanks for sharing your further observations, @NinjMenon. Now I am getting convinced that I got lucky with my latter MRI.

m-reuter commented 4 years ago

Hi guys to clear up a few things:

  1. Voxel sizes of 4 mm (like what you have, @tashrifbillah) are supported by neither FreeSurfer nor FastSurfer. With that input you are on your own and will most likely get poor results.
  2. The conform step reslices to 1 mm isotropic input. This is not a non-linear warp, just resampling and some rotation. Yet it will not magically create the missing information. So while the code will run, results can be expected to be poor and run times slow.
  3. Even 1 mm inputs can have low quality, e.g. old or diseased subjects tend to move more (motion artifacts), or you may have poor image contrast. These could be reasons that increase the run time. Also, I doubt that 20 threads help speed things up; the optimum is probably 4-8 threads. If hemis run in parallel, there may be passages where you have double the number of threads specified, which is why FreeSurfer uses 4 threads as default. So for @NinjMenon's 2 h it would be interesting to know what inputs were used (MPRAGEs; 3T or 1.5T scanner; which manufacturer; head coil channels). Also the question is which step is so slow (e.g. mris_fix_topology?). Most likely FreeSurfer will also be similarly slow on those images.

tashrifbillah commented 4 years ago

> These could be reasons that increase the run time.

Okay, I guess the less than 1 hour run time would be for nearly ideal data only.

> which is why FreeSurfer uses 4 threads as default.

I think the default is a single thread. When you run with -parallel, it uses 4 threads by default for each hemi, so 8 total. See this:

> Note that a couple of the hemi stages (eg. mris_sphere) make use of a tiny amount of OpenMP code, which means that for brief periods, as many as 8 cores are utilized (2 binaries running code that each make use of 4 threads).

NinjMenon commented 4 years ago

> Hi guys to clear up a few things:
>
> 1. Voxel sizes of 4 mm (like what you have, @tashrifbillah) are supported by neither FreeSurfer nor FastSurfer. With that input you are on your own and will most likely get poor results.
> 2. The conform step reslices to 1 mm isotropic input. This is not a non-linear warp, just resampling and some rotation. Yet it will not magically create the missing information. So while the code will run, results can be expected to be poor and run times slow.
> 3. Even 1 mm inputs can have low quality, e.g. old or diseased subjects tend to move more (motion artifacts), or you may have poor image contrast. These could be reasons that increase the run time. Also, I doubt that 20 threads help speed things up; the optimum is probably 4-8 threads. If hemis run in parallel, there may be passages where you have double the number of threads specified, which is why FreeSurfer uses 4 threads as default. So for @NinjMenon's 2 h it would be interesting to know what inputs were used (MPRAGEs; 3T or 1.5T scanner; which manufacturer; head coil channels). Also the question is which step is so slow (e.g. mris_fix_topology?). Most likely FreeSurfer will also be similarly slow on those images.

The scans are all MPRAGE on a 3T Philips Achieva with a 32-channel head coil. With 4 threads, no matter whether I run exclusively on a 6-, 12-, or 24-core system, I end up with a run time between 3 and 3.5 hours.

tashrifbillah commented 4 years ago

Hey @NinjMenon, I think you also want to comment on the following:

> Even 1mm inputs can have low quality, e.g. old or diseased subjects tend to move more (motion artifacts), or you may have poor image contrast. These could be reasons that increase the run time.

NinjMenon commented 4 years ago

@tashrifbillah We may have 4-5 scans that have motion, but the ones I tested do not have any motion. Image contrast is also satisfactory. We do study older adults who may or may not have certain amounts of atrophy within brain regions, but I tested scans on both ends of the spectrum, and I get the same run time in both cases.

m-reuter commented 4 years ago

Hi, thanks for those details. Everything looks good. We have very few Philips scans, so we cannot test your input image types. If you can share an image, I can run it here to see whether we have similar issues and what causes the reduced speed. If that is an option, I will let you know how you can get the image to us directly via a file drop.

Our timings (table 3 in the paper : https://www.sciencedirect.com/science/article/pii/S1053811920304985 ) were done with data from the OASIS 1 study. Maybe run a couple of OASIS cases on your system to see if you can get the same speed as we do. That would help to figure out if it is related to your images or your hardware/system.

tashrifbillah commented 4 years ago

@m-reuter, do you want my low-resolution scan anyway?

m-reuter commented 4 years ago

No, that is not supported. I mean the 1mm isotropic Phillips scans.

tashrifbillah commented 4 years ago

Fair enough. I shall profile a few of my other subjects with close to 1 mm isotropic resolution and let you know.

NinjMenon commented 4 years ago

@m-reuter Can you share instructions on how to send you our image via file drop? I just got approval to do that today.

NinjMenon commented 4 years ago

Hi @m-reuter! Sorry for the delay, but we found some interesting things as we started curating data to send to you. We have a few scans from young, healthy staff volunteers that we use for sequence calibration. The full surface pipeline ran in 46 minutes on such a scan. We then went back to our participant pool and found T1s with zero motion and very little atrophy. The run time for these was a little more than 1.5 hours. Just curious: what run times did you get with AD individuals when you were looking at group differences in your paper?

m-reuter commented 4 years ago

We got a 1 h run time there (OASIS). The question is which step is so slow for you. It could be the topology fixer (mris_fix_topology); that can take very different times depending on the initial surface. You can also try OASIS subjects to see whether you can replicate the speed.

NinjMenon commented 4 years ago

mris_fix_topology took 28.08 minutes on the no-motion, low-atrophy scans. The hemispheres finish running about 1 hour into the pipeline. The remaining steps (stats generation etc.) take an additional 20-30 minutes. I have a log file here for one of these scans. I'll be able to try one of the OASIS scans next week, and I will let you know the results.

tashrifbillah commented 4 years ago

Hi @NinjMenon ,

Two questions for you:

  1. Regarding:

> The full surface pipeline ran in 46 minutes on this scan. We went back to our participant pool, and found T1s with zero motion and very little atrophy. The run time for these was a little more than 1.5 hours.

Is the existence of atrophy different between the 46-minute and the 1.5-hour data?

  2. Regarding:

> mris_fix_topology took 28.08 minutes on the no motion low atrophy scans. The hemispheres finish running about 1 hour into the pipeline. The remaining steps (stats generation etc) take an additional 20-30 minutes.

The sum is about 2 hours. Didn't you say a little more than 1.5 hours in your previous comment?

NinjMenon commented 4 years ago

@tashrifbillah I think the effects of aging, rather than atrophy, are causing the time difference. There was very little atrophy on the 1.5-hour scan, but the aging effects are quite visible.

mris_fix_topology is one segment of running the hemispheres in parallel. The total is 1.5 hours. The log is attached to that comment.

m-reuter commented 4 years ago

Atrophy often looks like accelerated aging. Aging can induce other effects (e.g. increased motion; some motion artifacts may be small and barely visible). mris_fix_topology is a step in FreeSurfer (and FastSurfer) that can take very different amounts of time depending on the image. I have never analyzed why the speed changes. It could be that it takes longer for specific image contrasts, specific vendors, specific age groups, specific motion, etc. You will also find this variance in FreeSurfer, and it can differ between the left and right hemi. I looked at a random OASIS case and that one took 12 min for that step (single-threaded). So if you add 15 min on each hemisphere, you get half an hour of additional run time.

Background: surface topology fixing is time-intensive, and the more topological defects it finds, the longer it takes. This can be used as a criterion for image quality to some extent, meaning that if the topology fixer takes really, really long, chances are high that something is wrong or challenging with the image. (Half an hour is not yet considered really, really long.)
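For intuition on why the defect count drives run time: each hemisphere's surface should be topologically a sphere (Euler characteristic 2), and the amount of fixing work scales with how far the mesh is from that. A small sketch using the standard Euler formula; this is an illustrative estimate, not FreeSurfer's actual defect-detection code:

```python
def estimated_defects(n_vertices, n_edges, n_faces):
    """Estimate the number of topological defects (handles/holes) from a
    triangle mesh's Euler characteristic; a defect-free, sphere-like
    surface has chi = V - E + F = 2, and each handle lowers chi by 2."""
    chi = n_vertices - n_edges + n_faces
    return (2 - chi) // 2

# An icosahedron (12 vertices, 30 edges, 20 faces) has perfect sphere topology:
print(estimated_defects(12, 30, 20))  # 0
# Mesh counts with chi = 0 (torus topology) imply one handle to fix:
print(estimated_defects(16, 32, 16))  # 1
```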