ai-med / QuickNATv2

Fast Whole Brain Segmentation (Layers, codes and Pre-trained Models)
MIT License

not managing ~20 sec time on a P100 with 16GB memory #3

Closed egillax closed 6 years ago

egillax commented 6 years ago

First of all thank you for the great tool.

I've been playing around with your software, and the best I can manage is ~50 sec per subject with 95 frames. My GPU info is below. I think I should be getting closer to your ~20 sec. Any idea why I'm not faster? I'm using CUDA 8.0 and MATLAB 2017a on a RHEL server.

Regards, Egill Fridgeirsson

GPU info:

                  Name: 'Tesla P100-PCIE-16GB'
                 Index: 2
     ComputeCapability: '6.0'
        SupportsDouble: 1
         DriverVersion: 8
        ToolkitVersion: 8
    MaxThreadsPerBlock: 1024
      MaxShmemPerBlock: 49152
    MaxThreadBlockSize: [1024 1024 64]
           MaxGridSize: [2.1475e+09 65535 65535]
             SIMDWidth: 32
           TotalMemory: 1.7067e+10
       AvailableMemory: 1.6678e+10
   MultiprocessorCount: 56
          ClockRateKHz: 1328500
           ComputeMode: 'Default'
  GPUOverlapsTransfers: 1
KernelExecutionTimeout: 0
      CanMapHostMemory: 1
       DeviceSupported: 1
        DeviceSelected: 1

abhi4ssj commented 6 years ago

Hi Egill,

Thanks for the comment. I have also observed this variability in runtime across different setups. I think the main bottleneck is the CPU-GPU transfer. Since all three networks (coronal, axial and sagittal) are pretty large, we cannot fit all of them on the GPU at the same time. So the networks are continuously moved between GPU and CPU, which can be a bit slow depending on the PCIe link.
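One way to check whether transfers really dominate is to time the transfer and inference stages separately for each network. The following is only a minimal Python sketch of that idea, not the actual RunFile: `time_stages`, `fake_transfer` and `fake_inference` are hypothetical names, and the sleeps are placeholders for the real work. On a real GPU the calls are asynchronous, so a real measurement would also need to synchronize with the device before reading the clock.

```python
import time
from collections import defaultdict


def time_stages(stages):
    """Run (label, callable) pairs in order and sum wall-clock seconds per label."""
    totals = defaultdict(float)
    for label, fn in stages:
        start = time.perf_counter()
        fn()
        totals[label] += time.perf_counter() - start
    return dict(totals)


# Hypothetical stand-ins: the sleeps are placeholders, not measured values.
def fake_transfer():
    time.sleep(0.02)  # pretend to move one network between CPU and GPU


def fake_inference():
    time.sleep(0.01)  # pretend to run one network over the volume


stages = []
for view in ("coronal", "axial", "sagittal"):
    stages.append(("transfer", fake_transfer))
    stages.append(("inference", fake_inference))

totals = time_stages(stages)
for label, seconds in totals.items():
    print(f"{label}: {seconds:.3f} s")
```

If the "transfer" total is a large share of the runtime on one machine and not on another, that would point to the CPU-GPU link rather than the GPU itself.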

On my workstation I use a Titan Xp with Ubuntu 16.04 (not a server), where I get around 20 secs.

If you want to speed things up further in your current setup, you can use only the coronal network (the best of the three), which cuts computation by a factor of three. Performance would drop by 1-2% Dice score though, which might be acceptable.
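To illustrate the trade-off, here is a small Python sketch of combining views versus using the coronal view alone. It assumes the three networks' per-voxel class probabilities are averaged before taking the most likely class; that aggregation rule, the function name `aggregate_views` and the toy numbers are all assumptions for illustration, not taken from the QuickNAT code.

```python
def aggregate_views(view_probs, views=("coronal", "axial", "sagittal")):
    """Average per-voxel class probabilities over the chosen views, then pick the top class."""
    selected = [view_probs[v] for v in views]
    n_voxels = len(selected[0])
    labels = []
    for i in range(n_voxels):
        n_classes = len(selected[0][i])
        # Mean probability of each class for voxel i across the selected views.
        mean = [sum(p[i][c] for p in selected) / len(selected) for c in range(n_classes)]
        labels.append(max(range(n_classes), key=mean.__getitem__))
    return labels


# Toy data (made up): 2 voxels, 3 classes, one probability row per voxel.
view_probs = {
    "coronal": [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    "axial": [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]],
    "sagittal": [[0.7, 0.2, 0.1], [0.2, 0.2, 0.6]],
}

all_views = aggregate_views(view_probs)
coronal_only = aggregate_views(view_probs, views=("coronal",))
print(all_views)     # [0, 2]
print(coronal_only)  # [0, 1]
```

In this toy case the second voxel gets a different label when only the coronal view is used, which is the kind of disagreement that would show up as a small drop in Dice score.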

Let me know if you want me to create another version of the RunFile that uses only the coronal axis.

Best, Abhijit

egillax commented 6 years ago

Hi Abhijit,

Thanks for the suggestion. Using only the coronal net I reach about 15 secs. I was also very happy with the 50 secs using all three nets; I was just curious why I wasn't closer to your performance. I'll also be eagerly waiting for the Python implementation!

Thanks, Egill