Closed tonyreina closed 4 years ago
Hey @tonyreina, thanks for these suggestions. The sole purpose of the pin-memory loop is to pin the torch tensors so that they can be copied to the GPU faster. It could be that it also hides some of the process-process communication overhead. Including these changes only makes sense if my intuition about hiding communication overhead is correct (in the context of a system that does not have CUDA available). If there is no performance difference, then it would be better to simply set pin_memory=False in the MultiThreadedAugmenter. I will have some time to check this myself tomorrow. I will let you know the results :-)
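For context, "pinning" here means moving the batch into page-locked host memory, which lets a later host-to-GPU copy run asynchronously. A minimal sketch of the idea, where `to_pinned_tensor` is a hypothetical helper (not part of the batchgenerators API) that degrades gracefully when torch or CUDA is absent:

```python
import numpy as np


def to_pinned_tensor(arr):
    """Hypothetical helper: turn a numpy batch into a torch tensor and pin it
    when CUDA is available, so a later .to('cuda', non_blocking=True) copy can
    overlap with computation. Returns the array unchanged if torch is missing."""
    try:
        import torch
    except ImportError:
        return arr  # no torch installed: nothing to pin
    t = torch.from_numpy(arr)
    if torch.cuda.is_available():
        t = t.pin_memory()  # page-locked host memory enables async DMA copies
    return t


batch = to_pinned_tensor(np.zeros((2, 4, 8, 8), dtype=np.float32))
```

The copy itself would then be issued with `batch.to('cuda', non_blocking=True)`, which only overlaps with compute when the source tensor is pinned.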
Sounds great. I wasn't sure of the best way to handle it so I'll go with whatever you think makes sense.
Hi Tony, the results are in. I created a small script to compare run times; I attach it to this message in case you are interested. The script simulates a training-like load with a dataloader and measures how quickly it runs through a set number of iterations. The dataloader is configured so that there is no CPU bottleneck, and I chose the simulated training load such that the process-process communication overhead should be visible. Running this on the current master (with a GPU present) gives the following results:
```
pin_memory = True
this takes 45.7458291053772 s
pin_memory = False
this takes 57.25228023529053 s
```
Indeed, the code with pin_memory runs faster than without, even though more work is being done. This suggests my theory about hiding communication overhead is correct.
To verify this, I checked out your master and reran the snippet (with `CUDA_VISIBLE_DEVICES=None`), and I verified that

```
In [2]: torch.cuda.is_available()
Out[2]: False
```
These are the results:
```
pin_memory = True
this takes 43.05331873893738 s
pin_memory = False
this takes 55.707196950912476 s
```
So yes, your way of doing this will result in better performance. I will merge your pull request now :-)
Best, Fabian
```python
from time import time

import numpy as np

from batchgenerators.dataloading import SlimDataLoaderBase, MultiThreadedAugmenter
from batchgenerators.transforms import NumpyToTensor


class DummyDataLoader(SlimDataLoaderBase):
    def __init__(self):
        data = np.random.random((2, 4, 192, 224, 192)).astype(np.float32)  # simulate a batch from BraTS
        labels = np.random.randint(0, 4, (2, 1, 192, 224, 192), dtype=np.int32)
        super().__init__((data, labels), None, 0)

    def generate_train_batch(self):
        return {'data': self._data[0], 'labels': self._data[1]}


def get_dummy_dataloader(num_processes=8, pin_memory=True):
    dl = DummyDataLoader()
    mt = MultiThreadedAugmenter(dl, NumpyToTensor(), num_processes, 2, None, pin_memory, 1)
    return mt


def run_dummy_iterations(mta, num_warmup=10, num_iters=100):
    for _ in range(num_warmup):
        _ = mta.next()
    start = time()
    for _ in range(num_iters):
        _ = mta.next()
        # simulate some training load; this takes about half a second on my PC
        a = np.random.random((2, 4, 192, 224, 192)).astype(np.float32)
    end = time()
    return end - start


if __name__ == '__main__':
    print('pin_memory = True')
    mta = get_dummy_dataloader(8, True)
    print('this takes', run_dummy_iterations(mta, 10, 100), 's')

    print('pin_memory = False')
    mta = get_dummy_dataloader(8, False)
    print('this takes', run_dummy_iterations(mta, 10, 100), 's')
```
Your contribution motivated me to always enable a thread for hiding the process-process communication bottleneck. I have pushed my updates to master, feel free to check them out. I verified that this is still compatible with nnunet CPU training.
Now we always get a performance benefit, regardless of whether pin_memory is set:
```
pin_memory = True
using pin_memory on device 0
this takes 42.54739475250244 s
pin_memory = False
abort event is set
this takes 43.718252182006836 s
```
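The dedicated communication-hiding thread can be illustrated with a plain producer-consumer pattern. This is a simplified sketch, not the actual batchgenerators implementation: a background thread keeps pulling items into a small bounded queue so that fetching overlaps with the consumer's own work.

```python
import queue
import threading


def prefetched(loader, maxsize=2):
    """Yield items from `loader`, pulled ahead of time by a background thread
    so that fetching overlaps with the consumer's work (simplified sketch)."""
    q = queue.Queue(maxsize=maxsize)
    sentinel = object()

    def worker():
        for item in loader:
            q.put(item)  # blocks when the queue is full (bounded prefetch)
        q.put(sentinel)  # signal end of iteration

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item


items = list(prefetched(iter(range(5))))
```

While the consumer processes one item, the worker thread is already waiting on the next one, which is how the inter-process transfer cost gets hidden behind the training step.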
MLPerf update to remove the dependency on CUDA from the data loader. This autodetects the presence of CUDA, so it works as before when a GPU is available and falls back to CPU data loading when CUDA is not found.
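The autodetection itself can be as simple as gating pin_memory on `torch.cuda.is_available()`. A sketch of the approach described above (the exact wiring in the PR may differ):

```python
def should_pin_memory():
    """Decide whether to pin host memory: only worthwhile when CUDA is
    actually usable. Returns False when torch is missing or CPU-only."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


# The dataloader could then be built with, e.g.:
# mt = MultiThreadedAugmenter(dl, NumpyToTensor(), 8, 2, None, should_pin_memory(), 1)
pin = should_pin_memory()
```

This keeps the GPU fast path intact while letting the same code run on CUDA-free machines without any configuration changes.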