Closed tonyreina closed 4 years ago
Hey @tonyreina, thanks for these suggestions. The sole purpose of the pin-memory loop is to pin the torch tensors so that they can be copied to the GPU faster. It could be that it also hides some of the process-process communication overhead. Including these changes only makes sense if my intuition about hiding communication overhead is correct (in the context of a system that does not have CUDA available). If there is no performance difference, then it would be better to simply set pin_memory=False in the MultiThreadedAugmenter. I will have some time to check this myself tomorrow. I will let you know the results :-)
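For context, "pinning" here means moving the batch into page-locked host memory, which lets a later host-to-GPU copy run asynchronously. A minimal sketch of the idea, where `to_pinned_tensor` is a hypothetical helper (not part of the batchgenerators API) that degrades gracefully when torch or CUDA is absent:

```python
import numpy as np


def to_pinned_tensor(arr):
    """Hypothetical helper: turn a numpy batch into a torch tensor and pin it
    when CUDA is available, so a later .to('cuda', non_blocking=True) copy can
    overlap with computation. Returns the array unchanged if torch is missing."""
    try:
        import torch
    except ImportError:
        return arr  # no torch installed: nothing to pin
    t = torch.from_numpy(arr)
    if torch.cuda.is_available():
        t = t.pin_memory()  # page-locked host memory enables async DMA copies
    return t


batch = to_pinned_tensor(np.zeros((2, 4, 8, 8), dtype=np.float32))
```

The copy itself would then be issued with `batch.to('cuda', non_blocking=True)`, which only overlaps with compute when the source tensor is pinned.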
Sounds great. I wasn't sure of the best way to handle it so I'll go with whatever you think makes sense.
Hi Tony, the results are in. I created a small script to compare run times; I attach it to this message in case you are interested. The script simulates a training-like load with a dataloader and measures how quickly it runs through a set number of iterations. The dataloader is configured so that there is no CPU bottleneck, and I chose the simulated training load such that the process-process communication overhead should be visible. Running this on the current master (with a GPU present) gives the following results:
```
pin_memory = True
this takes 45.7458291053772 s
pin_memory = False
this takes 57.25228023529053 s
```
Indeed, the code with pin_memory runs faster than without, even though more work is being done. This suggests my theory about hiding communication overhead is correct.
To verify this, I checked out your master and reran the snippet (with `CUDA_VISIBLE_DEVICES=None`), and I verified that

```
In [2]: torch.cuda.is_available()
Out[2]: False
```
These are the results:
```
pin_memory = True
this takes 43.05331873893738 s
pin_memory = False
this takes 55.707196950912476 s
```
So yes, your way of doing this will result in better performance. I will merge your pull request now :-)
Best, Fabian
```python
from time import time

import numpy as np

from batchgenerators.dataloading import SlimDataLoaderBase, MultiThreadedAugmenter
from batchgenerators.transforms import NumpyToTensor


class DummyDataLoader(SlimDataLoaderBase):
    def __init__(self):
        data = np.random.random((2, 4, 192, 224, 192)).astype(np.float32)  # simulate a batch from BraTS
        labels = np.random.randint(0, 4, (2, 1, 192, 224, 192), dtype=np.int32)
        super().__init__((data, labels), None, 0)

    def generate_train_batch(self):
        return {'data': self._data[0], 'labels': self._data[1]}


def get_dummy_dataloader(num_processes=8, pin_memory=True):
    dl = DummyDataLoader()
    mt = MultiThreadedAugmenter(dl, NumpyToTensor(), num_processes, 2, None, pin_memory, 1)
    return mt


def run_dummy_iterations(mta, num_warmup=10, num_iters=100):
    for _ in range(num_warmup):
        _ = mta.next()
    start = time()
    for _ in range(num_iters):
        _ = mta.next()
        # simulate some training load; this takes about half a second on my PC
        a = np.random.random((2, 4, 192, 224, 192)).astype(np.float32)
    end = time()
    return end - start


if __name__ == '__main__':
    print('pin_memory = True')
    mta = get_dummy_dataloader(8, True)
    print('this takes', run_dummy_iterations(mta, 10, 100), 's')

    print('pin_memory = False')
    mta = get_dummy_dataloader(8, False)
    print('this takes', run_dummy_iterations(mta, 10, 100), 's')
```
Your contribution motivated me to always enable a thread for hiding the process-process communication bottleneck. I have pushed my updates to master, feel free to check them out. I verified that this is still compatible with nnunet CPU training.
Now we always get a performance benefit, regardless of whether pin_memory is set:
```
pin_memory = True
using pin_memory on device 0
this takes 42.54739475250244 s
pin_memory = False
abort event is set
this takes 43.718252182006836 s
```
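The dedicated communication-hiding thread can be illustrated with a plain producer-consumer pattern. This is a simplified sketch, not the actual batchgenerators implementation: a background thread keeps pulling items into a small bounded queue so that fetching overlaps with the consumer's own work.

```python
import queue
import threading


def prefetched(loader, maxsize=2):
    """Yield items from `loader`, pulled ahead of time by a background thread
    so that fetching overlaps with the consumer's work (simplified sketch)."""
    q = queue.Queue(maxsize=maxsize)
    sentinel = object()

    def worker():
        for item in loader:
            q.put(item)  # blocks when the queue is full (bounded prefetch)
        q.put(sentinel)  # signal end of iteration

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item


items = list(prefetched(iter(range(5))))
```

While the consumer processes one item, the worker thread is already waiting on the next one, which is how the inter-process transfer cost gets hidden behind the training step.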
MLPerf update to remove the dependency on CUDA from the data loader. This autodetects the presence of CUDA, so it works as before when a GPU is available and falls back to CPU data loading when CUDA is not found.
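The autodetection itself can be as simple as gating pin_memory on `torch.cuda.is_available()`. A sketch of the approach described above (the exact wiring in the PR may differ):

```python
def should_pin_memory():
    """Decide whether to pin host memory: only worthwhile when CUDA is
    actually usable. Returns False when torch is missing or CPU-only."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


# The dataloader could then be built with, e.g.:
# mt = MultiThreadedAugmenter(dl, NumpyToTensor(), 8, 2, None, should_pin_memory(), 1)
pin = should_pin_memory()
```

This keeps the GPU fast path intact while letting the same code run on CUDA-free machines without any configuration changes.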