funkelab / gunpowder

A library to facilitate machine learning on multi-dimensional images.
https://funkelab.github.io/gunpowder/
MIT License

Transfer torch tensors to GPU asynchronously #147

Closed bentaculum closed 1 year ago

bentaculum commented 3 years ago

Transferring tensors to the GPU can be done asynchronously with x.to(device='cuda', non_blocking=True). In some cases I have observed this to speed up a training pass.

Here I compared the current gunpowder.torch.Train transfer calls to the proposed changes in an isolated example, to make sure there is no extra overhead from separating the cast to torch.Tensor from the transfer to the GPU.

import time
import torch
import numpy as np

x_np = np.random.rand(128, 128, 128, 128)
n_iter = 10

# initialize the CUDA context once up front so it does not skew the timings
torch.cuda.set_device(0)
torch.zeros((128, 128, 128, 128), device='cuda')

start = time.time()
for i in range(n_iter):
    x = torch.as_tensor(x_np).to(device='cuda', non_blocking=True)
x_back = x.detach().cpu().numpy()  # copy back to force synchronization before stopping the timer
print(f"Asynchronous: {(time.time() - start) / n_iter} s")

start = time.time()
for i in range(n_iter):
    x = torch.as_tensor(x_np, device='cuda')
x_back = x.detach().cpu().numpy()  # copy back to force synchronization before stopping the timer
print(f"Current gunpowder: {(time.time() - start) / n_iter} s")
bentaculum commented 3 years ago

Added code to torch.Train and torch.Predict to make arrays contiguous, which is necessary for the conversion to torch.Tensor.
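
For reference, a minimal sketch of what this amounts to (to_torch is an illustrative helper here, not the actual code in torch.Train/torch.Predict):

import numpy as np
import torch

def to_torch(array_np, device):
    # no-op (and no copy) if the array is already C-contiguous
    array_np = np.ascontiguousarray(array_np)
    # torch.as_tensor shares memory with a contiguous numpy array, so this does not copy either
    tensor = torch.as_tensor(array_np)
    # the transfer to the GPU can then be issued asynchronously
    return tensor.to(device=device, non_blocking=True)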

cmalinmayor commented 3 years ago

@benjamin9555 Why is the block from lines 289-294 deleted?

cmalinmayor commented 3 years ago

Also, we don't think it is a good idea to force the arrays to be contiguous by default. Especially for GPU-based computation, this can add the overhead of recreating contiguous arrays in memory before sending them to the GPU. At most, there should be an option to force contiguity.

cmalinmayor commented 3 years ago

> @benjamin9555 Why is the block from lines 289-294 deleted?

Ah, we see now that it is a duplicate of lines 265-270. Good catch!

bentaculum commented 3 years ago

> Also, we don't think it is a good idea to force the arrays to be contiguous by default. Especially for GPU-based computation, this can add the overhead of recreating contiguous arrays in memory before sending them to the GPU. At most, there should be an option to force contiguity.

The current stable version of torch, v1.7.1, still does not allow creating a torch.Tensor from a non-contiguous numpy array. Here is a minimal example that will throw an error:

import torch
import numpy as np
a = np.zeros(2)[::-1]  # reversed view: non-contiguous (negative strides)
t = torch.as_tensor(a)  # raises ValueError

numpy.ascontiguousarray will not create a copy in memory if the array is already contiguous, which is what we need, right?
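
A quick check of that claim (a minimal sketch; np.shares_memory detects whether a copy was made):

import numpy as np

a = np.zeros((2, 3))           # already C-contiguous
b = np.ascontiguousarray(a)
print(np.shares_memory(a, b))  # True: no copy was made

c = a[:, ::-1]                 # non-contiguous view
d = np.ascontiguousarray(c)
print(np.shares_memory(c, d))  # False: a contiguous copy was created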

cmalinmayor commented 3 years ago

I think the question is whether torch.as_tensor(a, device='cuda') copies the array straight to the GPU without throwing an error. I tried to test this locally, but pytorch and my local GPU driver are not in agreement on version. If it doesn't error, I think the question is whether it is still more efficient to make a local tensor (including copying it to contiguous memory) and asynchronously copy it to the GPU after, or to do it all at once synchronously. @funkey thoughts?

bentaculum commented 3 years ago
import torch
import numpy as np
a = np.zeros(2)[::-1]
t = torch.as_tensor(a, device='cuda')

also throws an error. If we assume that torch will eventually allow tensor creation from non-contiguous arrays, then I agree with you: let's make forcing contiguity optional (a quick sketch below). Otherwise, it makes sense to me to fix it and not bother users with it.
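
A hypothetical sketch of such an option (ensure_contiguous is an invented parameter name, not part of gunpowder's API):

import numpy as np
import torch

def to_tensor(array_np, device, ensure_contiguous=True):
    # with the flag on, non-contiguous inputs are copied into contiguous
    # memory; already-contiguous inputs pass through unchanged
    if ensure_contiguous:
        array_np = np.ascontiguousarray(array_np)
    return torch.as_tensor(array_np).to(device=device, non_blocking=True)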

I have thought about this a bit more: converting a non-contiguous numpy array is probably not trivial to implement and likely involves some overhead, as torch would have to translate numpy's indexing into its own. I don't know the details of either of them, though.

> I think the question is whether it is still more efficient to make a local tensor (including copying it to contiguous memory) and asynchronously copy it to the GPU after, or to do it all at once synchronously.

Good point. Right now we are forced to provide a contiguous numpy array to start with, so the conversion to torch.Tensor on CPU does not copy the data and should therefore be negligible for performance. To compare the remaining transfer of the torch.Tensor to the GPU, I wrote the minimal example above; on my machine, synchronous and asynchronous transfer took the same amount of time in this isolated setting. However, in a favorable scenario, the forward pass of the neural network can already start once the raw data is on the GPU, even while the labels etc. are still being transferred asynchronously. This can matter, e.g., if you have large multi-channel labels like the shape descriptors.
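
A minimal sketch of this overlap, assuming pinned host memory and a separate CUDA stream for the label transfer (model and loss_fn are placeholders, not gunpowder code; note that non_blocking=True only yields a truly asynchronous copy when the source tensor is in pinned memory):

import torch

model = torch.nn.Conv3d(1, 10, kernel_size=3, padding=1).cuda()  # placeholder network
loss_fn = torch.nn.MSELoss()

raw_cpu = torch.rand(1, 1, 64, 64, 64).pin_memory()      # pinned memory enables async H2D copies
labels_cpu = torch.rand(1, 10, 64, 64, 64).pin_memory()  # e.g. large multi-channel labels

copy_stream = torch.cuda.Stream()

raw = raw_cpu.to('cuda', non_blocking=True)  # enqueued on the default stream

with torch.cuda.stream(copy_stream):
    # the label transfer runs on a side stream and can overlap with the forward pass below
    labels = labels_cpu.to('cuda', non_blocking=True)

prediction = model(raw)  # default stream: does not wait for the labels copy

torch.cuda.current_stream().wait_stream(copy_stream)  # make sure labels have arrived
loss = loss_fn(prediction, labels)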