bentaculum closed this pull request 1 year ago
Added making arrays contiguous to torch.Train and torch.Predict, which is necessary for conversion to torch.Tensor.
@benjamin9555 Why is the block from line 289-294 deleted?
Also, we don't think it is a good idea to force the arrays to be contiguous by default. Especially for GPU-based computation, this can add overhead from recreating contiguous arrays in memory before sending them to the GPU. At most, there should be an option to force contiguity.
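For illustration, a minimal sketch of what such an opt-in option could look like; the ensure_contiguous parameter name is hypothetical and not part of gunpowder's actual API:

import numpy as np

def prepare_input(array, ensure_contiguous=False):
    # Hypothetical opt-in flag: only force a contiguous layout when asked to.
    # np.ascontiguousarray does not copy if the array is already contiguous.
    if ensure_contiguous:
        array = np.ascontiguousarray(array)
    return array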
> @benjamin9555 Why is the block from line 289-294 deleted?
Ah, we see now that it is a duplicate from lines 265-270. Good catch!
> Also, we don't think it is a good idea to force the arrays to be contiguous by default. Especially for GPU-based computation, this can add overhead from recreating contiguous arrays in memory before sending them to the GPU. At most, there should be an option to force contiguity.
The current stable version of torch (v1.7.1) still does not allow creating a torch.Tensor from a non-contiguous numpy array. Here's a minimal example that will throw an error:
import torch
import numpy as np

a = np.zeros(2)[::-1]   # reversed view: negative stride, not contiguous
t = torch.as_tensor(a)  # raises an error in torch 1.7.1
numpy.ascontiguousarray will not create a copy in memory if the array is already contiguous, which is what we need, right?
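A quick sanity check (based on NumPy's documented behavior, not code from this PR) that ascontiguousarray only copies when it has to:

import numpy as np

a = np.zeros(2)                  # already C-contiguous
b = np.ascontiguousarray(a)
print(np.shares_memory(a, b))    # True: no copy was made

c = np.zeros(2)[::-1]            # negative-stride view, non-contiguous
d = np.ascontiguousarray(c)
print(np.shares_memory(c, d))    # False: a contiguous copy was made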
I think the question is whether, if you do torch.as_tensor(a, device='cuda'), it copies straight to the GPU without throwing an error. I tried to test this locally, but my pytorch and GPU driver versions are not in agreement. If it doesn't error, I think the question is whether it is still more efficient to make a local tensor (including copying it to contiguous memory) and asynchronously copy it to the GPU after, or to do it all at once synchronously. @funkey thoughts?
import torch
import numpy as np

a = np.zeros(2)[::-1]                  # same non-contiguous view as above
t = torch.as_tensor(a, device='cuda')
also throws an error. If we assume that torch will eventually allow tensor creation from non-contiguous arrays, then I agree with you: let's make forcing contiguity optional. Otherwise, it makes sense to me to fix it and not bother users with it.
I just thought about this a bit: converting a non-contiguous numpy array might not be trivial to implement and probably involves some overhead, as torch would have to translate numpy's indexing into its own. I don't know the details of either of them, though.
> I think the question is whether it is still more efficient to make a local tensor (including copying it to contiguous memory) and asynchronously copy it to the GPU after, or to do it all at once synchronously.
Good point. Right now we are forced to provide a contiguous numpy array to start with, so the conversion to torch.Tensor on CPU does not copy the data and should therefore be negligible for performance. To compare the remaining transfer of the torch.Tensor to the GPU, I have written the minimal example above; on my machine, sync and async transfer took the same amount of time in this isolated setting. However, in a favorable scenario, the forward pass of the neural network can already start once the raw data is on the GPU, even if the labels etc. are still being transferred asynchronously. This can matter, e.g. if you have large multi-channel labels like the shape descriptors.
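For reference, a minimal sketch (a reconstruction, not the exact benchmark) of how such a sync vs. async transfer comparison can be timed. Note that non_blocking=True only transfers asynchronously from pinned (page-locked) host memory:

import time
import torch

x = torch.zeros(64, 3, 512, 512).pin_memory()   # pinned host memory

torch.cuda.synchronize()
start = time.perf_counter()
x.to(device='cuda', non_blocking=False)          # synchronous transfer
torch.cuda.synchronize()
print('sync: ', time.perf_counter() - start)

torch.cuda.synchronize()
start = time.perf_counter()
x.to(device='cuda', non_blocking=True)           # asynchronous transfer
torch.cuda.synchronize()                         # wait for the copy to finish
print('async:', time.perf_counter() - start)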
Transferring tensors to GPU can be done asynchronously with x.to(device='cuda', non_blocking=True). In some cases I observe this to speed up a training pass. Here I compared the current gunpowder.torch.Train transfer calls to the proposed changes in an isolated example, to make sure there is no extra overhead from separating the cast to torch.tensor from the transfer to GPU.
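A sketch of the two code paths being compared, with an assumed array shape (the actual comparison is not reproduced here):

import numpy as np
import torch

a = np.ascontiguousarray(np.zeros((8, 256, 256), dtype=np.float32))

# current: one combined, blocking call that casts and transfers at once
t_combined = torch.as_tensor(a, device='cuda')

# proposed: zero-copy cast to a CPU tensor, then a separate transfer to the
# GPU (only truly asynchronous if the host memory is pinned)
t_cpu = torch.as_tensor(a)                        # shares memory with `a`
t_gpu = t_cpu.to(device='cuda', non_blocking=True)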