renezurbruegg opened this issue 1 year ago
Hi, I also have the same problem with a Minkowski convolution with kernel size 2 and stride 2. I observed that the layer randomly gives one of two different outputs (one with higher probability). The problem also seems to occur with MinkowskiGenerativeConvolutionTranspose. Do you have any clue about this?
So far I have not been able to fix the randomness on GPU. The only thing that makes it deterministic is running inference on the CPU, which is not a feasible option for me.
I am also unsure whether this randomness was introduced in one of the latest versions. The issue seems quite severe, as it makes most work relying on Minkowski Engine non-reproducible.
After several experiments, I may have found a solution to this problem. Adding a sort to the output of the convolution layer, which orders the sparse tensor by its coordinates in a deterministic way (for example, for a coordinate (x, y, z), sort by the key x*max^3 + y*max^2 + z*max^1, where max = max(x, y, z)), lets me reproduce the results on CUDA.

I think the likely cause is that Minkowski Engine dispatches the sparse tensor to different CUDA kernels non-deterministically and then synchronizes the results to assemble the output, and this synchronization can produce the rows in an unreproducible order. However, I am not sure how this could affect training, and I would appreciate it if someone could give a more precise explanation.
Amazing! Would you mind sharing the sorting code? This seems like a good solution to at least have consistent results at inference time.
Here is the sorting code I used:

```python
import numpy as np
import MinkowskiEngine as ME


def array2vector(array, step):
    # Collapse an (N, D) coordinate array into N scalar sort keys:
    # sum_i array[:, i] * step**i
    array, step = array.long().cpu(), step.long().cpu()
    vector = sum([array[:, i] * (step ** i) for i in range(array.shape[-1])])
    return vector


def sort_spare_tensor(sparse_tensor):
    # Rebuild the sparse tensor with its entries sorted by coordinate,
    # so the ordering is deterministic across runs.
    indices_sort = np.argsort(array2vector(sparse_tensor.C.cpu(),
                                           sparse_tensor.C.cpu().max() + 1))
    sparse_tensor_sort = ME.SparseTensor(features=sparse_tensor.F[indices_sort],
                                         coordinates=sparse_tensor.C[indices_sort],
                                         tensor_stride=sparse_tensor.tensor_stride[0],
                                         device=sparse_tensor.device)
    return sparse_tensor_sort
```
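For example (hypothetical usage, not from the original post; the convolution, coordinates, and channel counts below are placeholders), the sort can be applied to the output of each strided convolution at inference time:

```python
import torch
import MinkowskiEngine as ME

# Toy strided convolution and a random sparse input (illustrative only).
conv = ME.MinkowskiConvolution(8, 16, kernel_size=2, stride=2, dimension=3).cuda()
coords = torch.unique(torch.randint(0, 100, (1000, 3), dtype=torch.int32), dim=0)
coords = torch.cat([torch.zeros(len(coords), 1, dtype=torch.int32), coords], dim=1)
x = ME.SparseTensor(torch.rand(len(coords), 8).cuda(), coordinates=coords.cuda())

y = sort_spare_tensor(conv(x))  # output now has a deterministic coordinate order
```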
Thanks @WilliamHBW, this solved my problem!
A "deterministic" MinkowskiConvolution could be implemented as follows
```python
class SortedMinkowskiConvolution(ME.MinkowskiConvolution):

    def forward(self, input):
        # Sort the coordinates into a deterministic order before convolving
        weights = torch.tensor([1e12, 1e8, 1e4, 1], device=input.device)
        sortable_vals = (input.C * weights).sum(dim=1)
        sorted_coords_indices = sortable_vals.argsort()

        input = ME.SparseTensor(
            features=input.F[sorted_coords_indices],
            coordinates=input.C[sorted_coords_indices],
            tensor_stride=input.tensor_stride,
            device=input.device
        )

        output = super().forward(input)
        return output
```

Interestingly, I had to wrap all Minkowski layers and activations in my model.
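For reference, the same wrapping pattern can be applied to the other layer types. Here is a sketch of a wrapped activation, assuming the same sorting scheme as above (this is not code from the thread, just the pattern transplanted onto ME.MinkowskiReLU):

```python
import torch
import MinkowskiEngine as ME

class SortedMinkowskiReLU(ME.MinkowskiReLU):
    def forward(self, input):
        # Reorder the coordinates deterministically before applying the activation
        weights = torch.tensor([1e12, 1e8, 1e4, 1], device=input.device)
        sortable_vals = (input.C * weights).sum(dim=1)
        sorted_coords_indices = sortable_vals.argsort()
        input = ME.SparseTensor(
            features=input.F[sorted_coords_indices],
            coordinates=input.C[sorted_coords_indices],
            tensor_stride=input.tensor_stride,
            device=input.device,
        )
        return super().forward(input)
```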
Hi, when I use SortedMinkowskiConvolution in the code above, I find that with stride=2 it can confuse the coordinate manager. How can this be solved? If you could share your modified class UNet(ME.MinkowskiNetwork), it would be very helpful to me. Thank you very much.
**Describe the bug**
It seems like any MinkowskiConvolution with stride > 1 produces non-deterministic features when executed on the GPU and no shared coordinate manager is used.
Running on the CPU seems to produce deterministic outputs. The quantization behavior also seems to be non-deterministic when non-quantized tensors are passed to SparseTensor().
My network relies on the intermediate features of the U-Net architecture. Does anyone know how MinkowskiEngine can be used in a deterministic fashion?
**To Reproduce**
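A minimal sketch of the kind of script that exercises the reported behavior (all names, sizes, and channel counts here are illustrative assumptions, not the original reproduction script): run the same strided convolution twice on identical input, with a fresh SparseTensor each time so no coordinate manager is shared, and compare the features after sorting both outputs by coordinate.

```python
import torch
import MinkowskiEngine as ME

torch.manual_seed(0)

# Unique integer coordinates with a batch column, so quantization is unambiguous.
coords = torch.unique(torch.randint(0, 100, (1000, 3), dtype=torch.int32), dim=0)
coords = torch.cat([torch.zeros(len(coords), 1, dtype=torch.int32), coords], dim=1)
feats = torch.rand(len(coords), 8)

conv = ME.MinkowskiConvolution(8, 16, kernel_size=2, stride=2, dimension=3).cuda()

def run():
    # A fresh SparseTensor per call, i.e. no shared coordinate manager.
    x = ME.SparseTensor(feats.cuda(), coordinates=coords.cuda())
    return conv(x)

def sort_by_coords(st):
    # Order both outputs identically so features can be compared row by row.
    key = sum(st.C[:, i].long() * (st.C.long().max() + 1) ** i for i in range(4))
    idx = key.argsort()
    return st.C[idx], st.F[idx]

c1, f1 = sort_by_coords(run())
c2, f2 = sort_by_coords(run())
print("coords equal:", torch.equal(c1, c2))
print("max abs feature error:", (f1 - f2).abs().max().item())
```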
This prints:
**Expected behavior**
The error should be zero for each layer.
**Desktop (please complete the following information):**

```
==========System==========
Linux-5.4.0-153-generic-x86_64-with-glibc2.31
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.4 LTS"
3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:26:04) [GCC 10.4.0]
==========Pytorch==========
1.13.1+cu117
torch.cuda.is_available(): True
==========NVIDIA-SMI==========
/usr/bin/nvidia-smi
Driver Version 525.125.06
CUDA Version 12.0
VBIOS Version 94.04.3F.00.C5
Image Version G001.0000.03.03
GSP Firmware Version N/A
==========NVCC==========
sh: 1: nvcc: not found
==========CC==========
/usr/bin/c++
c++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
==========MinkowskiEngine==========
0.5.4
MinkowskiEngine compiled with CUDA Support: True
NVCC version MinkowskiEngine is compiled: 11070
CUDART version MinkowskiEngine is compiled: 11070
```