NVIDIA / MinkowskiEngine

Minkowski Engine is an auto-diff neural network library for high-dimensional sparse tensors
https://nvidia.github.io/MinkowskiEngine

Non-Reproducible Outputs on GPU when using MinkowskiConvolution and stride > 1 #554

Open renezurbruegg opened 1 year ago

renezurbruegg commented 1 year ago

Describe the bug
It seems like any MinkowskiConvolution with stride > 1 produces non-deterministic features when executed on the GPU and no shared coordinate manager is used.

Running on the CPU seems to produce deterministic outputs. The quantization behavior also seems to be non-deterministic when non-quantized tensors are passed to SparseTensor().

My network relies on the intermediate features of the U-Net architecture. Does anyone know how MinkowskiEngine can be used in a deterministic fashion?
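For reference, a sketch of the "shared coordinate manager" setup mentioned above (reusing the first tensor's manager via the coordinate_manager keyword of ME.SparseTensor, so both tensors are built on the same coordinate map and row ordering):

x1 = ME.SparseTensor(feats, coords, device=device)
# Reuse x1's manager so the second tensor sees the identical coordinate ordering.
x2 = ME.SparseTensor(feats.to(device), coords.to(device),
                     coordinate_manager=x1.coordinate_manager)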

To Reproduce

import random

import numpy as np
import torch

import MinkowskiEngine as ME
import MinkowskiEngine.MinkowskiFunctional as MF

def fix_all_seeds():
    seed = 0
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
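    # Note: on CUDA, torch.use_deterministic_algorithms(True) may additionally
    # require the CUBLAS_WORKSPACE_CONFIG=":4096:8" environment variable to be
    # set before startup (see the PyTorch reproducibility notes).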
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

class UNet(ME.MinkowskiNetwork):
    def __init__(self, in_nchannel, out_nchannel, D):
        super(UNet, self).__init__(D)
        self.block1 = ME.MinkowskiConvolution(in_channels=in_nchannel, out_channels=8, kernel_size=3, stride=1, dimension=D)
        self.block2 = ME.MinkowskiConvolution(in_channels=8, out_channels=16, kernel_size=3, stride=2, dimension=D)
        self.block2_tr = ME.MinkowskiConvolutionTranspose(in_channels=16, out_channels=16, kernel_size=3, stride=2, dimension=D)
        self.conv1_tr = ME.MinkowskiConvolution(in_channels=24, out_channels=out_nchannel, kernel_size=1, stride=1, dimension=D)

    def forward(self, x):
        data = []
        out_s1 = self.block1(x)
        data.append(out_s1)
        out = self.block2(out_s1)
        data.append(out)
        out = MF.relu(self.block2_tr(out))
        out = ME.cat(out, out_s1)
        data.append(out)
        return self.conv1_tr(out), data

device = "cuda"

net = UNet(1, 32, 3).to(device).eval()

# Load data
coords = torch.rand(512, 3) * 10
feats = torch.rand(512, 1)
coords, feats = ME.utils.sparse_quantize(coords, feats, quantization_size=1)

coords, feats = ME.utils.sparse_collate(coords=[coords], feats=[feats])

# Run twice
fix_all_seeds()
out1, aux1 = net(ME.SparseTensor(feats, coords, device=device))
fix_all_seeds()
out2, aux2 = net(ME.SparseTensor(feats, coords, device=device))

# Compare Error
for i, (t1, t2) in enumerate(zip(aux1, aux2)):
    print("Feature Error Layer:", i, torch.max(torch.abs(t1.features - t2.features)).item())

print("----")
# Without previous quantization
coords = torch.rand(512, 3) * 10
feats = torch.rand(512, 1)
# coords, feats = ME.utils.sparse_quantize(coords, feats, quantization_size=1)
coords, feats = ME.utils.sparse_collate(coords=[coords], feats=[feats])

# Run twice
fix_all_seeds()
out1, aux1 = net(ME.SparseTensor(feats, coords, device=device))
fix_all_seeds()
out2, aux2 = net(ME.SparseTensor(feats, coords, device=device))

for i, (t1, t2) in enumerate(zip(aux1, aux2)):
    print("Feature Error Layer:", i, torch.max(torch.abs(t1.features - t2.features)).item())

This prints:

Feature Error Layer: 0 0.0
Feature Error Layer: 1 0.35769572854042053
Feature Error Layer: 2 0.0
----
Feature Error Layer: 0 0.9394447803497314
Feature Error Layer: 1 0.3751360774040222
Feature Error Layer: 2 0.9394447803497314

Expected behavior
The error should be zero for each layer.


Desktop (please complete the following information):

==========System==========
Linux-5.4.0-153-generic-x86_64-with-glibc2.31
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.4 LTS"
3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:26:04) [GCC 10.4.0]
==========Pytorch==========
1.13.1+cu117
torch.cuda.is_available(): True
==========NVIDIA-SMI==========
/usr/bin/nvidia-smi
Driver Version 525.125.06
CUDA Version 12.0
VBIOS Version 94.04.3F.00.C5
Image Version G001.0000.03.03
GSP Firmware Version N/A
==========NVCC==========
sh: 1: nvcc: not found
==========CC==========
/usr/bin/c++
c++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
==========MinkowskiEngine==========
0.5.4
MinkowskiEngine compiled with CUDA Support: True
NVCC version MinkowskiEngine is compiled: 11070
CUDART version MinkowskiEngine is compiled: 11070



WilliamHBW commented 1 year ago

Hi, I also have the same problem with a MinkowskiConvolution with kernel size 2 and stride 2. I observed that the layer randomly produces one of two different outputs (one with higher probability). The problem also seems to occur with MinkowskiGenerativeConvolutionTranspose. Do you have any clue about this?

renezurbruegg commented 1 year ago

So far I have not been able to fix the randomness on the GPU. The only thing that makes the outputs deterministic is running inference on the CPU, which is not a feasible option for me.

I am unsure whether this randomness was introduced in one of the latest versions. The issue seems quite severe, since it makes most work relying on MinkowskiEngine non-reproducible.
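For completeness, the CPU fallback amounts to something like this (a sketch):

# Deterministic but slow: run inference on the CPU instead of the GPU.
net_cpu = net.cpu()
out_cpu, aux_cpu = net_cpu(ME.SparseTensor(feats, coords, device="cpu"))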

WilliamHBW commented 1 year ago

After several experiments, I might have found a possible solution to this problem. After adding a sort function on the output of the convolution layer, which sorts the sparse tensor by its coordinates in some deterministic order (for example, mapping a coordinate (x, y, z) to the key x*(max**3) + y*(max**2) + z*(max**1), where max = max(x, y, z)), I can reproduce the results on CUDA. I think the likely reason is that MinkowskiEngine distributes the sparse tensor across CUDA kernels in a non-deterministic order and synchronizes the results to produce the output, and this synchronization can yield a non-reproducible row ordering. However, I am not sure how this affects training, and I would appreciate it if someone could give a more precise explanation.

renezurbruegg commented 1 year ago

Amazing! Would you mind sharing the sorting code? This seems like a good solution to at least have consistent results at inference time.

WilliamHBW commented 1 year ago

Here is the sorting code I used.

def array2vector(array, step):
    # Flatten each N-dim coordinate row into a single sortable integer key.
    array, step = array.long().cpu(), step.long().cpu()
    vector = sum([array[:, i] * (step ** i) for i in range(array.shape[-1])])
    return vector

def sort_sparse_tensor(sparse_tensor):
    # Rebuild the sparse tensor with its rows sorted by the coordinate key.
    indices_sort = np.argsort(array2vector(sparse_tensor.C.cpu(),
                                           sparse_tensor.C.cpu().max() + 1))
    sparse_tensor_sort = ME.SparseTensor(features=sparse_tensor.F[indices_sort],
                                         coordinates=sparse_tensor.C[indices_sort],
                                         tensor_stride=sparse_tensor.tensor_stride[0],
                                         device=sparse_tensor.device)
    return sparse_tensor_sort
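For example (a sketch using the sort_sparse_tensor helper above), inside UNet.forward from the original report the sort would be applied right after the strided convolution:

# Sort after the strided convolution so downstream layers always
# receive rows in the same deterministic order.
out = sort_sparse_tensor(self.block2(out_s1))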

mic-rud commented 10 months ago

Thanks @WilliamHBW, this solved my problem!

A "deterministic" MinkowskiConvolution could be implemented as follows

` class SortedMinkowskiConvolution(ME.MinkowskiConvolution):

def forward(self, input):
    # Sort the coordinates
    weights = torch.tensor([1e12, 1e8, 1e4, 1], device=input.device) 
    sortable_vals = (input.C * weights).sum(dim=1)
    sorted_coords_indices = sortable_vals.argsort()

    input = ME.SparseTensor(
        features=input.F[sorted_coords_indices],
        coordinates=input.C[sorted_coords_indices],
        tensor_stride=input.tensor_stride,
        device=input.device
    )

    output = super().forward(input)

    return output

` Interestingly, I had to wrap all MinkwoskiLayers and Activations in my model.
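Note that the weight vector above assumes four coordinate columns (the batch index plus three spatial dimensions) and coordinate values below 1e4; larger coordinate ranges would need larger weights to keep the sort key unique. A drop-in use in the U-Net from the report could look like this (a sketch, using the SortedMinkowskiConvolution defined above):

self.block2 = SortedMinkowskiConvolution(
    in_channels=8, out_channels=16, kernel_size=3, stride=2, dimension=D
)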