lidq92 opened this issue 2 years ago (status: Open)
Info for another machine that reproduced the randomness:
==========System==========
Linux-4.4.0-176-generic-x86_64-with-debian-buster-sid
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.4 LTS"
3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37)
[GCC 9.3.0]
==========Pytorch==========
1.9.0
torch.cuda.is_available(): True
==========NVIDIA-SMI==========
/usr/bin/nvidia-smi
Driver Version 440.33.01
CUDA Version 10.2
VBIOS Version 86.02.23.00.01
Image Version G610.0200.00.03
==========NVCC==========
/usr/local/cuda/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
==========CC==========
/usr/bin/c++
c++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
==========MinkowskiEngine==========
0.5.4
MinkowskiEngine compiled with CUDA Support: True
NVCC version MinkowskiEngine is compiled: 10020
CUDART version MinkowskiEngine is compiled: 10020
Besides, on this machine, the CPU results are identical regardless of the value set for OMP_NUM_THREADS.
I found src/3rdparty/cudf/detail/utilities/device_atomics.cuh and atomicAdd. Is this randomness introduced by the routine cublasSetAtomicsMode()? Can I easily choose another routine to get deterministic behavior during GPU training? Thanks!
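For the PyTorch side of the pipeline, the standard reproducibility switches look like the sketch below. Note that these flags cover cuBLAS and cuDNN but not MinkowskiEngine's own CUDA kernels (which, as noted above, use atomicAdd), so this is a partial measure, not a confirmed fix for this issue:

```python
import os

# Must be set before the CUDA context is created; required by cuBLAS on
# CUDA >= 10.2 for deterministic GEMM workspaces.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch

torch.manual_seed(19920517)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
# Raises an error whenever an op without a deterministic implementation runs.
torch.use_deterministic_algorithms(True)
```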
Hello,
I have the same issue with my own MinkowskiEngine model: all metric values vary between runs during training.
I tried to reproduce your issue and observe the same problem with your code. The differences seem smaller with your simple example than with the large models I use in my project.
I would be very interested to know if you have found a solution, because this is a serious problem for reproducibility even if the values remain "somewhat" similar.
Sir, have you achieved deterministic GPU training?
Randomness found during training on GPU
Randomness is found when training the ME model on GPU (A100, 2080Ti, P40, or T4) in the same environment (same machine with Ubuntu 18.04, torch==1.12.1, MinkowskiEngine==0.5.4 installed with the system Python, and CUDA 10.2).
I'm confused by this randomness problem; what might have caused it?
Thanks for your help.
Best regards, Dingquan
To Reproduce
[Skip] Reproducible on both CPU and GPU without MinkowskiEngine (i.e., torch only)
[Skip] Reproducible on CPU for the same OMP_NUM_THREADS
[To Reproduce] Not reproducible on GPU when MinkowskiEngine is used
A minimal reproducible example:
```python
import argparse

import torch
import torch.nn as nn
import MinkowskiEngine as ME


class MWEDataset(torch.utils.data.Dataset):
    def __init__(self):
        super(MWEDataset, self).__init__()
```
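The dataset body is truncated above. For the collate function below to work, each item must provide "coordinates", "features", and "labels" keys; a hypothetical minimal completion (the random-point generation is my assumption, not the author's code) would be:

```python
    # Hypothetical completion: the original __getitem__/__len__ were not
    # included in the issue. Random quantized points stand in for real data.
    def __len__(self):
        return 8

    def __getitem__(self, i):
        coordinates = torch.randint(0, 10, (100, 3))  # assumed point cloud
        features = torch.rand(100, 3)                 # 3 input channels (CHANNELS[0])
        labels = torch.rand(1)                        # scalar regression target
        return {"coordinates": coordinates, "features": features, "labels": labels}
```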
```python
def minkowski_collate_fn(list_data):
    coordinates_batch, features_batch, labels_batch = ME.utils.sparse_collate(
        [d["coordinates"] for d in list_data],
        [d["features"] for d in list_data],
        [d["labels"] for d in list_data],
        dtype=torch.float32,
    )
    # Assumed return value; the return statement was truncated in the issue.
    return {
        "coordinates": coordinates_batch,
        "features": features_batch,
        "labels": labels_batch,
    }
```
```python
def global_avg_pool(inputs):
    batch_size = torch.max(inputs.coordinates[:, 0]).item() + 1
    outputs = []
    for k in range(batch_size):
        input = inputs.features[inputs.coordinates[:, 0] == k]
        output = torch.mean(input, dim=0)
        outputs.append(output)
    outputs = torch.stack(outputs, dim=0)
    return outputs
```
```python
class MWEModel(ME.MinkowskiNetwork):
    def __init__(self, D=3, CHANNELS=[3, 3, 3, 1]):
        ME.MinkowskiNetwork.__init__(self, D)
        self.conv1 = ME.MinkowskiConvolution(CHANNELS[0], CHANNELS[1], kernel_size=3, dimension=D)
        self.bn1 = ME.MinkowskiBatchNorm(CHANNELS[1])
        self.relu1 = ME.MinkowskiReLU()
        self.pool1 = ME.MinkowskiMaxPooling(kernel_size=3, stride=2, dimension=D)
        self.gap = ME.MinkowskiGlobalAvgPooling()
        self.feature = ME.MinkowskiToFeature()
        self.fc1 = nn.Linear(CHANNELS[1], CHANNELS[2])
        self.bn = nn.BatchNorm1d(CHANNELS[2])
        self.fc2 = nn.Linear(CHANNELS[2], CHANNELS[3])
```
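The forward pass is missing from the snippet; a plausible reconstruction from the layers declared above (my guess at the wiring, not the author's code):

```python
    # Hypothetical forward pass, inferred from the declared layers.
    def forward(self, x):
        x = self.pool1(self.relu1(self.bn1(self.conv1(x))))
        x = self.feature(self.gap(x))  # sparse tensor -> dense (B, C) features
        x = self.bn(self.fc1(x))
        return self.fc2(x)
```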
```python
def run(config):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = MWEModel().to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=config.lr)
    loss_func = torch.nn.SmoothL1Loss()
    trainset = MWEDataset()
```
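The training loop itself is truncated; a hypothetical continuation consistent with the pieces above (the DataLoader wiring, loop body, and loss printout are my assumptions):

```python
    # Hypothetical training loop; the original was truncated in the issue.
    train_loader = torch.utils.data.DataLoader(
        trainset, batch_size=config.batch_size, collate_fn=minkowski_collate_fn
    )
    model.train()
    for epoch in range(config.max_epoch):
        for batch in train_loader:
            x = ME.SparseTensor(
                features=batch["features"], coordinates=batch["coordinates"], device=device
            )
            y = batch["labels"].to(device)
            optimizer.zero_grad()
            loss = loss_func(model(x), y)
            loss.backward()
            optimizer.step()
        print(f"epoch {epoch}: loss {loss.item():.6f}")  # compare across runs
```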
```python
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='MWEtestME')
    parser.add_argument("--seed", type=int, default=19920517)
    parser.add_argument('--batch_size', type=int, default=2)
    parser.add_argument('--lr', type=float, default=1e-3)
    parser.add_argument('--max_epoch', type=int, default=10)
    config = parser.parse_args()
```
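The snippet ends before seeding and invoking run; a hypothetical completion (the exact seeding calls the author used are unknown):

```python
    # Hypothetical completion: seed everything, then run (cut off in the original).
    torch.manual_seed(config.seed)
    torch.cuda.manual_seed_all(config.seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    run(config)
```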