NVIDIA / MinkowskiEngine

Minkowski Engine is an auto-diff neural network library for high-dimensional sparse tensors
https://nvidia.github.io/MinkowskiEngine

RuntimeError: copy_if failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered #460

Open ZixuanChen613 opened 2 years ago

ZixuanChen613 commented 2 years ago

I got this runtime error in Colab when running evaluate_4dpanoptic.py from https://github.com/PRBonn/contrastive_association. train.py and the other scripts run successfully. I think this is an error in MinkowskiEngine.


GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
Setup finished, starting evaluation
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Testing: 0it [00:00, ?it/s]Traceback (most recent call last):
  File "evaluate_4dpanoptic.py", line 83, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "evaluate_4dpanoptic.py", line 52, in main
    trainer.test(model,data.val_dataloader())
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 911, in test
    return self._call_and_handle_interrupt(self._test_impl, model, dataloaders, ckpt_path, verbose, datamodule)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 954, in _test_impl
    results = self._run(model, ckpt_path=self.tested_ckpt_path)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 1275, in _dispatch
    self.training_type_plugin.start_evaluating(self)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 206, in start_evaluating
    self._results = trainer.run_stage()
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 1286, in run_stage
    return self._run_evaluate()
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 1334, in _run_evaluate
    eval_loop_results = self._evaluation_loop.run()
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 110, in advance
    dl_outputs = self.epoch_loop.run(dataloader, dataloader_idx, dl_max_batches, self.num_dataloaders)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 122, in advance
    output = self._evaluation_step(batch, batch_idx, dataloader_idx)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 213, in _evaluation_step
    output = self.trainer.accelerator.test_step(step_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 244, in test_step
    return self.training_type_plugin.test_step(*step_kwargs.values())
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 222, in test_step
    return self.model.test_step(*args, **kwargs)
  File "/content/drive/MyDrive/Colab_Notebooks/master_thesis/contrastive_association/cont_assoc/models/ps4d_models.py", line 100, in test_step
    sem_pred, ins_pred = self(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/drive/MyDrive/Colab_Notebooks/master_thesis/contrastive_association/cont_assoc/models/ps4d_models.py", line 91, in forward
    ins_feat, n_ins, ins_ids, ins_pred, tracking_input = self.get_ins_feat(x, ins_pred, raw_features)
  File "/content/drive/MyDrive/Colab_Notebooks/master_thesis/contrastive_association/cont_assoc/models/ps4d_models.py", line 62, in get_ins_feat
    ins_feat = self.tracking_head(tracking_input)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/drive/MyDrive/Colab_Notebooks/master_thesis/contrastive_association/cont_assoc/models/contrastive_models.py", line 64, in forward
    sparse = self.sparse_tensor(coors, x['pt_features']) # torch.Size([336, 128]) ; (4,7) torch.Size([434, 128])
  File "/content/drive/MyDrive/Colab_Notebooks/master_thesis/contrastive_association/cont_assoc/models/contrastive_models.py", line 59, in sparse_tensor
    sparse = ME.SparseTensor(features=f, coordinates=c_.int(),device='cuda') # torch.Size([336, 128])
  File "/usr/local/lib/python3.7/dist-packages/MinkowskiEngine/MinkowskiSparseTensor.py", line 272, in __init__
    coordinates, features, coordinate_map_key
  File "/usr/local/lib/python3.7/dist-packages/MinkowskiEngine/MinkowskiSparseTensor.py", line 300, in initialize_coordinates
    ) = self._manager.insert_and_map(coordinates, *coordinate_map_key.get_key())
  File "/usr/local/lib/python3.7/dist-packages/MinkowskiEngine/MinkowskiCoordinateManager.py", line 179, in insert_and_map
    return self._manager.insert_and_map(coordinates, tensor_stride, string_id)
RuntimeError: copy_if failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
Exception ignored in: <function tqdm.__del__ at 0x7fe57d938170>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/tqdm/std.py", line 1087, in __del__
  File "/usr/local/lib/python3.7/dist-packages/tqdm/std.py", line 1294, in close
  File "/usr/local/lib/python3.7/dist-packages/tqdm/std.py", line 1472, in display
  File "/usr/local/lib/python3.7/dist-packages/tqdm/std.py", line 1090, in __repr__
  File "/usr/local/lib/python3.7/dist-packages/tqdm/std.py", line 1434, in format_dict
TypeError: cannot unpack non-iterable NoneType object


Steve-Tod commented 2 years ago

I got the exact same issue with different code. It seems that if I run some PyTorch computation on CUDA first, this error occurs. If I start by running Minkowski Engine code instead, the PyTorch code then raises a CUDA error. Are there any suggestions?
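Since CUDA errors are reported asynchronously, the frame that raises an illegal memory access is often not the one that caused it, which may be why the failure appears to move between PyTorch and Minkowski Engine code. A minimal debugging sketch, assuming the script can be restarted: set CUDA_LAUNCH_BLOCKING=1 before CUDA is initialized so kernel launches run synchronously and the traceback points closer to the faulting call. The helper name enable_blocking_launches is hypothetical, not part of either library.

```python
import os
import sys


def enable_blocking_launches() -> bool:
    """Set CUDA_LAUNCH_BLOCKING=1 for this process.

    Returns True if torch had not been imported yet, i.e. the variable
    was set early enough that CUDA cannot already be initialized.
    (Hypothetical helper for illustration only.)
    """
    set_early = "torch" not in sys.modules
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return set_early


# Call this at the very top of the entry script, before importing torch
# or MinkowskiEngine.
early = enable_blocking_launches()
if not early:
    print("torch is already imported; restart the process for a reliable result")
```

This slows execution, so it is meant for locating the failing call, not for regular runs.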

Steve-Tod commented 2 years ago

Somehow, changing from pytorch=1.8.1, cuda=10.1 to pytorch=1.10.1, cuda=11.3 solved the problem.
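Because the fix reported here was a version change, it can help to record which PyTorch build, CUDA build, and Minkowski Engine version are actually loaded before comparing environments. The following is a rough sketch; the function names are hypothetical, and it assumes each package exposes a `__version__` attribute (falling back to "unknown" if it does not).

```python
import importlib


def installed_version(name):
    """Return a module's __version__, "unknown" if absent, or None if it cannot be imported."""
    try:
        module = importlib.import_module(name)
    except (ImportError, OSError):
        return None
    return getattr(module, "__version__", "unknown")


def torch_cuda_build():
    """Return the CUDA version PyTorch was built against, or None (CPU-only build or no torch)."""
    try:
        import torch
    except (ImportError, OSError):
        return None
    return torch.version.cuda


def environment_report():
    """Collect the versions relevant to this issue in one dictionary."""
    return {
        "torch": installed_version("torch"),
        "torch_cuda_build": torch_cuda_build(),
        "MinkowskiEngine": installed_version("MinkowskiEngine"),
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Posting this output alongside a traceback makes it easier to tell whether two reports come from the same pytorch/cuda combination.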