alinlab / CSI

CSI: Novelty Detection via Contrastive Learning on Distributionally Shifted Instances (NeurIPS 2020)
https://arxiv.org/abs/2007.08176

THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=700 : an illegal memory access was encountered #27

Closed leoozy closed 3 years ago

leoozy commented 3 years ago

THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=700 : an illegal memory access was encountered
Traceback (most recent call last):
  File "train.py", line 38, in <module>
    train(P, epoch, model, criterion, optimizer, scheduler_warmup, train_loader, logger=logger, **kwargs)
  File "/home/westlake/zhangjunlei/code/auto-ood/training/unsup/simclr_CSI.py", line 57, in train
    images_pair = simclr_aug(images_pair)  # transform
  File "/home/westlake/miniconda3/envs/zjl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/westlake/miniconda3/envs/zjl/lib/python3.6/site-packages/apex/parallel/distributed.py", line 560, in forward
    result = self.module(*inputs, **kwargs)
  File "/home/westlake/miniconda3/envs/zjl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/westlake/miniconda3/envs/zjl/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/westlake/miniconda3/envs/zjl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/westlake/zhangjunlei/code/auto-ood/models/transform_layers.py", line 388, in forward
    return inputs * (1 - _mask) + self.transform(inputs) * _mask
  File "/home/westlake/zhangjunlei/code/auto-ood/models/transform_layers.py", line 381, in transform
    inputs = t(inputs)
  File "/home/westlake/zhangjunlei/code/auto-ood/models/transform_layers.py", line 371, in adjust_hsv
    return RandomHSVFunction.apply(x, f_h, f_s, f_v)
  File "/home/westlake/zhangjunlei/code/auto-ood/models/transform_layers.py", line 396, in forward
    x = rgb2hsv(x)
  File "/home/westlake/zhangjunlei/code/auto-ood/models/transform_layers.py", line 40, in rgb2hsv
    hsv = torch.stack([hue, saturate, value], dim=1)
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278
NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:65, unhandled cuda error, NCCL version 2.4.8

PyTorch: 1.4, CUDA: 10.1, cuDNN: 7.6.3, Python: 3.6.2

Hello, I tried to run your code, but I got the error above. Could you help me with this?

jihoontack commented 3 years ago

I have used the same environment but did not observe any related errors.

I believe the problem might be related to the multi-GPU (distributed) training. We use apex version 0.1; you can download it from the following link: https://github.com/NVIDIA/apex
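To compare two setups like the ones discussed in this thread, a short script can print the versions that matter. This is only a minimal sketch: `environment_report` is a hypothetical helper name, not part of the CSI repository, and apex is left out of the report.

```python
import platform

import torch


def environment_report() -> dict:
    """Collect version information for comparing two setups (hypothetical helper)."""
    return {
        "python": platform.python_version(),
        "torch": torch.__version__,
        "cuda": torch.version.cuda,               # None on CPU-only builds
        "cudnn": torch.backends.cudnn.version(),  # None if cuDNN is unavailable
        "gpu_count": torch.cuda.device_count(),   # 0 when no GPU is visible
    }


if __name__ == "__main__":
    for name, value in environment_report().items():
        print(f"{name}: {value}")
```

Posting this output alongside a traceback makes it easier to tell whether two environments really are the same.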

leoozy commented 3 years ago

Namespace(K_shift=4, batch_size=1, data_path='~/data/', dataset='imagenet', epochs=1000, error_step=5, image_size=(224, 224, 3), imagenet_path='~/data/ImageNet', load_path='logs/210329imagenet_resnet18_imagenet_unsup_simclr_CSI_shift_rotation/last.model', local_rank=0, lr_init=0.1, lr_scheduler='cosine', mode='ood_pre', model='resnet18_imagenet', multi_gpu=False, n_classes=30, n_gpus=1, no_strict=False, one_class_idx=None, ood_batch_size=100, ood_dataset=['cub', 'stanford_dogs', 'flowers102', 'places365', 'food_101', 'caltech_256', 'dtd', 'pets'], ood_layer=['simclr', 'shift'], ood_samples=10, ood_score=['CSI'], optimizer='lars', print_score=True, resize_factor=0.54, resize_fix=True, resume_path=None, save_score=False, save_step=10, shift_trans=Rotation(), shift_trans_type='rotation', sim_lambda=1.0, simclr_dim=128, suffix=None, task='eval', temperature=0.5, test_batch_size=1, warmup=10, weight_decay=1e-06)
Pre-compute global statistics...
Traceback (most recent call last):
  File "eval.py", line 23, in <module>
    train_loader=train_loader, simclr_aug=simclr_aug)
  File "/home/westlake/zhangjunlei/code/auto-ood/evals/ood_pre.py", line 42, in eval_ood_detection
    feats_train = get_features(P, f'{P.dataset}_train', model, train_loader, prefix=prefix, **kwargs)  # (M, T, d)
  File "/home/westlake/zhangjunlei/code/auto-ood/evals/ood_pre.py", line 147, in get_features
    simclr_aug, sample_num, layers=left)
  File "/home/westlake/zhangjunlei/code/auto-ood/evals/ood_pre.py", line 197, in _get_features
    _, output_aux = model(x_t, **kwargs)
  File "/home/westlake/miniconda3/envs/zjl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/westlake/zhangjunlei/code/auto-ood/models/base_model.py", line 27, in forward
    output = self.linear(features)
  File "/home/westlake/miniconda3/envs/zjl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/westlake/miniconda3/envs/zjl/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/westlake/miniconda3/envs/zjl/lib/python3.6/site-packages/torch/nn/functional.py", line 1370, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)

load_path=logs/210329imagenet_resnet18_imagenet_unsup_simclr_CSI_shift_rotation/last.model
CUDA_VISIBLE_DEVICES=3 python eval.py --mode ood_pre \
    --ood_score CSI \
    --print_score \
    --dataset imagenet \
    --model resnet18_imagenet \
    --shift_trans_type rotation \
    --ood_samples 10 \
    --resize_factor 0.54 \
    --resize_fix \
    --load_path ${load_path}

I tried to run the evaluation on a single GPU with the command above and got the error shown. Do you know why?

leoozy commented 3 years ago

I solved this by disabling cuDNN.
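The comment does not say how cuDNN was turned off. One common way in PyTorch is the global `torch.backends.cudnn.enabled` flag, sketched below. Where the flag is set is an assumption (near the top of `train.py` or `eval.py`, before the model is built), and `cudnn_status` is a hypothetical helper added only to confirm the setting.

```python
import torch

# Assumption: this runs once at startup, before the model is built
# or any batch is processed.
torch.backends.cudnn.enabled = False  # use PyTorch's non-cuDNN kernels instead


def cudnn_status() -> dict:
    """Report cuDNN-related state so the setting can be confirmed (hypothetical helper)."""
    return {
        "cudnn_enabled": torch.backends.cudnn.enabled,
        "cudnn_available": torch.backends.cudnn.is_available(),
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(cudnn_status())
```

Disabling cuDNN usually makes convolutions slower, and it works around the crash without explaining its cause, so it is best treated as a stopgap.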