leoozy closed this issue 3 years ago
I have used the same environment but did not observe any related errors. I believe the problem might be related to multi-GPU (distributed) training. We use apex version 0.1; you can download it from the following link: https://github.com/NVIDIA/apex
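If it helps to rule out an installation problem first, here is a small check you can run before launching training (a sketch; apex may not define __version__ in every build, hence the getattr fallback):

    # Sanity check: confirm apex and its parallel module import cleanly.
    try:
        import apex
        from apex import parallel  # the DistributedDataParallel wrapper used for multi-GPU training
        print("apex version:", getattr(apex, "__version__", "unknown"))
    except ImportError as e:
        print("apex is not installed correctly:", e)
        print("See https://github.com/NVIDIA/apex for install instructions.")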
Namespace(K_shift=4, batch_size=1, data_path='~/data/', dataset='imagenet', epochs=1000, error_step=5, image_size=(224, 224, 3), imagenet_path='~/data/ImageNet', load_path='logs/210329imagenet_resnet18_imagenet_unsup_simclr_CSI_shift_rotation/last.model', local_rank=0, lr_init=0.1, lr_scheduler='cosine', mode='ood_pre', model='resnet18_imagenet', multi_gpu=False, n_classes=30, n_gpus=1, no_strict=False, one_class_idx=None, ood_batch_size=100, ood_dataset=['cub', 'stanford_dogs', 'flowers102', 'places365', 'food_101', 'caltech_256', 'dtd', 'pets'], ood_layer=['simclr', 'shift'], ood_samples=10, ood_score=['CSI'], optimizer='lars', print_score=True, resize_factor=0.54, resize_fix=True, resume_path=None, save_score=False, save_step=10, shift_trans=Rotation(), shift_trans_type='rotation', sim_lambda=1.0, simclr_dim=128, suffix=None, task='eval', temperature=0.5, test_batch_size=1, warmup=10, weight_decay=1e-06)
Pre-compute global statistics...
Traceback (most recent call last):
  File "eval.py", line 23, in <module>
    ...
RuntimeError: CUDA error when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)

This is the command I ran:

load_path=logs/210329imagenet_resnet18_imagenet_unsup_simclr_CSI_shift_rotation/last.model
CUDA_VISIBLE_DEVICES=3 python eval.py --mode ood_pre \
    --ood_score CSI \
    --print_score \
    --dataset imagenet \
    --model resnet18_imagenet \
    --shift_trans_type rotation \
    --ood_samples 10 \
    --resize_factor 0.54 \
    --resize_fix \
    --load_path ${load_path}
I tried to run the evaluation on a single GPU; the error is shown above. Do you know why?
I solved this by disabling cuDNN.
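For anyone who lands here later, this is what "disabling cuDNN" looks like in PyTorch; a minimal sketch, to be placed before any model or data touches the GPU (e.g. near the top of train.py / eval.py):

    import torch

    # Disable cuDNN entirely; convolutions fall back to PyTorch's native CUDA kernels.
    torch.backends.cudnn.enabled = False

    # A softer alternative that is sometimes enough: keep cuDNN but disable its
    # autotuner and force deterministic algorithms.
    # torch.backends.cudnn.benchmark = False
    # torch.backends.cudnn.deterministic = True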
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=700 : an illegal memory access was encountered
Traceback (most recent call last):
  File "train.py", line 38, in <module>
    train(P, epoch, model, criterion, optimizer, scheduler_warmup, train_loader, logger=logger, **kwargs)
  File "/home/westlake/zhangjunlei/code/auto-ood/training/unsup/simclr_CSI.py", line 57, in train
    images_pair = simclr_aug(images_pair)  # transform
  File "/home/westlake/miniconda3/envs/zjl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/westlake/miniconda3/envs/zjl/lib/python3.6/site-packages/apex/parallel/distributed.py", line 560, in forward
    result = self.module(*inputs, **kwargs)
  File "/home/westlake/miniconda3/envs/zjl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/westlake/miniconda3/envs/zjl/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/westlake/miniconda3/envs/zjl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/westlake/zhangjunlei/code/auto-ood/models/transform_layers.py", line 388, in forward
    return inputs * (1 - _mask) + self.transform(inputs) * _mask
  File "/home/westlake/zhangjunlei/code/auto-ood/models/transform_layers.py", line 381, in transform
    inputs = t(inputs)
  File "/home/westlake/zhangjunlei/code/auto-ood/models/transform_layers.py", line 371, in adjust_hsv
    return RandomHSVFunction.apply(x, f_h, f_s, f_v)
  File "/home/westlake/zhangjunlei/code/auto-ood/models/transform_layers.py", line 396, in forward
    x = rgb2hsv(x)
  File "/home/westlake/zhangjunlei/code/auto-ood/models/transform_layers.py", line 40, in rgb2hsv
    hsv = torch.stack([hue, saturate, value], dim=1)
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278
NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:65, unhandled cuda error, NCCL version 2.4.8
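A side note for debugging errors like this: error 700 (illegal memory access) is reported asynchronously, so the frame shown in the traceback (here rgb2hsv / torch.stack) is not necessarily where the bad access happened. Forcing synchronous kernel launches usually makes the traceback point at the kernel that actually failed. A sketch:

    import os

    # Must be set before CUDA is initialized, i.e. before torch is first imported
    # in the process (or export it in the shell instead).
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    import torch  # imported after setting the environment variable on purpose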
PyTorch: 1.4, CUDA: 10.1, cuDNN: 7.6.3, Python: 3.6.2
Hello, I tried to run your code, but I got an error. Could you help me with this?
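For reference, the version numbers listed above can be printed with a short snippet like this (a sketch; torch.backends.cudnn.version() returns an integer such as 7603 rather than the dotted form):

    import platform
    import torch

    print("PyTorch:", torch.__version__)
    print("CUDA   :", torch.version.cuda)
    print("cuDNN  :", torch.backends.cudnn.version())
    print("Python :", platform.python_version())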