Duankaiwen / CenterNet

Code for our paper "CenterNet: Keypoint Triplets for Object Detection".
MIT License

train error, RuntimeError: CuDNN error: CUDNN_STATUS_MAPPING_ERROR #111

Open · guohaoyuan opened this issue 4 years ago

guohaoyuan commented 4 years ago

```
loading all datasets...
using 4 threads
loading from cache file: cache/coco_trainval2014.pkl
No cache file found...
loading annotations into memory...
Done (t=7.66s)
creating index...
index created!
82783it [00:23, 3500.57it/s]
loading annotations into memory...
Done (t=7.17s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=7.03s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=5.85s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=6.67s)
creating index...
index created!
loading from cache file: cache/coco_minival2014.pkl
No cache file found...
loading annotations into memory...
Done (t=2.40s)
creating index...
index created!
40504it [00:11, 3429.47it/s]
loading annotations into memory...
Done (t=5.35s)
creating index...
index created!
system config...
{'batch_size': 2, 'cache_dir': 'cache', 'chunk_sizes': [2], 'config_dir': 'config', 'data_dir': '../data', 'data_rng': <mtrand.RandomState object at 0x7f5488fbfea0>, 'dataset': 'MSCOCO', 'decay_rate': 10, 'display': 5, 'learning_rate': 0.00025, 'max_iter': 480000, 'nnet_rng': <mtrand.RandomState object at 0x7f5488fbff30>, 'opt_algo': 'adam', 'prefetch_size': 6, 'pretrain': None, 'result_dir': 'results', 'sampling_function': 'kp_detection', 'snapshot': 5000, 'snapshot_name': 'CenterNet-52', 'stepsize': 450000, 'test_split': 'testdev', 'train_split': 'trainval', 'val_iter': 500, 'val_split': 'minival', 'weight_decay': False, 'weight_decay_rate': 1e-05, 'weight_decay_type': 'l2'}
db config...
{'ae_threshold': 0.5, 'border': 128, 'categories': 80, 'data_aug': True, 'gaussian_bump': True, 'gaussian_iou': 0.7, 'gaussian_radius': -1, 'input_size': [511, 511], 'kp_categories': 1, 'lighting': True, 'max_per_image': 100, 'merge_bbox': False, 'nms_algorithm': 'exp_soft_nms', 'nms_kernel': 3, 'nms_threshold': 0.5, 'output_sizes': [[128, 128]], 'rand_color': True, 'rand_crop': True, 'rand_pushes': False, 'rand_samples': False, 'rand_scale_max': 1.4, 'rand_scale_min': 0.6, 'rand_scale_step': 0.1, 'rand_scales': array([0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3]), 'special_crop': False, 'test_scales': [1], 'top_k': 70, 'weight_exp': 8}
len of db: 82783
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
building model...
module_file: models.CenterNet-52
start prefetching data...
shuffling indices...
/home/yangxilab/anaconda3/envs/CenterNet/lib/python3.6/site-packages/torch/cuda/__init__.py:114: UserWarning: Found GPU0 TITAN RTX which requires CUDA_VERSION >= 9000 for optimal performance and fast startup time, but your PyTorch was compiled with CUDA_VERSION 8000. Please install the correct PyTorch binary using instructions from http://pytorch.org
  warnings.warn(incorrect_binary_warn % (d, name, 9000, CUDA_VERSION))
/home/yangxilab/anaconda3/envs/CenterNet/lib/python3.6/site-packages/torch/cuda/__init__.py:114: UserWarning: Found GPU1 TITAN RTX which requires CUDA_VERSION >= 9000 for optimal performance and fast startup time, but your PyTorch was compiled with CUDA_VERSION 8000. Please install the correct PyTorch binary using instructions from http://pytorch.org
  warnings.warn(incorrect_binary_warn % (d, name, 9000, CUDA_VERSION))
total parameters: 104844152
setting learning rate to: 0.00025
training start...
  0%|          | 0/480000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 203, in <module>
    train(training_dbs, validation_db, args.start_iter)
  File "train.py", line 138, in train
    training_loss, focal_loss, pull_loss, push_loss, regr_loss = nnet.train(**training)
  File "/home/yangxilab/GHY/GHY/CenterNet/nnet/py_factory.py", line 82, in train
    loss_kp = self.network(xs, ys)
  File "/home/yangxilab/anaconda3/envs/CenterNet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yangxilab/GHY/GHY/CenterNet/models/py_utils/data_parallel.py", line 70, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/yangxilab/GHY/GHY/CenterNet/models/py_utils/data_parallel.py", line 80, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/yangxilab/anaconda3/envs/CenterNet/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 77, in parallel_apply
    raise output
  File "/home/yangxilab/anaconda3/envs/CenterNet/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 53, in _worker
    output = module(*input, **kwargs)
  File "/home/yangxilab/anaconda3/envs/CenterNet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yangxilab/GHY/GHY/CenterNet/nnet/py_factory.py", line 20, in forward
    preds = self.model(*xs, **kwargs)
  File "/home/yangxilab/anaconda3/envs/CenterNet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yangxilab/GHY/GHY/CenterNet/nnet/py_factory.py", line 32, in forward
    return self.module(*xs, **kwargs)
  File "/home/yangxilab/anaconda3/envs/CenterNet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yangxilab/GHY/GHY/CenterNet/models/py_utils/kp.py", line 289, in forward
    return self._train(*xs, **kwargs)
  File "/home/yangxilab/GHY/GHY/CenterNet/models/py_utils/kp.py", line 193, in _train
    inter = self.pre(image)
  File "/home/yangxilab/anaconda3/envs/CenterNet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yangxilab/anaconda3/envs/CenterNet/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/home/yangxilab/anaconda3/envs/CenterNet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yangxilab/GHY/GHY/CenterNet/models/py_utils/utils.py", line 14, in forward
    conv = self.conv(x)
  File "/home/yangxilab/anaconda3/envs/CenterNet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yangxilab/anaconda3/envs/CenterNet/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: CuDNN error: CUDNN_STATUS_MAPPING_ERROR
```

Help the poor boy, please!
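The two UserWarning lines above already point at the likely cause: the TITAN RTX cards need a newer CUDA build than the CUDA 8.0 one this PyTorch binary was compiled against, and that mismatch then surfaces as `CUDNN_STATUS_MAPPING_ERROR` on the first convolution. A minimal diagnostic sketch (my own, not part of this repo) to confirm the mismatch before starting training:

```python
# Quick environment check: print what the PyTorch binary was built against
# versus what the installed GPUs actually are.
import torch

print("PyTorch:", torch.__version__)
print("Built against CUDA:", torch.version.cuda)   # the warning above reports CUDA_VERSION 8000, i.e. a CUDA 8.0 build
print("cuDNN:", torch.backends.cudnn.version())

for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    cap = torch.cuda.get_device_capability(i)
    print("GPU{}: {}, compute capability {}".format(i, name, cap))
    # TITAN RTX reports (7, 5); a CUDA 8.0 build ships no kernels for that architecture.
```

If the reported CUDA build is older than what the GPU's compute capability needs, no change to the CenterNet code or its config will fix this error.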

UmarSpa commented 4 years ago

I have the same problem.

UmarSpa commented 4 years ago

I solved it by running the code with PyTorch 1.0: https://github.com/UmarSpa/CenterNet

guohaoyuan commented 4 years ago

> I solved it by running the code with PyTorch 1.0: https://github.com/UmarSpa/CenterNet

```
loading all datasets...
using 4 threads
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=6.77s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=6.04s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=7.50s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=6.41s)
creating index...
index created!
loading from cache file: cache/coco_minival2014.pkl
loading annotations into memory...
Done (t=1.92s)
creating index...
index created!
system config...
{'batch_size': 8, 'cache_dir': 'cache', 'chunk_sizes': [4, 4], 'config_dir': 'config', 'data_dir': './data', 'data_rng': <mtrand.RandomState object at 0x7f0ec45307e0>, 'dataset': 'MSCOCO', 'decay_rate': 10, 'display': 5, 'learning_rate': 0.00025, 'max_iter': 480000, 'nnet_rng': <mtrand.RandomState object at 0x7f0ec4530828>, 'opt_algo': 'adam', 'prefetch_size': 6, 'pretrain': None, 'result_dir': 'results', 'sampling_function': 'kp_detection', 'snapshot': 5000, 'snapshot_name': 'CenterNet-52', 'stepsize': 450000, 'test_split': 'testdev', 'train_split': 'trainval', 'val_iter': 500, 'val_split': 'minival', 'weight_decay': False, 'weight_decay_rate': 1e-05, 'weight_decay_type': 'l2'}
db config...
{'ae_threshold': 0.5, 'border': 128, 'categories': 80, 'data_aug': True, 'gaussian_bump': True, 'gaussian_iou': 0.7, 'gaussian_radius': -1, 'input_size': [511, 511], 'kp_categories': 1, 'lighting': True, 'max_per_image': 100, 'merge_bbox': False, 'nms_algorithm': 'exp_soft_nms', 'nms_kernel': 3, 'nms_threshold': 0.5, 'output_sizes': [[128, 128]], 'rand_color': True, 'rand_crop': True, 'rand_pushes': False, 'rand_samples': False, 'rand_scale_max': 1.4, 'rand_scale_min': 0.6, 'rand_scale_step': 0.1, 'rand_scales': array([0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3]), 'special_crop': False, 'test_scales': [1], 'top_k': 70, 'weight_exp': 8}
len of db: 82783
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
building model...
module_file: models.CenterNet-52
shuffling indices...
total parameters: 104844152
setting learning rate to: 0.00025
training start...
  0%|          | 0/480000 [00:00<?, ?it/s]
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1544174967633/work/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument
Traceback (most recent call last):
  File "train.py", line 203, in <module>
    train(training_dbs, validation_db, args.start_iter)
  File "train.py", line 138, in train
    training_loss, focal_loss, pull_loss, push_loss, regr_loss = nnet.train(**training)
  File "/home/yangxilab/GHY/GHY/CenterNet1.0-master/nnet/py_factory.py", line 82, in train
    loss_kp = self.network(xs, ys)
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yangxilab/GHY/GHY/CenterNet1.0-master/models/py_utils/data_parallel.py", line 70, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/yangxilab/GHY/GHY/CenterNet1.0-master/models/py_utils/data_parallel.py", line 80, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yangxilab/GHY/GHY/CenterNet1.0-master/nnet/py_factory.py", line 20, in forward
    preds = self.model(*xs, **kwargs)
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yangxilab/GHY/GHY/CenterNet1.0-master/nnet/py_factory.py", line 32, in forward
    return self.module(*xs, **kwargs)
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yangxilab/GHY/GHY/CenterNet1.0-master/models/py_utils/kp.py", line 289, in forward
    return self._train(*xs, **kwargs)
  File "/home/yangxilab/GHY/GHY/CenterNet1.0-master/models/py_utils/kp.py", line 193, in _train
    inter = self.pre(image)
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yangxilab/GHY/GHY/CenterNet1.0-master/models/py_utils/utils.py", line 14, in forward
    conv = self.conv(x)
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 320, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (11) : invalid argument at /opt/conda/conda-bld/pytorch_1544174967633/work/aten/src/THC/THCGeneral.cpp:405
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "train.py", line 51, in pin_memory
    data = data_queue.get()
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 256, in rebuild_storage_fd
    fd = df.detach()
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/connection.py", line 493, in Client
    answer_challenge(c, authkey)
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/connection.py", line 737, in answer_challenge
    response = connection.recv_bytes(256)        # reject large message
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "train.py", line 51, in pin_memory
    data = data_queue.get()
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 256, in rebuild_storage_fd
    fd = df.detach()
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/connection.py", line 493, in Client
    answer_challenge(c, authkey)
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/connection.py", line 732, in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Fatal Python error: could not acquire lock for <_io.BufferedWriter name=''> at interpreter shutdown, possibly due to daemon threads

Thread 0x00007f0d33ab2700 (most recent call first):
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/threading.py", line 926 in _bootstrap_inner
  File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/threading.py", line 884 in _bootstrap

Current thread 0x00007f0f1b1a8700 (most recent call first):
Aborted (core dumped)
```

Thank you for your code! But I still meet this problem, and it does not look like it is caused by `batch_size` or `chunk_sizes`.
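A `cuda runtime error (11): invalid argument` raised in `THCGeneral.cpp` before the first convolution even finishes usually means the PyTorch build still does not match this GPU/driver combination, rather than anything in CenterNet itself. A stand-alone smoke test (a sketch, independent of this repo) can separate the two:

```python
# Minimal GPU smoke test: if a plain convolution already fails with
# "invalid argument" or a cuDNN error, the problem is the PyTorch/CUDA
# build for this GPU, not the CenterNet code or its config.
import torch
import torch.nn as nn

device = torch.device("cuda:0")
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1).to(device)
x = torch.randn(2, 3, 64, 64, device=device)

y = conv(x)               # forward pass through cuDNN
y.sum().backward()        # and a backward pass
torch.cuda.synchronize()  # surface any asynchronous CUDA error here
print("conv forward/backward OK on", torch.cuda.get_device_name(0))
```

If this snippet fails on its own, the fix lives in the environment (a PyTorch binary built for the right CUDA version), which is consistent with how the thread is resolved below.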

guohaoyuan commented 4 years ago

> I solved it by running the code with PyTorch 1.0: https://github.com/UmarSpa/CenterNet

Thank you for your efforts again! I have solved this problem; the CUDA version was responsible for it.
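For anyone landing here later: the pattern in this thread is a Turing GPU (TITAN RTX, compute capability 7.5) paired first with a CUDA 8.0 PyTorch build and then with a CUDA 9.0 one; the errors went away once the binary matched the card. A hypothetical guard (the function name `check_cuda_build` and the `min_cuda` threshold are my own, not part of train.py) could fail fast with a readable message instead of a cuDNN/CUDA error deep inside the first forward pass:

```python
import torch

def check_cuda_build(min_cuda="10.0"):
    """Fail fast if this PyTorch binary was built against a CUDA toolkit
    that is likely too old for the installed GPU (hypothetical helper)."""
    assert torch.cuda.is_available(), "CUDA is not available to this PyTorch build"
    built = torch.version.cuda or "0.0"                       # e.g. '9.0.176'
    built_mm = tuple(int(v) for v in built.split(".")[:2])
    need_mm = tuple(int(v) for v in min_cuda.split(".")[:2])
    cap = torch.cuda.get_device_capability(0)                 # TITAN RTX reports (7, 5)
    if cap >= (7, 5) and built_mm < need_mm:
        raise RuntimeError(
            "GPU compute capability {} likely needs a PyTorch build with CUDA >= {}, "
            "but this binary was built with CUDA {}".format(cap, min_cuda, built)
        )

check_cuda_build()  # call once, before building the model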