Duankaiwen / CenterNet

Codes for our paper "CenterNet: Keypoint Triplets for Object Detection" .
MIT License
1.86k stars · 384 forks

cuda10 train issue #55

Open mama110 opened 5 years ago

mama110 commented 5 years ago

When I run `python3 train.py CenterNet-52`, I get this error (my video card is an RTX 2080 Ti, I'm using CUDA 10 and PyTorch 1.1, and I've modified batch_size and chunk_sizes to 2):

RuntimeError: Expected object of type Variable but found type CUDAType for argument #0 'result' (checked_cast_variable at /pytorch/torch/csrc/autograd/VariableTypeManual.cpp:173)

Is there a way to run this code with PyTorch 1.1 (CUDA 10)? Thanks.

Duankaiwen commented 5 years ago

Hi @mama110. If you use PyTorch 1.1, please refer to this: https://github.com/princeton-vl/CornerNet/pull/98/commits/380943208c35054e0fd30e34d361ebd93e5dd069. And you need to delete all the compiled files before recompiling the corner pooling layers.

mama110 commented 5 years ago

@Duankaiwen I've deleted the compiled folders (build and cpools-xxxxxxxx.egg) and recompiled the corner pooling layers, but it doesn't work. Did I miss something?

Duankaiwen commented 5 years ago

Delete all files except `src`, `__init__.py` and `setup.py`.
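Concretely, the cleanup amounts to removing every build artifact and then recompiling. The sketch below demonstrates it on a scratch directory (in the real repo you would run the same `rm` inside `models/py_utils/_cpools` and then recompile with `python setup.py install --user`; the exact artifact names are assumptions based on the egg name mentioned above):

```shell
# Scratch-directory sketch of the cleanup; in the real repo, run the rm inside
# models/py_utils/_cpools, then recompile with: python setup.py install --user
cpools=$(mktemp -d)
mkdir -p "$cpools/build" "$cpools/src" "$cpools/cpools.egg-info"
touch "$cpools/setup.py" "$cpools/__init__.py" "$cpools/cpools-0.0.0-py3.5-linux-x86_64.egg"

cd "$cpools"
# Delete every build artifact; keep only src/, __init__.py and setup.py
rm -rf build dist cpools.egg-info cpools-*.egg
ls -A   # only __init__.py, setup.py and src remain
```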

mama110 commented 5 years ago

I deleted all compiled files and recompiled the corner pooling layers, but the same error comes up:

RuntimeError: Expected object of type Variable but found type CUDAType for argument #0 'result' (checked_cast_variable at /pytorch/torch/csrc/autograd/VariableTypeManual.cpp:173)

By the way, the fix in princeton-vl/CornerNet@3809432 solved the problem I had when compiling the corner pooling layers, but when I run train.py the aforementioned error comes up.

Duankaiwen commented 5 years ago

Please show the full log.

Duankaiwen commented 5 years ago

@mama110 https://github.com/princeton-vl/CornerNet/issues/104

mama110 commented 5 years ago

@Duankaiwen

```
kun@pupa:~/master/CenterNet-master$ python3 train.py CenterNet-104
loading all datasets...
using 4 threads
loading from cache file: cache/coco_trainval2014.pkl
No cache file found...
loading annotations into memory...
Done (t=14.11s)
creating index...
index created!
118287it [00:38, 3092.92it/s]
loading annotations into memory...
Done (t=10.89s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=9.46s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=11.28s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=9.95s)
creating index...
index created!
loading from cache file: cache/coco_minival2014.pkl
No cache file found...
loading annotations into memory...
Done (t=0.47s)
creating index...
index created!
5000it [00:01, 3069.90it/s]
loading annotations into memory...
Done (t=0.29s)
creating index...
index created!
system config...
{'batch_size': 2, 'cache_dir': 'cache', 'chunk_sizes': [2], 'config_dir': 'config', 'data_dir': './data', 'data_rng': <mtrand.RandomState object at 0x7fae23f95168>, 'dataset': 'MSCOCO', 'decay_rate': 10, 'display': 5, 'learning_rate': 0.00025, 'max_iter': 480000, 'nnet_rng': <mtrand.RandomState object at 0x7fae23f951b0>, 'opt_algo': 'adam', 'prefetch_size': 6, 'pretrain': None, 'result_dir': 'results', 'sampling_function': 'kp_detection', 'snapshot': 5000, 'snapshot_name': 'CenterNet-104', 'stepsize': 450000, 'test_split': 'testdev', 'train_split': 'trainval', 'val_iter': 500, 'val_split': 'minival', 'weight_decay': False, 'weight_decay_rate': 1e-05, 'weight_decay_type': 'l2'}
db config...
{'ae_threshold': 0.5, 'border': 128, 'categories': 80, 'data_aug': True, 'gaussian_bump': True, 'gaussian_iou': 0.7, 'gaussian_radius': -1, 'input_size': [511, 511], 'kp_categories': 1, 'lighting': True, 'max_per_image': 100, 'merge_bbox': False, 'nms_algorithm': 'exp_soft_nms', 'nms_kernel': 3, 'nms_threshold': 0.5, 'output_sizes': [[128, 128]], 'rand_color': True, 'rand_crop': True, 'rand_pushes': False, 'rand_samples': False, 'rand_scale_max': 1.4, 'rand_scale_min': 0.6, 'rand_scale_step': 0.1, 'rand_scales': array([0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3]), 'special_crop': False, 'test_scales': [1], 'top_k': 70, 'weight_exp': 8}
len of db: 118287
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
building model...
module_file: models.CenterNet-104
start prefetching data...
shuffling indices...
total parameters: 210062960
setting learning rate to: 0.00025
training start...
  0%| | 0/480000 [00:00<?, ?it/s]
/home/kun/.local/lib/python3.5/site-packages/torch/nn/_reduction.py:46: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
```

```
Traceback (most recent call last):
  File "train.py", line 203, in <module>
    train(training_dbs, validation_db, args.start_iter)
  File "train.py", line 163, in train
    nnet.set_lr(learning_rate)
  File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/kun/master/CenterNet-master/utils/tqdm.py", line 23, in stdout_to_tqdm
    raise exc
  File "/home/kun/master/CenterNet-master/utils/tqdm.py", line 21, in stdout_to_tqdm
    yield save_stdout
  File "train.py", line 138, in train
    training_loss, focal_loss, pull_loss, push_loss, regr_loss = nnet.train(*training)
  File "/home/kun/master/CenterNet-master/nnet/py_factory.py", line 93, in train
    loss.backward()
  File "/home/kun/.local/lib/python3.5/site-packages/torch/tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/kun/.local/lib/python3.5/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/home/kun/.local/lib/python3.5/site-packages/torch/autograd/function.py", line 77, in apply
    return self._forward_cls.backward(self, *args)
  File "/home/kun/master/CenterNet-master/models/py_utils/_cpools/__init__.py", line 59, in backward
    output = right_pool.backward(input, grad_output)[0]
RuntimeError: Expected object of type Variable but found type CUDAType for argument #0 'result' (checked_cast_variable at /pytorch/torch/csrc/autograd/VariableTypeManual.cpp:173)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7fadda674441 in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7fadda673d7a in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libc10.so)
frame #2: torch::autograd::VariableType::checked_cast_variable(at::Tensor&, char const*, int) + 0x169 (0x7fadd9146419 in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch.so.1)
frame #3: torch::autograd::VariableType::unpack(at::Tensor&, char const*, int) + 0x9 (0x7fadd91464b9 in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch.so.1)
frame #4: torch::autograd::VariableType::s__th_gt_out(at::Tensor&, at::Tensor const&, at::Tensor const&) const + 0x24b (0x7fadd8f716fb in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch.so.1)
frame #5: at::TypeDefault::_th_gt_out(at::Tensor&, at::Tensor const&, at::Tensor const&) const + 0x205 (0x7faddb5841a5 in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #6: at::TypeDefault::gt_out(at::Tensor&, at::Tensor const&, at::Tensor const&) const + 0x62 (0x7faddb566b02 in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #7: torch::autograd::VariableType::gt_out(at::Tensor&, at::Tensor const&, at::Tensor const&) const + 0x35c (0x7fadd901074c in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch.so.1)
frame #8: pool_backward(at::Tensor, at::Tensor) + 0x826 (0x7fadcf17af56 in /home/kun/.local/lib/python3.5/site-packages/cpools-0.0.0-py3.5-linux-x86_64.egg/right_pool.cpython-35m-x86_64-linux-gnu.so)
frame #9: + 0x129e4 (0x7fadcf1869e4 in /home/kun/.local/lib/python3.5/site-packages/cpools-0.0.0-py3.5-linux-x86_64.egg/right_pool.cpython-35m-x86_64-linux-gnu.so)
frame #10: + 0x12afe (0x7fadcf186afe in /home/kun/.local/lib/python3.5/site-packages/cpools-0.0.0-py3.5-linux-x86_64.egg/right_pool.cpython-35m-x86_64-linux-gnu.so)
frame #11: + 0x10a16 (0x7fadcf184a16 in /home/kun/.local/lib/python3.5/site-packages/cpools-0.0.0-py3.5-linux-x86_64.egg/right_pool.cpython-35m-x86_64-linux-gnu.so)

frame #15: python3() [0x4ec2e3]
frame #19: python3() [0x4ec2e3]
frame #21: python3() [0x4fbfce]
frame #24: torch::autograd::PyFunction::apply(std::vector<torch::autograd::Variable>&&) + 0x193 (0x7fae22e8d833 in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch_python.so)
frame #25: + 0x3108aa (0x7fadd8b058aa in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch.so.1)
frame #26: torch::autograd::Engine::evaluate_function(torch::autograd::FunctionTask&) + 0x385 (0x7fadd8afe975 in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch.so.1)
frame #27: torch::autograd::Engine::thread_main(torch::autograd::GraphTask*) + 0xc0 (0x7fadd8b00970 in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch.so.1)
frame #28: torch::autograd::Engine::thread_init(int) + 0x136 (0x7fadd8afdd46 in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch.so.1)
frame #29: torch::autograd::python::PythonEngine::thread_init(int) + 0x2a (0x7fae22e882fa in /home/kun/.local/lib/python3.5/site-packages/torch/lib/libtorch_python.so)
frame #30: + 0xb8c80 (0x7fadda179c80 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #31: + 0x76ba (0x7fae368036ba in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #32: clone + 0x6d (0x7fae3653941d in /lib/x86_64-linux-gnu/libc.so.6)
```

Duankaiwen commented 5 years ago

Please show your models/py_utils/_cpools/src/bottom_pool.cpp, left_pool.cpp, right_pool.cpp and top_pool.cpp.

mama110 commented 5 years ago

bottom_pool.cpp (the other 3 .cpp files are similar to this one):

```cpp
#include <torch/extension.h>

#include <vector>

std::vector<at::Tensor> pool_forward(
    at::Tensor input
) {
    // Initialize output
    at::Tensor output = at::zeros_like(input);

    // Get height
    int64_t height = input.size(2);

    // Copy the last column
    at::Tensor input_temp  = input.select(2, 0);
    at::Tensor output_temp = output.select(2, 0);
    output_temp.copy_(input_temp);

    at::Tensor max_temp;
    for (int64_t ind = 0; ind < height - 1; ++ind) {
        input_temp  = input.select(2, ind + 1);
        output_temp = output.select(2, ind);
        max_temp    = output.select(2, ind + 1);

        at::max_out(max_temp, input_temp, output_temp);
    }

    return {
        output
    };
}

std::vector<at::Tensor> pool_backward(
    at::Tensor input,
    at::Tensor grad_output
) {
    auto output = at::zeros_like(input);

    int32_t batch   = input.size(0);
    int32_t channel = input.size(1);
    int32_t height  = input.size(2);
    int32_t width   = input.size(3);

    auto max_val = at::zeros({batch, channel, width}, torch::TensorOptions().dtype(torch::kFloat).device(torch::kCUDA));
    auto max_ind = at::zeros({batch, channel, width}, torch::TensorOptions().dtype(torch::kLong).device(torch::kCUDA));

    auto input_temp = input.select(2, 0);
    max_val.copy_(input_temp);

    max_ind.fill_(0);

    auto output_temp      = output.select(2, 0);
    auto grad_output_temp = grad_output.select(2, 0);
    output_temp.copy_(grad_output_temp);

    auto un_max_ind = max_ind.unsqueeze(2);
    auto gt_mask    = at::zeros({batch, channel, width}, torch::TensorOptions().dtype(torch::kByte).device(torch::kCUDA));
    auto max_temp   = at::zeros({batch, channel, width}, torch::TensorOptions().dtype(torch::kFloat).device(torch::kCUDA));
    for (int32_t ind = 0; ind < height - 1; ++ind) {
        input_temp = input.select(2, ind + 1);
        at::gt_out(gt_mask, input_temp, max_val);

        at::masked_select_out(max_temp, input_temp, gt_mask);
        max_val.masked_scatter_(gt_mask, max_temp);
        max_ind.masked_fill_(gt_mask, ind + 1);

        grad_output_temp = grad_output.select(2, ind + 1).unsqueeze(2);
        output.scatter_add_(2, un_max_ind, grad_output_temp);
    }

    return {
        output
    };
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def(
        "forward", &pool_forward, "Bottom Pool Forward",
        py::call_guard<py::gil_scoped_release>()
    );
    m.def(
        "backward", &pool_backward, "Bottom Pool Backward",
        py::call_guard<py::gil_scoped_release>()
    );
}
```
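For reference, the loop in `pool_forward` above is just a running maximum along dim 2 (each output row is the max of all input rows up to it). A small NumPy sketch of the same recurrence (my own illustration, not code from the repo):

```python
import numpy as np

def bottom_pool_forward(x):
    """Running max along axis 2, mirroring the C++ loop:
    out[:, :, ind + 1] = max(x[:, :, ind + 1], out[:, :, ind])."""
    out = x.copy()
    for ind in range(x.shape[2] - 1):
        out[:, :, ind + 1] = np.maximum(x[:, :, ind + 1], out[:, :, ind])
    return out

x = np.array([[[[3.0], [1.0], [2.0]]]])        # shape (1, 1, 3, 1)
print(bottom_pool_forward(x)[0, 0, :, 0])      # -> [3. 3. 3.]
```

The whole loop is equivalent to `np.maximum.accumulate(x, axis=2)`; the C++ version spells it out row by row because it writes the result in place via `at::max_out`.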

yulei1234 commented 5 years ago

Is the problem solved? I have encountered a similar problem...

lolongcovas commented 5 years ago

Hi all, I faced the same error. I found the solution in the CornerNet repo from princeton-vl.