terminate called after throwing an instance of 'c10::Error'

karolmajek commented 5 years ago

Hi, I am trying to run segmentation using a pretrained model. I am using Docker on Ubuntu 18.04 with a GPU. nvidia-smi works fine (but the whole GPU memory is already in use by a training job running in the background).

nvidia-docker run -ti --rm -e DISPLAY -v /tmp/.X11-unix:/tmp/.X11-unix -v $HOME/.Xauthority:/home/developer/.Xauthority -v /home/$USER:/home/$USER --net=host --pid=host -v /mnt/Data/dataset002mp4:/home/developer/dataset2 --ipc=host tano297/bonnetal:runtime /bin/bash

In docker:

cd deploy
catkin init
catkin build
cd ~/bonnetal/deploy/devel/lib/bonnetal_segmentation_standalone
./infer_img -p mapillary_darknet53_aspp_res_512_os8_40/ -i ~/dataset2/frames/00000001.jpg -v

I get:

================================================================================
image: /home/developer/dataset2/frames/00000001.jpg
path: mapillary_darknet53_aspp_res_512_os8_40//
backend: pytorch. Using default!
verbose: 1
================================================================================
Trying to open model
Could not send model to GPU, using CPU
terminate called after throwing an instance of 'c10::Error'
  what():  open file failed, file path: mapillary_darknet53_aspp_res_512_os8_40///model.pytorch (FileAdapter at ../caffe2/serialize/file_adapter.cc:11)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6c (0x7f824a5e845c in /usr/local/lib/libc10.so)
frame #1: caffe2::serialize::FileAdapter::FileAdapter(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x208 (0x7f82c2382538 in /usr/local/lib/libcaffe2.so)
frame #2: torch::jit::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0x40 (0x7f824b1f9250 in /usr/local/lib/libtorch.so.1)
frame #3: bonnetal::segmentation::NetPytorch::NetPytorch(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x3e1 (0x7f82c4e73171 in /home/developer/bonnetal/deploy/devel/.private/bonnetal_segmentation_lib/lib/libbonnetal_segmentation_lib.so)
frame #4: bonnetal::segmentation::make_net(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x3a6 (0x7f82c4e71926 in /home/developer/bonnetal/deploy/devel/.private/bonnetal_segmentation_lib/lib/libbonnetal_segmentation_lib.so)
frame #5: <unknown function> + 0x7dfe (0x55dc04edddfe in ./infer_img)
frame #6: __libc_start_main + 0xe7 (0x7f824bd55b97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #7: <unknown function> + 0x87ea (0x55dc04ede7ea in ./infer_img)

Aborted (core dumped)

Anything obvious? Is it related to there being no free memory on the GPU? I also tried with CUDA_VISIBLE_DEVICES=''. I was looking for an example of how to use the pretrained models, but haven't found any instructions. I am going to use these models and present the results on YT. I will be very grateful for any help.

BTW, I am using docker because I have ROS1 with catkin_make and no catkin command.

tano297 commented 5 years ago

Hi,

Have you made the model directory deploy ready? Check the instructions here.

After that you should have both a .pytorch and a .onnx model file in the pretrained directory!
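
A quick way to confirm this before running infer_img is to check the model directory for the files the backends look for. The sketch below is a hypothetical helper (missing_deploy_files is not part of bonnetal); the file names cfg.yaml, model.onnx and model.pytorch are taken from the output shown in this thread, and nothing else about the expected layout is assumed.

```python
from pathlib import Path

# File names as they appear in this thread; the helper itself is hypothetical.
EXPECTED_FILES = ("cfg.yaml", "model.onnx", "model.pytorch")


def missing_deploy_files(model_dir):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir).expanduser()
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling missing_deploy_files("mapillary_darknet53_aspp_res_512_os8_40") on a directory that has not been converted would list model.pytorch among the missing files, which matches the "open file failed" message in the trace above.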

Let me know if it works like that.

Also, I haven't tried building the workspace with catkin_make, because we use catkin internally, but if you want to give the build a shot, let me know how that works. You may want to either clean your workspace or start from a fresh one with just this package inside. If you have TensorRT, you can benefit greatly from running the inference natively on your PC (Docker has some performance issues with GPUs; I haven't been able to make it run at 100% of the speed of my native Linux install).

karolmajek commented 4 years ago

I am back! I installed it on AWS.

While trying to convert the model I get:

bonnetal/train/tasks/segmentation$ ./make_deploy_model.py -p ~/mapillary_mobilenetsv2_aspp_res_512_os8_34 -l /tmp
----------
INTERFACE:
model path /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34
log dir /tmp
Height force None
Width force None
----------

Commit hash (training version):  b'5368eed'
----------

model folder exists! Using model from /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34
[Errno 1] Operation not permitted: '.X11-unix'
Error creating log directory. Check permissions!

Are you trying to empty the log directory?
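
The error above suggests that the script tries to clear the log directory and fails on entries in /tmp, such as .X11-unix, that a normal user cannot remove. That is an inference from the message, not something confirmed in this thread. A minimal sketch of a workaround, assuming only the -p and -l flags shown above, is to hand the script a fresh, empty directory instead of /tmp itself. Both helper names are hypothetical.

```python
import tempfile
from pathlib import Path


def fresh_log_dir(prefix="bonnetal_deploy_"):
    """Create and return a new, empty, user-owned directory for the logs."""
    return Path(tempfile.mkdtemp(prefix=prefix))


def deploy_command(model_dir, log_dir):
    """Build the make_deploy_model.py call with the -p/-l flags from this thread."""
    return ["./make_deploy_model.py", "-p", str(model_dir), "-l", str(log_dir)]
```

The returned list can be passed to subprocess.run from inside bonnetal/train/tasks/segmentation, or joined with spaces and pasted into the shell.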

2nd try:

~/bonnetal/train/tasks/segmentation$ ./make_deploy_model.py -p ~/mapillary_mobilenetsv2_aspp_res_512_os8_34 -l ~/mapillary_mobilenetsv2_aspp_res_512_os8_34/converted/
----------
INTERFACE:
model path /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34
log dir /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34/converted/
Height force None
Width force None
----------

Commit hash (training version):  b'5368eed'
----------

model folder exists! Using model from /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34
Opening config file /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34/cfg.yaml
Original OS:  32
New OS:  8.0
[Decoder] os:  4 in:  48 skip: 24 out:  24
[Decoder] os:  2 in:  24 skip: 16 out:  16
[Decoder] os:  1 in:  16 skip: 3 out:  32
Successfully loaded model backbone weights
Successfully loaded model decoder weights
Successfully loaded model head weights
Total number of parameters:  2319082
Total number of parameters requires_grad:  0
Creating dummy input to profile
Saving config file /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34/converted//cfg.yaml
Profiling model
saving model in  /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34/converted/model.onnx
../..//backbones/mobilenetv2.py:147: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if y.shape[2] < x.shape[2] or y.shape[3] < x.shape[3]:
../..//backbones/mobilenetv2.py:149: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert(x.shape[2]/y.shape[2] == x.shape[3]/y.shape[3])
Checking that it all worked out
Profiling model
/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/jit/__init__.py:745: TracerWarning: Output nr 1. of the traced function does not match the corresponding output of the Python function. Detailed error:
Not within tolerance rtol=1e-05 atol=1e-05 at input[0, 31, 15, 571] (0.05570845305919647 vs. 0.05571943521499634) and 0 other locations (0.00%)
  _check_trace([example_inputs], func, executor_options, traced, check_tolerance, _force_outplace, False)
saving model in  /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34/converted/model.pytorch

I guess it's ok:

l ~/mapillary_mobilenetsv2_aspp_res_512_os8_34/converted
cfg.yaml  model.onnx  model.pytorch
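
To see how far off the TracerWarning mismatch really is, the two values from the log can be compared by hand. The sketch below assumes the check uses the same criterion as torch.allclose, |a - b| <= atol + rtol * |b|; the numbers are copied from the warning, and which of the two is the traced output is not stated there, so they are simply called first and second.

```python
# Values copied from the TracerWarning in the conversion output.
first = 0.05570845305919647
second = 0.05571943521499634
rtol = 1e-05
atol = 1e-05


def within_tolerance(a, b, rtol, atol):
    """Closeness test of the form |a - b| <= atol + rtol * |b| (assumed)."""
    return abs(a - b) <= atol + rtol * abs(b)


diff = abs(first - second)
allowed = atol + rtol * abs(second)
print(f"difference: {diff:.3e}  allowed: {allowed:.3e}")
print("within tolerance:", within_tolerance(first, second, rtol, atol))
```

The difference is roughly 1.1e-5 against an allowed value of roughly 1.06e-5, so the single reported element misses the tolerance only marginally (about 0.02% of its value), which is consistent with treating the converted model as usable.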

and:

./infer_video -p ~/mapillary_mobilenetsv2_aspp_res_512_os8_34/converted/ -b Pytorch --video ~/0002-20170519-2.mp4 
================================================================================
video: /home/ubuntu/0002-20170519-2.mp4
path: /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34/converted//
backend: Pytorch
verbose: 0
================================================================================
Trying to open model
Successfully opened model
Opening video/home/ubuntu/0002-20170519-2.mp4 for prediction.
================================================================================
Predicting frame: 0
================================================================================
================================================================================
Predicting frame: 1
================================================================================
================================================================================
Predicting frame: 2
================================================================================
================================================================================
Predicting frame: 3

How do I get the results? Where are the files saved?

In the Python version I see a log option, which is a directory for saving the output, and the backend name is pytorch instead of Pytorch:

./infer_video.py -p ~/mapillary_mobilenetsv2_aspp_res_512_os8_34/converted/ -b pytorch --video ~/0002-20170519-2.mp4 -l ~/mapillary_mobilenetsv2_aspp_res_512_os8_34/results/
----------
INTERFACE:
Video /home/ubuntu/0002-20170519-2.mp4
log dir /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34/results/
model path /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34/converted/
backend pytorch
workspace 1000000000
Verbose False
Mask None
INT8 Calibration Images None
----------

Commit hash:  b'5368eed'
----------

model folder exists! Using model from /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34/converted/
Opening config file /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34/converted//cfg.yaml
Successfully Pytorch-traced model from  /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34/converted/model.pytorch
Trying to open video:  /home/ubuntu/0002-20170519-2.mp4

Finally, I am able to run it using Python. Thank you so much for the help!

PRBonn / bonnetal

terminate called after throwing an instance of 'c10::Error' #9