PRBonn / bonnetal

Bonnet and then some! Deep Learning Framework for various Image Recognition Tasks. Photogrammetry and Robotics Lab, University of Bonn

terminate called after throwing an instance of 'c10::Error' #9

Open karolmajek opened 5 years ago

karolmajek commented 5 years ago

Hi, I am trying to run segmentation using a pretrained model. I am using Docker on Ubuntu 18.04 with a GPU. nvidia-smi works fine (but the whole GPU memory is already used by a training job running in the background).

nvidia-docker run -ti --rm -e DISPLAY -v /tmp/.X11-unix:/tmp/.X11-unix -v $HOME/.Xauthority:/home/developer/.Xauthority -v /home/$USER:/home/$USER --net=host --pid=host -v /mnt/Data/dataset002mp4:/home/developer/dataset2 --ipc=host tano297/bonnetal:runtime /bin/bash

In docker:

cd deploy
catkin init
catkin build
cd ~/bonnetal/deploy/devel/lib/bonnetal_segmentation_standalone
./infer_img -p mapillary_darknet53_aspp_res_512_os8_40/ -i ~/dataset2/frames/00000001.jpg -v

I get:

================================================================================
image: /home/developer/dataset2/frames/00000001.jpg
path: mapillary_darknet53_aspp_res_512_os8_40//
backend: pytorch. Using default!
verbose: 1
================================================================================
Trying to open model
Could not send model to GPU, using CPU
terminate called after throwing an instance of 'c10::Error'
  what():  open file failed, file path: mapillary_darknet53_aspp_res_512_os8_40///model.pytorch (FileAdapter at ../caffe2/serialize/file_adapter.cc:11)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6c (0x7f824a5e845c in /usr/local/lib/libc10.so)
frame #1: caffe2::serialize::FileAdapter::FileAdapter(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x208 (0x7f82c2382538 in /usr/local/lib/libcaffe2.so)
frame #2: torch::jit::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0x40 (0x7f824b1f9250 in /usr/local/lib/libtorch.so.1)
frame #3: bonnetal::segmentation::NetPytorch::NetPytorch(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x3e1 (0x7f82c4e73171 in /home/developer/bonnetal/deploy/devel/.private/bonnetal_segmentation_lib/lib/libbonnetal_segmentation_lib.so)
frame #4: bonnetal::segmentation::make_net(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x3a6 (0x7f82c4e71926 in /home/developer/bonnetal/deploy/devel/.private/bonnetal_segmentation_lib/lib/libbonnetal_segmentation_lib.so)
frame #5: <unknown function> + 0x7dfe (0x55dc04edddfe in ./infer_img)
frame #6: __libc_start_main + 0xe7 (0x7f824bd55b97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #7: <unknown function> + 0x87ea (0x55dc04ede7ea in ./infer_img)

Aborted (core dumped)

Anything obvious? Is it related to there being no free memory on the GPU? I also tried with CUDA_VISIBLE_DEVICES=''. I was looking for an example of how to use the pretrained models, but haven't found any instructions. I ultimately want to use these models and present the results on YouTube. I would be very grateful for any help.

BTW, I am using Docker because I have ROS1 with catkin_make and no catkin command.

tano297 commented 5 years ago

Hi,

Have you made the model directory deploy ready? Check the instructions here.

After that you should have both a .pytorch and a .onnx model file in the pretrained directory!
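
A quick way to check whether a model directory is deploy ready before calling infer_img is to look for the expected files. The sketch below is only an illustration: the helper name `missing_deploy_files` is hypothetical, and the three file names are an assumption taken from the directory listing of the converted model shown later in this thread.

```python
from pathlib import Path

# Assumed file names, taken from the listing of the converted model
# directory later in this thread (cfg.yaml, model.onnx, model.pytorch).
EXPECTED_FILES = ("cfg.yaml", "model.onnx", "model.pytorch")


def missing_deploy_files(model_dir):
    """Return the expected deploy files that are absent from model_dir."""
    root = Path(model_dir).expanduser()
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Check the current directory as a simple demonstration.
    missing = missing_deploy_files(".")
    if missing:
        print("not deploy ready, missing:", ", ".join(missing))
    else:
        print("all expected deploy files are present")
```

An empty list means all three files were found. A missing model.pytorch would match the "open file failed" message in the trace above.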

Let me know if it works like that.

Also, I haven't tried building the workspace with catkin_make, because we use catkin internally, but if you want to give the build a shot, let me know how that works. You may want to either clean your workspace or start from a fresh one with just this package inside. If you have TensorRT, you can benefit greatly from running the inference natively on your PC (Docker has some performance issues with GPUs; I haven't been able to make it run at 100% of the speed of my native Linux install).

karolmajek commented 4 years ago

I am back! I installed it on AWS.

While trying to convert the model, I get:

bonnetal/train/tasks/segmentation$ ./make_deploy_model.py -p ~/mapillary_mobilenetsv2_aspp_res_512_os8_34 -l /tmp
----------
INTERFACE:
model path /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34
log dir /tmp
Height force None
Width force None
----------

Commit hash (training version):  b'5368eed'
----------

model folder exists! Using model from /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34
[Errno 1] Operation not permitted: '.X11-unix'
Error creating log directory. Check permissions!

Are you trying to empty the log directory?
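
The "Operation not permitted: '.X11-unix'" message suggests that the script tries to clear whatever directory is passed with -l, which fails for a shared location like /tmp. That reading is an assumption based only on the log above. A minimal sketch of a workaround is to create a dedicated, empty directory and pass that instead. The helper name `fresh_log_dir` and the prefix are hypothetical.

```python
import tempfile


def fresh_log_dir(prefix="bonnetal_deploy_"):
    """Create and return a new, empty directory owned by the current user."""
    # mkdtemp creates a uniquely named directory readable and writable
    # only by the creating user, so nothing else lives inside it.
    return tempfile.mkdtemp(prefix=prefix)


if __name__ == "__main__":
    log_dir = fresh_log_dir()
    # Pass this path to -l instead of /tmp itself.
    print(log_dir)
```

The printed path can then be given to make_deploy_model.py as the -l argument, in the same way as the dedicated directory used in the second try.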

2nd try:

~/bonnetal/train/tasks/segmentation$ ./make_deploy_model.py -p ~/mapillary_mobilenetsv2_aspp_res_512_os8_34 -l ~/mapillary_mobilenetsv2_aspp_res_512_os8_34/converted/
----------
INTERFACE:
model path /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34
log dir /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34/converted/
Height force None
Width force None
----------

Commit hash (training version):  b'5368eed'
----------

model folder exists! Using model from /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34
Opening config file /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34/cfg.yaml
Original OS:  32
New OS:  8.0
[Decoder] os:  4 in:  48 skip: 24 out:  24
[Decoder] os:  2 in:  24 skip: 16 out:  16
[Decoder] os:  1 in:  16 skip: 3 out:  32
Successfully loaded model backbone weights
Successfully loaded model decoder weights
Successfully loaded model head weights
Total number of parameters:  2319082
Total number of parameters requires_grad:  0
Creating dummy input to profile
Saving config file /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34/converted//cfg.yaml
Profiling model
saving model in  /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34/converted/model.onnx
../..//backbones/mobilenetv2.py:147: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if y.shape[2] < x.shape[2] or y.shape[3] < x.shape[3]:
../..//backbones/mobilenetv2.py:149: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert(x.shape[2]/y.shape[2] == x.shape[3]/y.shape[3])
Checking that it all worked out
Profiling model
/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/jit/__init__.py:745: TracerWarning: Output nr 1. of the traced function does not match the corresponding output of the Python function. Detailed error:
Not within tolerance rtol=1e-05 atol=1e-05 at input[0, 31, 15, 571] (0.05570845305919647 vs. 0.05571943521499634) and 0 other locations (0.00%)
  _check_trace([example_inputs], func, executor_options, traced, check_tolerance, _force_outplace, False)
saving model in  /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34/converted/model.pytorch
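
On the "Not within tolerance" warning in the log above: the mismatch is marginal. The sketch below redoes the arithmetic, assuming the check uses the usual allclose-style rule |a - b| <= atol + rtol * |b|. The function name `within_tolerance` is hypothetical, and which of the two logged values is the traced output is not stated in the log, though the result is the same either way.

```python
def within_tolerance(a, b, rtol=1e-5, atol=1e-5):
    """allclose-style check: |a - b| <= atol + rtol * |b|."""
    return abs(a - b) <= atol + rtol * abs(b)


# The two values reported in the TracerWarning above.
a = 0.05570845305919647
b = 0.05571943521499634

diff = abs(a - b)
allowed = 1e-5 + 1e-5 * abs(b)
print(f"difference {diff:.3e}, allowed {allowed:.3e}")
print("within tolerance:", within_tolerance(a, b))
```

The difference is about 1.10e-5 against an allowed 1.06e-5, at a single location in the output, so the check fails only by a small margin.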

I guess it's ok:

l ~/mapillary_mobilenetsv2_aspp_res_512_os8_34/converted
cfg.yaml  model.onnx  model.pytorch

and:

./infer_video -p ~/mapillary_mobilenetsv2_aspp_res_512_os8_34/converted/ -b Pytorch --video ~/0002-20170519-2.mp4 
================================================================================
video: /home/ubuntu/0002-20170519-2.mp4
path: /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34/converted//
backend: Pytorch
verbose: 0
================================================================================
Trying to open model
Successfully opened model
Opening video/home/ubuntu/0002-20170519-2.mp4 for prediction.
================================================================================
Predicting frame: 0
================================================================================
================================================================================
Predicting frame: 1
================================================================================
================================================================================
Predicting frame: 2
================================================================================
================================================================================
Predicting frame: 3

How do I get the results? Where are the files saved?

In the Python version I see a log option, which is a directory for saving the output. Also, the backend name there is pytorch instead of Pytorch.

./infer_video.py -p ~/mapillary_mobilenetsv2_aspp_res_512_os8_34/converted/ -b pytorch --video ~/0002-20170519-2.mp4 -l ~/mapillary_mobilenetsv2_aspp_res_512_os8_34/results/
----------
INTERFACE:
Video /home/ubuntu/0002-20170519-2.mp4
log dir /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34/results/
model path /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34/converted/
backend pytorch
workspace 1000000000
Verbose False
Mask None
INT8 Calibration Images None
----------

Commit hash:  b'5368eed'
----------

model folder exists! Using model from /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34/converted/
Opening config file /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34/converted//cfg.yaml
Successfully Pytorch-traced model from  /home/ubuntu/mapillary_mobilenetsv2_aspp_res_512_os8_34/converted/model.pytorch
Trying to open video:  /home/ubuntu/0002-20170519-2.mp4

Finally, I am able to run it using Python. Thank you so much for the help!