chensong1995 / HybridPose

HybridPose: 6D Object Pose Estimation under Hybrid Representation (CVPR 2020)
MIT License

using ape pretrained weight and load_dir in train_core.py #85

Closed: monajalal closed this issue 1 year ago

monajalal commented 1 year ago

I get this error:

(hp) mona@mona-ThinkStation-P7:~/HP/HybridPose$ LD_LIBRARY_PATH=lib/regressor:$LD_LIBRARY_PATH python src/train_core.py --load_dir saved_weights/linemod/ape/checkpoints/0.001/199 --object_name ape
number of model parameters: 12959563
loading checkpoint from saved_weights/linemod/ape/checkpoints/0.001/199
> /home/mona/HP/HybridPose/lib/utils.py(32)load_session()
-> print('Could not restore session properly, check the load_dir')
(Pdb) quit()
Traceback (most recent call last):
  File "/home/mona/anaconda3/envs/hp/lib/python3.7/tarfile.py", line 187, in nti
    n = int(s.strip() or "0", 8)
ValueError: invalid literal for int() with base 8: 's\n_rebui'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mona/anaconda3/envs/hp/lib/python3.7/tarfile.py", line 2289, in next
    tarinfo = self.tarinfo.fromtarfile(self)
  File "/home/mona/anaconda3/envs/hp/lib/python3.7/tarfile.py", line 1095, in fromtarfile
    obj = cls.frombuf(buf, tarfile.encoding, tarfile.errors)
  File "/home/mona/anaconda3/envs/hp/lib/python3.7/tarfile.py", line 1037, in frombuf
    chksum = nti(buf[148:156])
  File "/home/mona/anaconda3/envs/hp/lib/python3.7/tarfile.py", line 189, in nti
    raise InvalidHeaderError("invalid header")
tarfile.InvalidHeaderError: invalid header

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mona/anaconda3/envs/hp/lib/python3.7/site-packages/torch/serialization.py", line 555, in _load
    return legacy_load(f)
  File "/home/mona/anaconda3/envs/hp/lib/python3.7/site-packages/torch/serialization.py", line 466, in legacy_load
    with closing(tarfile.open(fileobj=f, mode='r:', format=tarfile.PAX_FORMAT)) as tar, \
  File "/home/mona/anaconda3/envs/hp/lib/python3.7/tarfile.py", line 1591, in open
    return func(name, filemode, fileobj, **kwargs)
  File "/home/mona/anaconda3/envs/hp/lib/python3.7/tarfile.py", line 1621, in taropen
    return cls(name, mode, fileobj, **kwargs)
  File "/home/mona/anaconda3/envs/hp/lib/python3.7/tarfile.py", line 1484, in __init__
    self.firstmember = self.next()
  File "/home/mona/anaconda3/envs/hp/lib/python3.7/tarfile.py", line 2301, in next
    raise ReadError(str(e))
tarfile.ReadError: invalid header

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mona/HP/HybridPose/lib/utils.py", line 25, in load_session
    model.load_state_dict(torch.load(os.path.join(args.load_dir, 'model.pth')))
  File "/home/mona/anaconda3/envs/hp/lib/python3.7/site-packages/torch/serialization.py", line 386, in load
    return _load(f, map_location, pickle_module, **pickle_load_args)
  File "/home/mona/anaconda3/envs/hp/lib/python3.7/site-packages/torch/serialization.py", line 559, in _load
    raise RuntimeError("{} is a zip archive (did you mean to use torch.jit.load()?)".format(f.name))
RuntimeError: saved_weights/linemod/ape/checkpoints/0.001/199/model.pth is a zip archive (did you mean to use torch.jit.load()?)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "src/train_core.py", line 102, in <module>
    model, optimizer, start_epoch = setup_model(args)
  File "src/train_core.py", line 88, in setup_model
    model, optimizer, start_epoch = load_session(model, optimizer, args)
  File "/home/mona/HP/HybridPose/lib/utils.py", line 32, in load_session
    print('Could not restore session properly, check the load_dir')
  File "/home/mona/HP/HybridPose/lib/utils.py", line 32, in load_session
    print('Could not restore session properly, check the load_dir')
  File "/home/mona/anaconda3/envs/hp/lib/python3.7/bdb.py", line 88, in trace_dispatch
    return self.dispatch_line(frame)
  File "/home/mona/anaconda3/envs/hp/lib/python3.7/bdb.py", line 113, in dispatch_line
    if self.quitting: raise BdbQuit
bdb.BdbQuit

The checkpoint directory contains:

(hp) mona@mona-ThinkStation-P7:~/HP/HybridPose$ ls saved_weights/linemod/ape/checkpoints/0.001/199
total 149M
drwx------ 12 mona mona 4.0K Oct 23 15:23 ..
-rw-------  1 mona mona  50M Oct 23 15:23 model.pth
drwx------  2 mona mona 4.0K Oct 23 15:23 .
-rw-------  1 mona mona  99M Oct 23 15:23 optim.pth

chensong1995 commented 1 year ago

Hi Mona,

Can you check if this link helps?

monajalal commented 1 year ago

I am confused because I am using exactly the same versions of torch and torchvision as listed in your requirements.txt. Doesn't that mean the code should load the .pth model without further modification? Thanks for any explanation:

(hp) mona@mona-ThinkStation-P7:~/HP/HybridPose$ python
Python 3.7.4 (default, Aug 13 2019, 20:35:49) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.2.0'
>>> import torchvision
>>> torchvision.__version__
'0.4.0a0'
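
The "is a zip archive" message in the first traceback suggests that the model.pth on disk is a zip container, which torch 1.2.0 does not read as a checkpoint (torch.save only switched to a zip-based format by default in PyTorch 1.6). A minimal sketch, using only the standard library, for checking which container format a file has before handing it to torch.load; the function name checkpoint_format and its return labels are made up for illustration:

```python
import zipfile


def checkpoint_format(path):
    """Classify a checkpoint file by its container format.

    Returns "zip" if the file is a zip archive (the default torch.save
    format since PyTorch 1.6), otherwise "not-zip" (for example a legacy
    pickle-based checkpoint, a damaged download, or a missing file).
    """
    # is_zipfile looks for the zip end-of-central-directory record.
    return "zip" if zipfile.is_zipfile(path) else "not-zip"


if __name__ == "__main__":
    # Path taken from the command line used in this thread.
    path = "saved_weights/linemod/ape/checkpoints/0.001/199/model.pth"
    print(path, "->", checkpoint_format(path))
```

A "zip" result under torch 1.2.0 would be consistent with the error above. It does not by itself say whether the file is intact, so comparing checksums against a known-good copy is still worthwhile.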

monajalal commented 1 year ago

I also changed torch.load to torch.jit.load and still get the same problem:

def load_session(model, optim, args):
    try:
        start_epoch = int(args.load_dir.split('/')[-1]) + 1
        # model.load_state_dict(torch.load(os.path.join(args.load_dir, 'model.pth')))
        # optim.load_state_dict(torch.load(os.path.join(args.load_dir, 'optim.pth')))
        model.load_state_dict(torch.jit.load(os.path.join(args.load_dir, 'model.pth')))
        optim.load_state_dict(torch.jit.load(os.path.join(args.load_dir, 'optim.pth')))
        for param_group in optim.param_groups:
            param_group['lr'] = args.lr
        print('Successfully loaded model from {}'.format(args.load_dir))
    except Exception as e:
        pdb.set_trace()
        print('Could not restore session properly, check the load_dir')

    return model, optim, start_epoch
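
load_session derives the resume epoch from the last component of load_dir, so a trailing slash in --load_dir would make int() fail and land in the same except branch as a genuine loading error. A small sketch of a more tolerant parse; parse_start_epoch is a hypothetical helper, not part of HybridPose:

```python
import os


def parse_start_epoch(load_dir):
    """Return the epoch to resume at for a checkpoint directory.

    Assumes, as load_session does, that the last path component is the
    epoch number of the saved checkpoint.
    """
    # normpath drops any trailing separator before basename is taken.
    last_component = os.path.basename(os.path.normpath(load_dir))
    return int(last_component) + 1


print(parse_start_epoch("saved_weights/linemod/ape/checkpoints/0.001/199"))   # 200
print(parse_start_epoch("saved_weights/linemod/ape/checkpoints/0.001/199/"))  # 200
```

A directory whose last component is not a number still raises ValueError, which keeps a mistyped path visible instead of silently resuming from the wrong epoch.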

Now I get:

(hp) mona@mona-ThinkStation-P7:~/HP/HybridPose$ LD_LIBRARY_PATH=lib/regressor:$LD_LIBRARY_PATH python src/train_core.py --load_dir /home/mona/HP/HybridPose/saved_weights/linemod/ape/checkpoints/0.001/199 --object_name ape
number of model parameters: 12959563
loading checkpoint from /home/mona/HP/HybridPose/saved_weights/linemod/ape/checkpoints/0.001/199
> /home/mona/HP/HybridPose/lib/utils.py(34)load_session()
-> print('Could not restore session properly, check the load_dir')
(Pdb) quit()
Traceback (most recent call last):
  File "/home/mona/HP/HybridPose/lib/utils.py", line 27, in load_session
    model.load_state_dict(torch.jit.load(os.path.join(args.load_dir, 'model.pth')))
  File "/home/mona/anaconda3/envs/hp/lib/python3.7/site-packages/torch/jit/__init__.py", line 162, in load
    cpp_module = torch._C.import_ir_module(cu, f, map_location, _extra_files)
RuntimeError: version_number <= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at /tmp/pip-req-build-58y_cjjl/caffe2/serialize/inline_container.cc:131, please report a bug to PyTorch. Attempted to read a PyTorch file with version 3, but the maximum supported version for reading is 1. Your PyTorch installation may be too old. (init at /tmp/pip-req-build-58y_cjjl/caffe2/serialize/inline_container.cc:131)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6d (0x7fb30d0091cd in /home/mona/anaconda3/envs/hp/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: caffe2::serialize::PyTorchStreamReader::init() + 0x246d (0x7fb2defafe9d in /home/mona/anaconda3/envs/hp/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #2: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x69 (0x7fb2defb1359 in /home/mona/anaconda3/envs/hp/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #3: torch::jit::import_ir_module(std::shared_ptr<torch::jit::script::CompilationUnit>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0x4d (0x7fb2e0119ddd in /home/mona/anaconda3/envs/hp/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #4: <unknown function> + 0x51031b (0x7fb2ff51031b in /home/mona/anaconda3/envs/hp/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x1c7126 (0x7fb2ff1c7126 in /home/mona/anaconda3/envs/hp/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #24: <unknown function> + 0x29d90 (0x7fb30f629d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #25: __libc_start_main + 0x80 (0x7fb30f629e40 in /lib/x86_64-linux-gnu/libc.so.6)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "src/train_core.py", line 102, in <module>
    model, optimizer, start_epoch = setup_model(args)
  File "src/train_core.py", line 88, in setup_model
    model, optimizer, start_epoch = load_session(model, optimizer, args)
  File "/home/mona/HP/HybridPose/lib/utils.py", line 34, in load_session
    print('Could not restore session properly, check the load_dir')
  File "/home/mona/HP/HybridPose/lib/utils.py", line 34, in load_session
    print('Could not restore session properly, check the load_dir')
  File "/home/mona/anaconda3/envs/hp/lib/python3.7/bdb.py", line 88, in trace_dispatch
    return self.dispatch_line(frame)
  File "/home/mona/anaconda3/envs/hp/lib/python3.7/bdb.py", line 113, in dispatch_line
    if self.quitting: raise BdbQuit
bdb.BdbQuit
(hp) mona@mona-ThinkStation-P7:~/HP/HybridPose$ ls /home/mona/HP/HybridPose/saved_weights/linemod/ape/checkpoints/0.001/199
total 149M
drwx------ 12 mona mona 4.0K Oct 23 15:23 ..
-rw-------  1 mona mona  50M Oct 23 15:23 model.pth
drwx------  2 mona mona 4.0K Oct 23 15:23 .
-rw-------  1 mona mona  99M Oct 23 15:23 optim.pth

chensong1995 commented 1 year ago

Hi Mona,

I just tried myself with PyTorch 1.2.0 and was not able to reproduce the issue. Can you calculate the md5 checksum for the two .pth files and verify if your download is good? I have:

song@xiaochong ~/D/H/s/a/c/0/199 (master)> md5sum model.pth
96ca7f9bae5628fe551434949d5950e1  model.pth
song@xiaochong ~/D/H/s/a/c/0/199 (master)> md5sum optim.pth
fd9e976740e63a8d2d82ce9570987712  optim.pth
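
The same comparison can be scripted. A minimal sketch using hashlib, with the two checksums above as the reference values; md5_of_file and verify_checkpoint are illustrative names, and the directory path is the one from the original command:

```python
import hashlib
import os

# Reference checksums posted above for the ape checkpoint at epoch 199.
EXPECTED_MD5 = {
    "model.pth": "96ca7f9bae5628fe551434949d5950e1",
    "optim.pth": "fd9e976740e63a8d2d82ce9570987712",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(load_dir, expected=EXPECTED_MD5):
    """Map each expected file name to True if its checksum matches.

    A missing file counts as a mismatch.
    """
    results = {}
    for name, want in expected.items():
        path = os.path.join(load_dir, name)
        results[name] = os.path.isfile(path) and md5_of_file(path) == want
    return results


if __name__ == "__main__":
    load_dir = "saved_weights/linemod/ape/checkpoints/0.001/199"
    for name, ok in verify_checkpoint(load_dir).items():
        print(name, "OK" if ok else "MISMATCH or missing")
```

Reading in chunks keeps memory use flat for the 50M and 99M files listed earlier. Running this right after a download catches a bad copy before torch.load produces a less obvious error.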

monajalal commented 1 year ago

Thanks a lot for checking.

(hp) mona@mona-ThinkStation-P7:~/HP/HybridPose$ cd /home/mona/HP/HybridPose/saved_weights/linemod/ape/checkpoints/0.001/199
(hp) mona@mona-ThinkStation-P7:~/HP/HybridPose/saved_weights/linemod/ape/checkpoints/0.001/199$ md5sum model.pth   
b545a494628c9db318bd248cde2823e7  model.pth
(hp) mona@mona-ThinkStation-P7:~/HP/HybridPose/saved_weights/linemod/ape/checkpoints/0.001/199$ md5sum optim.pth  
e2ad8564e72cae263a7434a5cb73aa21  optim.pth

My checksums are different from yours. I downloaded the weights from OneDrive. Please let me know if you need more information to look into this issue.

chensong1995 commented 1 year ago

Hi Mona,

Looks like the files you downloaded are corrupted. I just downloaded a fresh copy from the OneDrive link in the README and verified that the checksums are the same:

song@xiaochong ~/D/mona> ls
ape/  ape.zip
song@xiaochong ~/D/mona> cd ape/checkpoints/0.001/199/
song@xiaochong ~/D/m/a/c/0/199> ls
model.pth  optim.pth
song@xiaochong ~/D/m/a/c/0/199> cd -
song@xiaochong ~/D/mona> ls
ape/  ape.zip
song@xiaochong ~/D/mona> md5sum ape.zip
295e096aee296c3497a53dfff75c22c7  ape.zip
song@xiaochong ~/D/mona> cd ape/checkpoints/0.001/199/
song@xiaochong ~/D/m/a/c/0/199> md5sum model.pth
96ca7f9bae5628fe551434949d5950e1  model.pth
song@xiaochong ~/D/m/a/c/0/199> md5sum optim.pth
fd9e976740e63a8d2d82ce9570987712  optim.pth

monajalal commented 1 year ago

It's very strange, since I waited for each individual download to finish. I have now redownloaded the files and have exactly the same weights as you; we may have had network issues.

  model.load_state_dict(torch.load(os.path.join(args.load_dir, 'model.pth')))
  optim.load_state_dict(torch.load(os.path.join(args.load_dir, 'optim.pth')))

^^ I changed these lines back to their original form.

(hp) mona@mona-ThinkStation-P7:~/HP/HybridPose$ cd /home/mona/HP/HybridPose/saved_weights/linemod/ape/checkpoints/0.001/199
(hp) mona@mona-ThinkStation-P7:~/HP/HybridPose/saved_weights/linemod/ape/checkpoints/0.001/199$ md5sum model.pth 
96ca7f9bae5628fe551434949d5950e1  model.pth
(hp) mona@mona-ThinkStation-P7:~/HP/HybridPose/saved_weights/linemod/ape/checkpoints/0.001/199$ md5sum 
model.pth  optim.pth  
(hp) mona@mona-ThinkStation-P7:~/HP/HybridPose/saved_weights/linemod/ape/checkpoints/0.001/199$ md5sum optim.pth 
fd9e976740e63a8d2d82ce9570987712  optim.pth