Torch/torchvision incompatibility in docker when running pretrained model

jzhanson commented 3 years ago

When I run evaluation with the pretrained models (or training) inside the ALFRED docker, there seems to be a torch/torchvision compatibility problem with my drivers (which may be due to a wonky driver setup on my end). The command I'm running is:

$ python3 models/eval/eval_seq2seq.py --model_path exp/model:seq2seq_im_mask,name:base30_pm010_sg010_01/best_seen.pth --eval_split valid_seen --data data/json_feat_2.1.0 --model models.model.seq2seq_im_mask --gpu --num_threads 1

If I run the pretrained model inside ai2thor-docker instead, I get the same error, but there I can update torch and torchvision to 1.6.0 and 0.7.0 respectively and the error goes away, only to run into this issue where the Unity process crashes immediately due to a driver mismatch.

I'm currently having a bit of difficulty building the ALFRED docker with a newer Python version (>= 3.6) that would allow me to upgrade torch and torchvision, but that's more of a "me" problem. The output of the command above:

{'tests_seen': 1533,
 'tests_unseen': 1529,
 'train': 21023,
 'valid_seen': 820,
 'valid_unseen': 821}
Loading:  exp/model:seq2seq_im_mask,name:base30_pm010_sg010_01/best_seen.pth
Downloading: "https://download.pytorch.org/models/resnet18-5c106cde.pth" to /home/jzhanson/.cache/torch/checkpoints/resnet18-5c106cde.pth
100%|###############################################| 46827520/46827520 [00:00<00:00, 65572873.57it/s]
Traceback (most recent call last):
  File "models/eval/eval_seq2seq.py", line 54, in <module>
    eval = EvalTask(args, manager)
  File "/home/jzhanson/alfred/models/eval/eval.py", line 53, in __init__
    self.model = self.model.to(torch.device('cuda'))
  File "/home/jzhanson/alfred_env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 386, in to
    return self._apply(convert)
  File "/home/jzhanson/alfred_env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/home/jzhanson/alfred_env/lib/python3.5/site-packages/torch/nn/modules/rnn.py", line 127, in _apply
    self.flatten_parameters()
  File "/home/jzhanson/alfred_env/lib/python3.5/site-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
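
For anyone hitting the same cuDNN error, it may help to record exactly which torch, torchvision, CUDA and cuDNN versions the environment sees before comparing against the GPU and driver. The following is only a sketch: `collect_env_report` is a made-up helper name, not part of ALFRED, and it avoids f-strings so it should also run on the Python 3.5 environment shown in the traceback.

```python
import importlib


def collect_env_report():
    """Return a dict describing the installed torch/torchvision/CUDA stack."""
    report = {"torch": None, "torchvision": None, "cuda_build": None,
              "cudnn": None, "gpu": None, "compute_capability": None}
    try:
        torch = importlib.import_module("torch")
    except Exception:
        return report  # torch is missing or failed to import
    report["torch"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    try:
        report["torchvision"] = importlib.import_module("torchvision").__version__
    except Exception as exc:  # a mismatched torchvision can fail on import
        report["torchvision"] = "import failed: {}".format(exc)
    if torch.cuda.is_available():
        report["cudnn"] = torch.backends.cudnn.version()
        report["gpu"] = torch.cuda.get_device_name(0)
        report["compute_capability"] = torch.cuda.get_device_capability(0)
    return report


if __name__ == "__main__":
    for key, value in sorted(collect_env_report().items()):
        print("{:>20}: {}".format(key, value))
```

If `gpu` and `cudnn` come back as None even though a GPU is present, torch could not initialise CUDA at all, which points at the driver or container setup rather than at the model code.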

jzhanson commented 3 years ago

FWIW/FYI, loading a checkpoint saved with torch/torchvision 1.6.0/0.7.0 under the torch/torchvision 1.1.0/0.3.0 from the ALFRED requirements.txt gives the following error:

{'tests_seen': 1533,
 'tests_unseen': 1529,
 'train': 21023,
 'valid_seen': 820,
 'valid_unseen': 821}
Loading:  exp/model:seq2seq_im_mask,name:pm_and_subgoals_01/best_seen.pth
Traceback (most recent call last):
  File "/home/jzhanson/alfred_env/lib/python3.5/tarfile.py", line 181, in nti
    s = nts(s, "ascii", "strict")
  File "/home/jzhanson/alfred_env/lib/python3.5/tarfile.py", line 165, in nts
    return s.decode(encoding, errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xba in position 1: ordinal not in range(128)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jzhanson/alfred_env/lib/python3.5/tarfile.py", line 2281, in next
    tarinfo = self.tarinfo.fromtarfile(self)
  File "/home/jzhanson/alfred_env/lib/python3.5/tarfile.py", line 1083, in fromtarfile
    obj = cls.frombuf(buf, tarfile.encoding, tarfile.errors)
  File "/home/jzhanson/alfred_env/lib/python3.5/tarfile.py", line 1025, in frombuf
    chksum = nti(buf[148:156])
  File "/home/jzhanson/alfred_env/lib/python3.5/tarfile.py", line 184, in nti
    raise InvalidHeaderError("invalid header")
tarfile.InvalidHeaderError: invalid header

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jzhanson/alfred_env/lib/python3.5/site-packages/torch/serialization.py", line 556, in _load
    return legacy_load(f)
  File "/home/jzhanson/alfred_env/lib/python3.5/site-packages/torch/serialization.py", line 467, in legacy_load
    with closing(tarfile.open(fileobj=f, mode='r:', format=tarfile.PAX_FORMAT)) as tar, \
  File "/home/jzhanson/alfred_env/lib/python3.5/tarfile.py", line 1577, in open
    return func(name, filemode, fileobj, **kwargs)
  File "/home/jzhanson/alfred_env/lib/python3.5/tarfile.py", line 1607, in taropen
    return cls(name, mode, fileobj, **kwargs)
  File "/home/jzhanson/alfred_env/lib/python3.5/tarfile.py", line 1472, in __init__
    self.firstmember = self.next()
  File "/home/jzhanson/alfred_env/lib/python3.5/tarfile.py", line 2293, in next
    raise ReadError(str(e))
tarfile.ReadError: invalid header

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "models/eval/eval_seq2seq.py", line 54, in <module>
    eval = EvalTask(args, manager)
  File "/home/jzhanson/alfred/models/eval/eval.py", line 31, in __init__
    self.model, optimizer = M.Module.load(self.args.model_path)
  File "/home/jzhanson/alfred/models/model/seq2seq.py", line 318, in load
    save = torch.load(fsave)
  File "/home/jzhanson/alfred_env/lib/python3.5/site-packages/torch/serialization.py", line 387, in load
    return _load(f, map_location, pickle_module, **pickle_load_args)
  File "/home/jzhanson/alfred_env/lib/python3.5/site-packages/torch/serialization.py", line 560, in _load
    raise RuntimeError("{} is a zip archive (did you mean to use torch.jit.load()?)".format(f.name))
RuntimeError: exp/model:seq2seq_im_mask,name:pm_and_subgoals_01/best_seen.pth is a zip archive (did you mean to use torch.jit.load()?)
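
For context on the last line of that traceback: PyTorch 1.6 switched `torch.save` to a zipfile-based format by default, and the torch 1.1 loader only knows the older formats, so it falls through to `legacy_load` and then reports the file as a zip archive. A quick way to tell which kind of file a checkpoint is, without importing torch at all, is sketched below. `checkpoint_format` is a hypothetical helper name, and "non-zip" only means the file is not a zip archive, not that it is guaranteed to be a valid older checkpoint.

```python
import sys
import zipfile


def checkpoint_format(path):
    """Return 'zip' for the torch >= 1.6 default format, else 'non-zip'."""
    return "zip" if zipfile.is_zipfile(path) else "non-zip"


if __name__ == "__main__":
    # Usage: python check_ckpt.py path/to/best_seen.pth [more paths...]
    for path in sys.argv[1:]:
        print("{}: {}".format(path, checkpoint_format(path)))
```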

MohitShridhar commented 3 years ago

@jzhanson, the first one is a GPU compatibility issue: https://github.com/askforalfred/alfred/issues/26

As for the second error, the pre-trained models aren't compatible across PyTorch versions.
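
One possible workaround for the "is a zip archive" part of the problem, offered as a sketch rather than a confirmed fix: under the newer torch that wrote the checkpoint, re-save it with `_use_new_zipfile_serialization=False`, which asks `torch.save` for the pre-1.6 format. `resave_legacy` is a made-up helper name, and this assumes a torch release around 1.6 as in the thread. It only changes the container format, so a checkpoint that pickles objects or layouts the older torch does not understand may still fail to load there.

```python
def resave_legacy(src_path, dst_path):
    """Re-save a checkpoint in the pre-1.6 (non-zip) serialization format.

    Run this under the newer torch (>= 1.6) that wrote the checkpoint.
    """
    import torch  # imported lazily so this file can be imported without torch

    # Load onto the CPU so no GPU is needed for the conversion.
    checkpoint = torch.load(src_path, map_location="cpu")
    # Ask torch.save for the old, non-zip on-disk format.
    torch.save(checkpoint, dst_path, _use_new_zipfile_serialization=False)
    return dst_path
```

Any custom classes stored inside the checkpoint must be importable in the environment doing the conversion, since `torch.load` unpickles them.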

Torch/torchvision incompatibility in docker when running pretrained model #50