askforalfred / alfred

ALFRED - A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
MIT License

UPDATE: Unity process crashes with driver mismatch inside ai2thor-docker with startx.py, Ubuntu 18.04 #49

Closed jzhanson closed 4 years ago

jzhanson commented 4 years ago

I've been following #48 because I'm also trying to run ALFRED evaluation with THOR on a headless machine where I don't have root access. So far, I've modified the ai2thor-docker repo so that it installs ai2thor==2.1.0 (I also had to add RUN pip3 install --upgrade torch torchvision to the Dockerfile, because PyTorch was at 1.1.0 instead of 1.6.0 and torchvision at 0.3.0 instead of 0.7.0, and I was getting errors like

{'tests_seen': 1533,
 'tests_unseen': 1529,
 'train': 21023,
 'valid_seen': 820,
 'valid_unseen': 821}
Loading:  exp/model:seq2seq_im_mask,name:pm_and_subgoals_01/best_seen.pth
Traceback (most recent call last):
  File "/usr/lib/python3.6/tarfile.py", line 188, in nti
    s = nts(s, "ascii", "strict")
  File "/usr/lib/python3.6/tarfile.py", line 172, in nts
    return s.decode(encoding, errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xba in position 1: ordinal not in range(128)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/tarfile.py", line 2299, in next
    tarinfo = self.tarinfo.fromtarfile(self)
  File "/usr/lib/python3.6/tarfile.py", line 1093, in fromtarfile
    obj = cls.frombuf(buf, tarfile.encoding, tarfile.errors)
  File "/usr/lib/python3.6/tarfile.py", line 1035, in frombuf
    chksum = nti(buf[148:156])
  File "/usr/lib/python3.6/tarfile.py", line 191, in nti
    raise InvalidHeaderError("invalid header")
tarfile.InvalidHeaderError: invalid header

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 556, in _load
    return legacy_load(f)
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 467, in legacy_load
    with closing(tarfile.open(fileobj=f, mode='r:', format=tarfile.PAX_FORMAT)) as tar, \
  File "/usr/lib/python3.6/tarfile.py", line 1591, in open
    return func(name, filemode, fileobj, **kwargs)
  File "/usr/lib/python3.6/tarfile.py", line 1621, in taropen
    return cls(name, mode, fileobj, **kwargs)
  File "/usr/lib/python3.6/tarfile.py", line 1484, in __init__
    self.firstmember = self.next()
  File "/usr/lib/python3.6/tarfile.py", line 2311, in next
    raise ReadError(str(e))
tarfile.ReadError: invalid header

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "models/eval/eval_seq2seq.py", line 54, in <module>
    eval = EvalTask(args, manager)
  File "/app/alfred/models/eval/eval.py", line 31, in __init__
    self.model, optimizer = M.Module.load(self.args.model_path)
  File "/app/alfred/models/model/seq2seq.py", line 318, in load
    save = torch.load(fsave)
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 387, in load
    return _load(f, map_location, pickle_module, **pickle_load_args)
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 560, in _load
    raise RuntimeError("{} is a zip archive (did you mean to use torch.jit.load()?)".format(f.name))
RuntimeError: exp/model:seq2seq_im_mask,name:pm_and_subgoals_01/best_seen.pth is a zip archive (did you mean to use torch.jit.load()?)

).
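For context, the final RuntimeError above is what an older PyTorch typically reports when it is handed a checkpoint in the zip-based format that torch.save writes by default from 1.6 on. A minimal sketch for checking which format a checkpoint file is in, using only the standard library; the function name is illustrative and not part of ALFRED or PyTorch:

```python
import zipfile


def checkpoint_is_zip_format(path):
    """Return True if the file at `path` is a zip archive.

    torch.save writes zip archives by default from PyTorch 1.6 on;
    older releases such as 1.1.0 cannot read that format.
    """
    return zipfile.is_zipfile(path)
```

If this returns True for best_seen.pth, the options are to load it with torch>=1.6, or to re-save it from a newer environment with torch.save(obj, path, _use_new_zipfile_serialization=False) so that the older release can read it.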

I started with a pretty naive approach: I moved my ALFRED repo, along with the quickstart data and the model checkpoints I wanted to evaluate, into the Docker build context and copied all of it into the Docker image (which takes a while, but that's a "me" problem). Unfortunately, I get a bus error when I try to run evaluation on my saved checkpoint, even if I generate the checkpoint by training inside the Docker container:

{'tests_seen': 1533,
 'tests_unseen': 1529,
 'train': 21023,
 'valid_seen': 820,
 'valid_unseen': 821}
Loading:  exp/model:seq2seq_im_mask,name:pm_and_subgoals_01/best_seen.pth
./test.sh: line 3:   117 Bus error               (core dumped) python3 models/eval/eval_seq2seq.py --model_path exp/model:seq2seq_im_mask,name:pm_and_subgoals_01/best_seen.pth --eval_split valid_seen --data data/json_feat_2.1.0 --model models.model.seq2seq_im_mask --gpu --num_threads 1

Update: I tried cloning the alfred repo, downloading the data, and training from scratch, all from inside the Docker container, but I hit the same issue.

The reason I used torch==1.6.0 and torchvision==0.7.0 instead of torch==1.1.0 and torchvision==0.3.0 is that doing so gets rid of this error:

Traceback (most recent call last):
  File "models/train/train_seq2seq.py", line 103, in <module>
    model = model.to(torch.device('cuda'))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 386, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py", line 127, in _apply
    self.flatten_parameters()
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

I suspect the bus error has to do with the version differences, but I'm not quite sure yet.

jzhanson commented 4 years ago

Update: I solved this problem by giving the Docker container more shared memory (--shm-size), as was done here. Now I'm getting a problem similar to this issue, where the Unity process crashes both when launched through the Python code and when run from the command line. The output of $ python3 models/eval/eval_seq2seq.py --model_path exp/model:seq2seq_im_mask,name:pm_and_subgoals_01/best_seen.pth --eval_split valid_seen --data data/json_feat_2.1.0 --model models.model.seq2seq_im_mask --gpu --num_threads 1 is

{'tests_seen': 1533,
 'tests_unseen': 1529,
 'train': 21023,
 'valid_seen': 820,
 'valid_unseen': 821}
Loading:  exp/model:seq2seq_im_mask,name:pm_and_subgoals_01/best_seen.pth
thor-201909061227-Linux64: [||||||||||||||||||||||||||||||||||||||||||| 100%  21.7 MiB/s]  of 390.MB
Found path: /root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64
Mono path[0] = '/root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64_Data/Managed'
Mono config path = '/root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64_Data/Mono/etc'
Unable to preload the following plugins:
        ScreenSelector.so
Display 0 'Smart Cable': 1024x768 (primary device).
PlayerPrefs - Creating folder: /root/.config/unity3d/Allen Institute for Artificial Intelligence
PlayerPrefs - Creating folder: /root/.config/unity3d/Allen Institute for Artificial Intelligence/AI2-Thor
Logging to /root/.config/unity3d/Allen Institute for Artificial Intelligence/AI2-Thor/Player.log
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/dist-packages/ai2thor/controller.py", line 697, in _start_unity_thread
    raise Exception("command: %s exited with %s" % (command, returncode))
Exception: command: ['/root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64', '-screen-fullscreen', '0', '-screen-quality', '4', '-screen-width', '300', '-screen-height', '300'] exited with 1

And the contents of /root/.config/unity3d/Allen Institute for Artificial Intelligence/AI2-Thor/Player.log are

Desktop is 1024 x 768 @ 60 Hz
Unable to find a supported OpenGL core profile
Failed to create valid graphics context: please ensure you meet the minimum requirements
E.g. OpenGL core profile 3.2 or later for OpenGL Core renderer
Vulkan detection: 0
No supported renderers found, exiting

(Filename:  Line: 634)
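Since the controller traceback only says the process exited with 1, the useful information is in Player.log. A minimal sketch of scanning that log for the failure lines shown above; the hint texts are an interpretation rather than Unity's wording, and the function and constant names are hypothetical:

```python
# Substrings copied from the Player.log shown above, mapped to a hint.
# The hints are an interpretation, not official Unity diagnostics.
KNOWN_FAILURES = {
    "Unable to find a supported OpenGL core profile":
        "Unity could not get an OpenGL core context from the X display",
    "No supported renderers found":
        "no usable renderer was detected, so the player exited",
}


def diagnose_player_log(text):
    """Return a hint for every known failure substring found in `text`."""
    return [hint for needle, hint in KNOWN_FAILURES.items() if needle in text]
```

Reading the file and passing its contents to diagnose_player_log would return both hints for the log above, which points at the display and driver setup rather than at the ALFRED code.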

Running the Unity executable directly from the command line ($ /root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64 -screen-fullscreen 0 -screen-quality 4 -screen-width 300 -screen-height 300) gives the output

Found path: /root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64
Mono path[0] = '/root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64_Data/Managed'
Mono config path = '/root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64_Data/Mono/etc'
Unable to preload the following plugins:
        ScreenSelector.so
Display 0 'Smart Cable': 1024x768 (primary device).
Logging to /root/.config/unity3d/Allen Institute for Artificial Intelligence/AI2-Thor/Player.log

It also crashes, and /root/.config/unity3d/Allen Institute for Artificial Intelligence/AI2-Thor/Player.log contains

Desktop is 1024 x 768 @ 60 Hz
Unable to find a supported OpenGL core profile
Failed to create valid graphics context: please ensure you meet the minimum requirements
E.g. OpenGL core profile 3.2 or later for OpenGL Core renderer
Vulkan detection: 0
No supported renderers found, exiting

(Filename:  Line: 634)

ekolve commented 4 years ago

How did you start the docker container? Did you use the scripts/run.sh script from the ai2thor-docker repo? Do you have a running X11 server within the container? The example_agent.py will launch this in a separate thread. What model GPU are you running with?

jzhanson commented 4 years ago

Thanks for the response.

How did you start the docker container? Did you use the scripts/run.sh script from the ai2thor-docker repo?

Yes, I was using a modified scripts/run.sh, the only differences being the addition of --shm-size 8G and running bash instead of python3 example_agent.py.
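To confirm that --shm-size actually took effect inside the container, the size of /dev/shm can be read directly. A minimal sketch using only the standard library; the function name is hypothetical:

```python
import shutil


def shm_total_gib(path="/dev/shm"):
    """Return the total size of the filesystem holding `path`, in GiB."""
    return shutil.disk_usage(path).total / 2**30
```

Calling shm_total_gib() inside the container should report roughly 8 when it was started with --shm-size 8G. Docker's default is 64 MB, which is small enough to be a plausible cause of the earlier bus error.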

Do you have a running X11 server within the container? The example_agent.py will launch this in a separate thread.

Not when running example_agent.py, but yes when running the ALFRED evaluation code. There is also an Xorg process running on the machine that isn't mine: its parent process is gdm-xsession and its great-grandparent process is gdm3, so I'm pretty sure it has to do with the desktop GUI <-> GPU interface.
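One quick way to see whether a local X server is visible from inside the container is to look for its Unix socket, which by convention lives at /tmp/.X11-unix/X<N> for display :<N>. A minimal sketch under that assumption; both function names are hypothetical:

```python
import os
import re


def x_socket_path(display):
    """Map a local DISPLAY value such as ':0' or ':0.0' to its socket path."""
    match = re.fullmatch(r":(\d+)(?:\.\d+)?", display)
    if match is None:
        raise ValueError(f"not a local display string: {display!r}")
    return f"/tmp/.X11-unix/X{match.group(1)}"


def local_x_server_present(display=None):
    """Return True if the socket for `display` (default: $DISPLAY) exists."""
    if display is None:
        display = os.environ.get("DISPLAY", "")
    try:
        return os.path.exists(x_socket_path(display))
    except ValueError:
        # Empty or remote display strings have no local socket to check.
        return False
```

An existing socket only shows that some X server was started for that display. It does not show that the server is still alive or that it offers a usable OpenGL context, so it is a first check rather than a full diagnosis.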

What model GPU are you running with?

Quadro RTX 8000. The OpenGL versions that glxinfo reports after I run the startx.py script in the background are:

server glx version string: 1.4
client glx version string: 1.4
GLX version: 1.4
OpenGL core profile version string: 4.6.0 NVIDIA 450.57
OpenGL core profile shading language version string: 4.60 NVIDIA
OpenGL version string: 4.6.0 NVIDIA 450.57
OpenGL shading language version string: 4.60 NVIDIA
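Player.log asks for OpenGL core profile 3.2 or later, so the glxinfo output can be checked against that threshold. A minimal sketch that parses the core profile line; the function names are hypothetical and the regular expression assumes the line format shown above:

```python
import re


def core_profile_version(glxinfo_output):
    """Extract (major, minor) from the 'OpenGL core profile version string' line."""
    match = re.search(
        r"OpenGL core profile version string:\s*(\d+)\.(\d+)", glxinfo_output
    )
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def meets_unity_minimum(glxinfo_output, minimum=(3, 2)):
    """True if the reported core profile is at least `minimum`."""
    version = core_profile_version(glxinfo_output)
    return version is not None and version >= minimum
```

For the output above this gives (4, 6), well past the minimum. That suggests the Unity process is not seeing the same display or driver that glxinfo sees.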

ekolve commented 4 years ago

Would it be possible to try running the ai2thor-docker example but remove the $X11_PARAMS argument to docker run?

jzhanson commented 4 years ago

Thanks for the suggestion. Unfortunately, the same problem persists. I ran docker run --privileged --shm-size 8G -it ai2thor-docker:latest bash and then python3 example_agent.py inside the container, and the output is:

root@17c36349fe51:/app# python3 example_agent.py

X.Org X Server 1.19.6
Release Date: 2017-12-20
X Protocol Version 11, Revision 0
Build Operating System: Linux 4.15.0-115-generic x86_64 Ubuntu
Current Operating System: Linux 17c36349fe51 5.4.0-42-generic #46~18.04.1-Ubuntu SMP Fri Jul 10 07:21:24 UTC 2020 x86_64
Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.4.0-42-generic root=UUID=fd242869-5ed6-4010-a7c4-9171df38a426 ro quiet splash vt.handoff=1
Build Date: 04 September 2020  03:34:39PM
xorg-server 2:1.19.6-1ubuntu4.6 (For technical support please see http://www.ubuntu.com/support)
Current version of pixman: 0.34.0
        Before reporting problems, check http://wiki.x.org
        to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
        (++) from command line, (!!) notice, (II) informational,
        (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.0.log", Time: Wed Sep 30 01:13:31 2020
(++) Using config file: "/tmp/tmp09fv348b"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
PlayerPrefs - Creating folder: /root/.config/unity3d/unknown
PlayerPrefs - Creating folder: /root/.config/unity3d/unknown/unknown
Unable to load player prefs
Found path: /root/.ai2thor/releases/thor-Linux64-8db5080010a07f037367ad6be0fd83d8f5f75240/thor-Linux64-8db5080010a07f037367ad6be0fd83d8f5f75240
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ai2thor/wsgi_server.py", line 39, in queue_get
    res = que.get(block=True, timeout=0.5)
  File "/usr/lib/python3.6/queue.py", line 172, in get
    raise Empty
queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "example_agent.py", line 8, in <module>
    controller = ai2thor.controller.Controller(scene='FloorPlan28')
  File "/usr/local/lib/python3.6/dist-packages/ai2thor/controller.py", line 426, in __init__
    host=host
  File "/usr/local/lib/python3.6/dist-packages/ai2thor/controller.py", line 929, in start
    self.last_event = self.server.receive()
  File "/usr/local/lib/python3.6/dist-packages/ai2thor/wsgi_server.py", line 212, in receive
    return queue_get(self.request_queue, self.unity_proc)
  File "/usr/local/lib/python3.6/dist-packages/ai2thor/wsgi_server.py", line 45, in queue_get
    raise Exception("Unity process exited %s" % unity_proc.returncode)
Exception: Unity process exited 1

The drivers also report the same version as before when I test them by running startx.py and then glxinfo.

MohitShridhar commented 4 years ago

Can you try this: https://github.com/askforalfred/alfred#run-headless

jzhanson commented 4 years ago

It works. I still have to deal with the cuDNN error, but it looks like THOR and the evaluation are running fine now. Thank you, @MohitShridhar!

Closing this issue, but I'll leave its sibling on the ai2thor-docker repo open, since I'm not sure whether the problem is solved when using ai2thor-docker.

MohitShridhar commented 4 years ago

@jzhanson regarding the cuDNN error, are you using an RTX 2080? It seems CUDA 9 is not compatible with RTX 2080s: https://github.com/pytorch/pytorch/issues/17543

jzhanson commented 4 years ago

No, I'm using a Quadro RTX 8000, but the ALFRED docker does install CUDA 9.0. I'll play around with putting CUDA 11.0 into it with FROM nvidia/cuda:11.0-devel-ubuntu18.04 today.
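The Quadro RTX 8000 is a Turing card like the RTX 2080, so the same CUDA 9 limitation plausibly applies. A minimal sketch of that compatibility check; the table is written from memory of NVIDIA's release notes and should be verified before relying on it, and the names are hypothetical:

```python
# Minimum CUDA toolkit release that can target each compute capability.
# Assumption: values recalled from NVIDIA release notes, not taken from this thread.
MIN_CUDA_FOR_COMPUTE_CAPABILITY = {
    (7, 0): (9, 0),   # Volta
    (7, 5): (10, 0),  # Turing: RTX 2080, Quadro RTX 8000
    (8, 0): (11, 0),  # Ampere (A100)
}


def toolkit_supports(compute_capability, cuda_version):
    """Return True if `cuda_version` is new enough for `compute_capability`."""
    return cuda_version >= MIN_CUDA_FOR_COMPUTE_CAPABILITY[compute_capability]
```

Under these assumptions, toolkit_supports((7, 5), (9, 0)) is False and toolkit_supports((7, 5), (11, 0)) is True, which is consistent with the cuDNN failure under CUDA 9.0 and with trying a CUDA 11.0 base image.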

jzhanson commented 4 years ago

No dice with the upgraded CUDA; I get the same Unity crash as above.

I also tried giving the Docker container more memory by adding cmd += ' --shm-size 40G' to scripts/docker_run.py.
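Appending flags to a command string works, but building the command from a list makes optional flags such as --shm-size easier to toggle. A minimal sketch of that idea; it is not the contents of scripts/docker_run.py, and build_docker_run and its parameters are hypothetical:

```python
import shlex


def build_docker_run(image, shm_size=None, command="bash"):
    """Assemble a `docker run` command line as a single shell-safe string."""
    parts = ["docker", "run", "--privileged", "-it"]
    if shm_size is not None:
        # Only add the flag when a size was requested.
        parts += ["--shm-size", shm_size]
    parts += [image, command]
    return " ".join(shlex.quote(p) for p in parts)
```

For example, build_docker_run("ai2thor-docker:latest", shm_size="8G") produces the same flags as the docker run command quoted earlier in this thread, in a slightly different order.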

I'll try upgrading CUDA to 11.0 and then building PyTorch with that.

jzhanson commented 4 years ago

I tried with an updated torch as well, but Unity still crashes, so I threw up my hands and wrapped every model.to(torch.device('cuda')) and torch.device('cuda') call in try/except, and it works fine.
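A minimal sketch of what such a try/except wrapper could look like. The thread does not say what the except branch does, so the CPU fallback here is an assumption, and move_to_device is a hypothetical helper rather than code from ALFRED:

```python
def move_to_device(model, preferred="cuda", fallback="cpu"):
    """Try `model.to(preferred)`; on RuntimeError, report it and fall back.

    Works with any object exposing a torch-style `.to(device)` method.
    Returns (model, name_of_device_actually_used).
    """
    try:
        return model.to(preferred), preferred
    except RuntimeError as err:
        # Surface the failure instead of hiding it.
        print(f"could not move model to {preferred}: {err}; using {fallback}")
        return model.to(fallback), fallback
```

Printing the error and returning the device that was actually used keeps the workaround visible. A bare try/except that swallows the exception would let evaluation run on the CPU without anyone noticing.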

Maybe I can get CUDA 11.0, torch 1.6.0, and torchvision 0.7.0 working with a bit more hacking on nvidia-xconfig, but I'm going to table this for now.