Update: I solved this problem by giving the Docker container more memory, as was done here. Now I'm getting a problem similar to this issue, where the Unity process crashes both when launched from the Python code and when run directly on the command line. The output of
$ python3 models/eval/eval_seq2seq.py --model_path exp/model:seq2seq_im_mask,name:pm_and_subgoals_01/best_seen.pth --eval_split valid_seen --data data/json_feat_2.1.0 --model models.model.seq2seq_im_mask --gpu --num_threads 1
is
{'tests_seen': 1533,
'tests_unseen': 1529,
'train': 21023,
'valid_seen': 820,
'valid_unseen': 821}
Loading: exp/model:seq2seq_im_mask,name:pm_and_subgoals_01/best_seen.pth
thor-201909061227-Linux64: [||||||||||||||||||||||||||||||||||||||||||| 100% 21.7 MiB/s] of 390 MB
Found path: /root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64
Mono path[0] = '/root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64_Data/Managed'
Mono config path = '/root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64_Data/Mono/etc'
Unable to preload the following plugins:
ScreenSelector.so
Display 0 'Smart Cable': 1024x768 (primary device).
PlayerPrefs - Creating folder: /root/.config/unity3d/Allen Institute for Artificial Intelligence
PlayerPrefs - Creating folder: /root/.config/unity3d/Allen Institute for Artificial Intelligence/AI2-Thor
Logging to /root/.config/unity3d/Allen Institute for Artificial Intelligence/AI2-Thor/Player.log
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.6/dist-packages/ai2thor/controller.py", line 697, in _start_unity_thread
raise Exception("command: %s exited with %s" % (command, returncode))
Exception: command: ['/root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64', '-screen-fullscreen', '0', '-screen-quality', '4', '-screen-width', '300', '-screen-height', '300'] exited with 1
And the contents of /root/.config/unity3d/Allen Institute for Artificial Intelligence/AI2-Thor/Player.log are
Desktop is 1024 x 768 @ 60 Hz
Unable to find a supported OpenGL core profile
Failed to create valid graphics context: please ensure you meet the minimum requirements
E.g. OpenGL core profile 3.2 or later for OpenGL Core renderer
Vulkan detection: 0
No supported renderers found, exiting
(Filename: Line: 634)
Running the Unity executable directly from the command line ($ /root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64 -screen-fullscreen 0 -screen-quality 4 -screen-width 300 -screen-height 300) gives the output
Found path: /root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64
Mono path[0] = '/root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64_Data/Managed'
Mono config path = '/root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64_Data/Mono/etc'
Unable to preload the following plugins:
ScreenSelector.so
Display 0 'Smart Cable': 1024x768 (primary device).
Logging to /root/.config/unity3d/Allen Institute for Artificial Intelligence/AI2-Thor/Player.log
and it also crashes; /root/.config/unity3d/Allen Institute for Artificial Intelligence/AI2-Thor/Player.log again contains
Desktop is 1024 x 768 @ 60 Hz
Unable to find a supported OpenGL core profile
Failed to create valid graphics context: please ensure you meet the minimum requirements
E.g. OpenGL core profile 3.2 or later for OpenGL Core renderer
Vulkan detection: 0
No supported renderers found, exiting
(Filename: Line: 634)
How did you start the docker container? Did you use the scripts/run.sh script from the ai2thor-docker repo? Do you have a running X11 server within the container? The example_agent.py will launch this in a separate thread. What model GPU are you running with?
Thanks for the response.
> How did you start the docker container? Did you use the scripts/run.sh script from the ai2thor-docker repo?
Yes, I was using a modified scripts/run.sh; the only differences were adding --shm-size 8G and running bash instead of python3 example_agent.py.
> Do you have a running X11 server within the container? The example_agent.py will launch this in a separate thread.
No when running example_agent.py, yes when running the ALFRED evaluation code. There is also an Xorg process running on the machine that isn't mine; its parent process is gdm-xsession and its great-grandparent process is gdm3, so I'm pretty sure it has to do with the desktop GUI <-> GPU interface.
> What model GPU are you running with?
Quadro RTX 8000, and the OpenGL versions I get from glxinfo after running the startx.py script in the background are
server glx version string: 1.4
client glx version string: 1.4
GLX version: 1.4
OpenGL core profile version string: 4.6.0 NVIDIA 450.57
OpenGL core profile shading language version string: 4.60 NVIDIA
OpenGL version string: 4.6.0 NVIDIA 450.57
OpenGL shading language version string: 4.60 NVIDIA
Would it be possible to try running the ai2thor-docker example but remove the $X11_PARAMS argument to docker run?
Thanks for the suggestion, but unfortunately the same problem is still sticking around. I ran docker run --privileged --shm-size 8G -it ai2thor-docker:latest bash and then python3 example_agent.py inside the container, and the output is:
root@17c36349fe51:/app# python3 example_agent.py
X.Org X Server 1.19.6
Release Date: 2017-12-20
X Protocol Version 11, Revision 0
Build Operating System: Linux 4.15.0-115-generic x86_64 Ubuntu
Current Operating System: Linux 17c36349fe51 5.4.0-42-generic #46~18.04.1-Ubuntu SMP Fri Jul 10 07:21:24 UTC 2020 x86_64
Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.4.0-42-generic root=UUID=fd242869-5ed6-4010-a7c4-9171df38a426 ro quiet splash vt.handoff=1
Build Date: 04 September 2020 03:34:39PM
xorg-server 2:1.19.6-1ubuntu4.6 (For technical support please see http://www.ubuntu.com/support)
Current version of pixman: 0.34.0
Before reporting problems, check http://wiki.x.org
to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
(++) from command line, (!!) notice, (II) informational,
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.0.log", Time: Wed Sep 30 01:13:31 2020
(++) Using config file: "/tmp/tmp09fv348b"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
PlayerPrefs - Creating folder: /root/.config/unity3d/unknown
PlayerPrefs - Creating folder: /root/.config/unity3d/unknown/unknown
Unable to load player prefs
Found path: /root/.ai2thor/releases/thor-Linux64-8db5080010a07f037367ad6be0fd83d8f5f75240/thor-Linux64-8db5080010a07f037367ad6be0fd83d8f5f75240
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/ai2thor/wsgi_server.py", line 39, in queue_get
res = que.get(block=True, timeout=0.5)
File "/usr/lib/python3.6/queue.py", line 172, in get
raise Empty
queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "example_agent.py", line 8, in <module>
controller = ai2thor.controller.Controller(scene='FloorPlan28')
File "/usr/local/lib/python3.6/dist-packages/ai2thor/controller.py", line 426, in __init__
host=host
File "/usr/local/lib/python3.6/dist-packages/ai2thor/controller.py", line 929, in start
self.last_event = self.server.receive()
File "/usr/local/lib/python3.6/dist-packages/ai2thor/wsgi_server.py", line 212, in receive
return queue_get(self.request_queue, self.unity_proc)
File "/usr/local/lib/python3.6/dist-packages/ai2thor/wsgi_server.py", line 45, in queue_get
raise Exception("Unity process exited %s" % unity_proc.returncode)
Exception: Unity process exited 1
Drivers are the same version as well, if I test them by running startx.py + glxinfo.
Can you try this: https://github.com/askforalfred/alfred#run-headless
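For context, the linked instructions amount to starting an X server yourself and pointing the evaluation at it. Here is a rough sketch of that flow; the scripts/startx.py path and display number are assumptions based on this thread, so follow the README for the exact steps:

```python
# Rough sketch of the headless recipe linked above, not the exact README steps.
# Assumptions: scripts/startx.py (as mentioned elsewhere in this thread) starts an
# X server on display :0, and the eval command is the one from the top of this issue.
import os
import subprocess
import time

# Start a virtual X server in the background.
xserver = subprocess.Popen(["python3", "scripts/startx.py", "0"])
time.sleep(5)  # give the X server a moment to come up

# Point THOR/Unity at that display and run evaluation.
env = dict(os.environ, DISPLAY=":0")
subprocess.run(
    [
        "python3", "models/eval/eval_seq2seq.py",
        "--model_path", "exp/model:seq2seq_im_mask,name:pm_and_subgoals_01/best_seen.pth",
        "--eval_split", "valid_seen",
        "--data", "data/json_feat_2.1.0",
        "--model", "models.model.seq2seq_im_mask",
        "--gpu", "--num_threads", "1",
    ],
    env=env,
    check=True,
)

xserver.terminate()
```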
It works! I still have to deal with the cuDNN error issue, but it looks like THOR and the evaluation are running fine now. Thank you, @MohitShridhar!
Closing this issue, but I'll leave its sibling on the ai2thor-docker repo open, since I'm not sure whether it's solved when using ai2thor-docker.
@jzhanson regarding the cuDNN error, are you using an RTX 2080? It seems CUDA 9 is not compatible with RTX 2080s: https://github.com/pytorch/pytorch/issues/17543
No, I'm using a Quadro RTX 8000, but the ALFRED docker does install CUDA 9.0. I'll play around with putting CUDA 11.0 into it with FROM nvidia/cuda:11.0-devel-ubuntu18.04 today.
No dice with the upgraded CUDA; I get the same Unity crash as above.
I also tried giving the Docker container more memory by adding cmd += ' --shm-size 40G' to scripts/docker_run.py.
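For anyone reproducing this, the tweak just appends the flag to the docker run command string that the script builds. A sketch with made-up surrounding code (the real scripts/docker_run.py differs):

```python
# Illustrative only: shows where a --shm-size flag slots into a docker run command
# string like the one scripts/docker_run.py builds. Variable names and the image
# name here are placeholders, not the actual file contents.
import subprocess

def build_docker_cmd(image="alfred:latest"):
    cmd = "docker run --privileged -it"
    cmd += " --shm-size 40G"  # enlarge /dev/shm so Unity and the DataLoaders have headroom
    cmd += f" {image} bash"
    return cmd

if __name__ == "__main__":
    subprocess.call(build_docker_cmd(), shell=True)
```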
I'll try upgrading CUDA to 11.0 and then building PyTorch against it.
Tried with the updated torch as well but still got the Unity crash, so I threw up my hands and wrapped every model.to(torch.device('cuda')) and torch.device('cuda') call in try/except, and it works fine.
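In case it helps anyone else, the workaround is basically this pattern (a minimal sketch with a placeholder model, not the actual ALFRED code):

```python
# Minimal sketch of the try/except fallback described above: try the GPU, and fall
# back to CPU if CUDA/cuDNN is unusable inside the container. `model` here is a
# stand-in, not the ALFRED seq2seq model.
import torch
import torch.nn as nn

model = nn.Linear(10, 2)

try:
    device = torch.device('cuda')
    model = model.to(device)
    model(torch.zeros(1, 10, device=device))  # sanity check so failures surface here
except Exception:
    device = torch.device('cpu')
    model = model.to(device)

print(f"running on {device}")
```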
Maybe I can use CUDA 11.0 with torch 1.6.0 and torchvision 0.7.0 with a bit more hacking on nvidia-xconfig, but I'm going to table this for now.
I've been following along with #48 since I'm also trying to run ALFRED evaluation with THOR on a headless machine where I don't have root access. So far, I've modified the ai2thor-docker repo so that it installs ai2thor==2.1.0. I also had to add RUN pip3 install --upgrade torch torchvision to the Dockerfile because there were compatibility issues with pytorch being 1.1.0 instead of 1.6.0 and torchvision being 0.3.0 instead of 0.7.0, which were causing errors.
I started with a pretty naive approach where I just moved my ALFRED repo, with the quickstart data and the model checkpoints I wanted to evaluate, into the Docker build context and copied all of it into the Docker image (which takes a while, but that's a "me" problem). Unfortunately, I get a bus error when attempting to run evaluation on my saved checkpoint, even if I generate the checkpoint by training inside the Docker container:
Update: I tried cloning the alfred repo, downloading the data from inside the Docker container, and training from scratch, but hit the same issue.
The reason I used torch==1.6.0 and torchvision==0.7.0 instead of torch==1.1.0 and torchvision==0.3.0 is that it silences the error
I suspect the bus error has to do with the version differences, but I'm not quite sure yet.
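If it helps with debugging, a quick sanity check inside the container to confirm which versions actually got installed (and whether CUDA is visible at all):

```python
# Quick version/CUDA sanity check inside the container.
import torch
import torchvision

print("torch:", torch.__version__)              # expecting 1.6.0
print("torchvision:", torchvision.__version__)  # expecting 0.7.0
print("cuda available:", torch.cuda.is_available())
```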