devendrachaplot / Neural-SLAM

Pytorch code for ICLR-20 Paper "Learning to Explore using Active Neural SLAM"
http://www.cs.cmu.edu/~dchaplot/projects/neural-slam.html
MIT License

Error about EGL devices and CUDA #2

Closed SgtVincent closed 4 years ago

SgtVincent commented 4 years ago

Hi! I followed your installation instructions and hit an error when trying to verify the installation with the command python main.py -n1 --auto_gpu_config 0 --split val in the project directory. The printed error log:

WARNING:root:This caffe2 python run does not have GPU support. Will run in CPU only mode.
Loading data/scene_datasets/gibson/Cantwell.glb
2020-07-14 20:59:15,185 initializing sim Sim-v0
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0714 20:59:15.188901 19551 WindowlessContext.cpp:98] [EGL] Detected 1 EGL devices
F0714 20:59:15.211788 19551 WindowlessContext.cpp:112] Check failed: eglDevId < numDevices [EGL] Could not find an EGL device for CUDA device -1
*** Check failure stack trace: ***
Traceback (most recent call last):
  File "main.py", line 769, in <module>
    main()
  File "main.py", line 119, in main
    envs = make_vec_envs(args)
  File "/home/chenjunting/habitat_ws/Neural-SLAM/env/__init__.py", line 7, in make_vec_envs
    envs = construct_envs(args)
  File "/home/chenjunting/habitat_ws/Neural-SLAM/env/habitat/__init__.py", line 102, in construct_envs
    range(args.num_processes))
  File "/home/chenjunting/habitat_ws/Neural-SLAM/env/habitat/habitat_api/habitat/core/vector_env.py", line 117, in __init__
    read_fn() for read_fn in self._connection_read_fns
  File "/home/chenjunting/habitat_ws/Neural-SLAM/env/habitat/habitat_api/habitat/core/vector_env.py", line 117, in <listcomp>
    read_fn() for read_fn in self._connection_read_fns
  File "/home/chenjunting/anaconda3/envs/habitat/lib/python3.6/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/chenjunting/anaconda3/envs/habitat/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/chenjunting/anaconda3/envs/habitat/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Exception ignored in: <bound method VectorEnv.__del__ of <env.habitat.habitat_api.habitat.core.vector_env.VectorEnv object at 0x7f2b9f0d6eb8>>
Traceback (most recent call last):
  File "/home/chenjunting/habitat_ws/Neural-SLAM/env/habitat/habitat_api/habitat/core/vector_env.py", line 487, in __del__
    self.close()
  File "/home/chenjunting/habitat_ws/Neural-SLAM/env/habitat/habitat_api/habitat/core/vector_env.py", line 351, in close
    write_fn((CLOSE_COMMAND, None))
  File "/home/chenjunting/anaconda3/envs/habitat/lib/python3.6/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/chenjunting/anaconda3/envs/habitat/lib/python3.6/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/home/chenjunting/anaconda3/envs/habitat/lib/python3.6/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

I found a related issue in habitat-sim. I tried reinstalling the nvidia driver, setting CUDA_VISIBLE_DEVICES, and checking all the libgl*.so libraries, but none of this helped. I also tried installing the Neural-SLAM repo on two machines; both failed with the same error shown above.
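
For reference, this is roughly the kind of check I mean by "setting CUDA_VISIBLE_DEVICES" (a minimal sketch, not from the repo; the device index 0 is just an assumption for a single-GPU machine like machine 1):

# Hypothetical sanity check used while debugging (not part of the repo):
# confirm that PyTorch itself can see a CUDA device before habitat-sim
# tries to map it to an EGL device.
import os

# Assumption: single-GPU machine, so device index 0.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

print("torch.cuda.is_available():", torch.cuda.is_available())
print("torch.cuda.device_count():", torch.cuda.device_count())
if torch.cuda.is_available():
    print("current device:", torch.cuda.get_device_name(torch.cuda.current_device()))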

System info of machine 1:

$cat /proc/version
Linux version 4.15.0-45-generic (buildd@lcy01-amd64-027) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10)) #48~16.04.1-Ubuntu SMP Tue Jan 29 18:03:48 UTC 2019

$nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.43       Driver Version: 418.43       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   45C    P8    14W / 250W |     26MiB / 10986MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     26651      G   /usr/lib/xorg/Xorg                            24MiB |
+-----------------------------------------------------------------------------+

System info of machine 2:

$cat /proc/version
Linux version 4.4.0-185-generic (buildd@lgw01-amd64-017) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12) ) #215-Ubuntu SMP Mon Jun 8 21:53:19 UTC 2020

$nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:02:00.0 Off |                  N/A |
| 19%   29C    P0    61W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            Off  | 00000000:03:00.0 Off |                  N/A |
| 17%   32C    P0    61W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            Off  | 00000000:81:00.0 Off |                  N/A |
| 17%   28C    P0    60W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp            Off  | 00000000:82:00.0 Off |                  N/A |
| 17%   28C    P0    56W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Can you give me some hints about this bug?

devendrachaplot commented 4 years ago

Hi, this seems like an issue with the habitat installation. A quick way to check is to run examples/benchmark.py in the habitat-api directory (where you installed habitat-api, not the submodule within the Neural-SLAM directory). If it throws an error, it indicates that habitat-sim or habitat-api is not installed correctly.
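
An even smaller smoke test along the same lines is the snippet below (adapted from the habitat-api README; the exact config path and API can differ depending on the habitat-api version you installed). If the installation is broken, it should surface the same EGL/CUDA error as above:

# Minimal habitat-api smoke test (adapted from the habitat-api README;
# config path / API may differ across habitat-api versions).
import habitat

# Load the PointNav task with the default config shipped with habitat-api.
env = habitat.Env(config=habitat.get_config("configs/tasks/pointnav.yaml"))

observations = env.reset()

# Step through one episode with random actions; this exercises the same
# habitat-sim rendering path that fails with the EGL error in the log above.
while not env.episode_over:
    observations = env.step(env.action_space.sample())

env.close()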