Unity-Technologies / obstacle-tower-env

Obstacle Tower Environment
Apache License 2.0

GCP tutorial suggests using T4 GPU to save costs, but fails when using T4 GPU #51

Open Sohojoe opened 5 years ago

Sohojoe commented 5 years ago

Update: The GCP tutorial suggests using a T4 GPU to save costs, but the setup fails on a T4 GPU (error below).


Hi, I am following the tutorial "Training an Obstacle Tower agent using Dopamine and the Google Cloud Platform".

I am getting the error below. I believe the problem is this line from the X log: (EE) NVIDIA(GPU-0): UseDisplayDevice "None" is not supported with GRID. However, I'm not sure of the root cause.
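For anyone triaging a similar failure, here is a minimal sketch of filtering an Xorg log down to its (EE) error lines. The names xorg_errors, EE_LINE, and sample are hypothetical helpers made up for this illustration, and the sample text is a shortened excerpt of the log posted in this issue.

```python
import re

# Matches Xorg log lines flagged as errors, with or without a leading
# "[   386.482]" style timestamp. (Hypothetical helper, not part of any tool.)
EE_LINE = re.compile(r"^(?:\[\s*\d+\.\d+\]\s*)?\(EE\)\s*(.*)$")


def xorg_errors(log_text):
    """Return the non-empty messages of all (EE) lines in an Xorg log."""
    errors = []
    for line in log_text.splitlines():
        match = EE_LINE.match(line)
        # Skip bare "(EE)" separator lines that carry no message.
        if match and match.group(1).strip():
            errors.append(match.group(1).strip())
    return errors


# Shortened excerpt of the log from this issue.
sample = """\
[   385.877] (**) NVIDIA(0): Option "UseDisplayDevice" "None"
[   386.482] (EE) NVIDIA(GPU-0): UseDisplayDevice "None" is not supported with GRID
[   386.482] (EE) NVIDIA(GPU-0):     displayless
[   386.563] (EE) NVIDIA(0): Failing initialization of X screen 0
[   386.563] (EE)
"""

for message in xorg_errors(sample):
    print(message)
```

On a real machine the same function could be given the contents of /var/log/Xorg.0.log instead of the sample string. The first (EE) line is usually the most useful one, since later errors tend to be consequences of it.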

I was trying to use the T4 GPU to save money. I will try again with the default GPU.


After running

sudo /usr/bin/X :0 &
export DISPLAY=:0

I get this error:

X.Org X Server 1.19.2
Release Date: 2017-03-02
X Protocol Version 11, Revision 0
Build Operating System: Linux 4.9.0-8-amd64 x86_64 Debian
Current Operating System: Linux tensorflow-1-vm 4.9.0-8-amd64 #1 SMP Debian 4.9.130-2 (2018-10-27) x86_64
Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.9.0-8-amd64 root=UUID=995b3d50-0ab0-4faa-8296-ab743ab0fde7 ro net.ifnames=0 biosdevname=0 console=ttyS0,38400n8 elevator=noop scsi_mod.use_blk_mq=Y
Build Date: 03 November 2018  03:09:11AM
xorg-server 2:1.19.2-1+deb9u5 (https://www.debian.org/support) 
Current version of pixman: 0.34.0
    Before reporting problems, check http://wiki.x.org
    to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
    (++) from command line, (!!) notice, (II) informational,
    (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.0.log", Time: Thu Feb 14 01:06:15 2019
(==) Using config file: "/etc/X11/xorg.conf"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
(EE) 
Fatal server error:
(EE) no screens found(EE) 

/var/log/Xorg.0.log

[   385.871] (II) Module "ramdac" already built-in
[   385.877] (**) NVIDIA(0): Depth 24, (--) framebuffer bpp 32
[   385.877] (==) NVIDIA(0): RGB weight 888
[   385.877] (==) NVIDIA(0): Default visual is TrueColor
[   385.877] (==) NVIDIA(0): Using gamma correction (1.0, 1.0, 1.0)
[   385.877] (**) NVIDIA(0): Option "UseDisplayDevice" "None"
[   385.877] (**) NVIDIA(0): Enabling 2D acceleration
[   385.877] (**) NVIDIA(0): Option "UseDisplayDevice" set to "none"; enabling NoScanout
[   385.877] (**) NVIDIA(0):     mode
[   385.877] (II) Loading sub module "glxserver_nvidia"
[   385.877] (II) LoadModule: "glxserver_nvidia"
[   385.877] (II) Loading /usr/lib/xorg/modules/extensions/libglxserver_nvidia.so
[   385.882] (II) Module glxserver_nvidia: vendor="NVIDIA Corporation"
[   385.882]    compiled for 4.0.2, module version = 1.0.0
[   385.882]    Module class: X.Org Server Extension
[   385.882] (II) NVIDIA GLX Module  410.72  Wed Oct 17 20:11:21 CDT 2018
[   386.482] (EE) NVIDIA(GPU-0): UseDisplayDevice "None" is not supported with GRID
[   386.482] (EE) NVIDIA(GPU-0):     displayless
[   386.482] (EE) NVIDIA(GPU-0): Failed to select a display subsystem.
[   386.563] (EE) NVIDIA(0): Failing initialization of X screen 0
[   386.563] (II) UnloadModule: "nvidia"
[   386.563] (II) UnloadSubModule: "glxserver_nvidia"
[   386.563] (II) Unloading glxserver_nvidia
[   386.563] (II) UnloadSubModule: "wfb"
[   386.563] (II) UnloadSubModule: "fb"
[   386.563] (EE) Screen(s) found, but none have a usable configuration.
[   386.563] (EE)
Fatal server error:
[   386.563] (EE) no screens found(EE)
[   386.563] (EE)
Please consult the The X.Org Foundation support
         at http://wiki.x.org
 for help.
[   386.563] (EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
[   386.563] (EE)
[   386.564] (EE) Server terminated with error (1). Closing log file.

Sohojoe commented 5 years ago

OK, the problem is with the T4 GPU. I've been able to get it running with the default GPU.

It would be good to figure this out, as the T4 is a third of the price.

awjuliani commented 5 years ago

@ervteng Do you know about using different GPUs in this scenario?

ervteng commented 5 years ago

I've been able to use both T4 and P4 GPUs for training Unity environments (including Obstacle Tower). @Sohojoe do you have the /etc/X11/xorg.conf for the problematic machine?

Sohojoe commented 5 years ago

here you go:

# nvidia-xconfig: X configuration file generated by nvidia-xconfig
# nvidia-xconfig:  version 410.72

Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0"
    InputDevice    "Keyboard0" "CoreKeyboard"
    InputDevice    "Mouse0" "CorePointer"
EndSection

Section "Files"
EndSection

Section "InputDevice"

    # generated from default
    Identifier     "Mouse0"
    Driver         "mouse"
    Option         "Protocol" "auto"
    Option         "Device" "/dev/psaux"
    Option         "Emulate3Buttons" "no"
    Option         "ZAxisMapping" "4 5"
EndSection

Section "InputDevice"

    # generated from default
    Identifier     "Keyboard0"
    Driver         "kbd"
EndSection

Section "Monitor"
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync       28.0 - 33.0
    VertRefresh     43.0 - 72.0
    Option         "DPMS"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "Tesla T4"
    BusID          "PCI:0:4:0"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    Option         "UseDisplayDevice" "None"
    SubSection     "Display"
        Virtual     1280 1024
        Depth       24
    EndSubSection
EndSection

These are the options it gives me:

image

Arishtanemi2 commented 5 years ago

I've been getting the same error too. I am using a T4 and have completed all the previous steps. Here is my xorg.conf file:

# nvidia-xconfig: X configuration file generated by nvidia-xconfig
# nvidia-xconfig:  version 410.72
Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0"
    InputDevice    "Keyboard0" "CoreKeyboard"
    InputDevice    "Mouse0" "CorePointer"
EndSection
Section "Files"
EndSection
Section "InputDevice"
    # generated from default
    Identifier     "Mouse0"
    Driver         "mouse"
    Option         "Protocol" "auto"
    Option         "Device" "/dev/psaux"
    Option         "Emulate3Buttons" "no"
    Option         "ZAxisMapping" "4 5"
EndSection
Section "InputDevice"
    # generated from default
    Identifier     "Keyboard0"
    Driver         "kbd"
EndSection
Section "Monitor"
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync       28.0 - 33.0
    VertRefresh     43.0 - 72.0
    Option         "DPMS"
EndSection
Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "Tesla T4"
    BusID          "0:4:0"
    Option         "AllowEmptyInitialConfiguration"
EndSection
Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    Option         "UseDisplayDevice" "None"
    SubSection     "Display"
        Virtual     1280 1024
        Depth       24
    EndSubSection
EndSection

MetaZhi commented 5 years ago

Any suggestions on this? I have run into this issue as well.

MetaZhi commented 5 years ago

I found a solution that works for me:

Delete or comment out (with "#") the ServerLayout and Screen sections in the /etc/X11/xorg.conf file.
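That edit can be sketched as a small text transformation. This is only an illustration under stated assumptions: comment_out_sections and sample are hypothetical names, the sample is a trimmed version of the configuration posted earlier in this thread, and the code assumes each section starts with a Section "Name" line and ends with an EndSection line, as in the nvidia-xconfig output above.

```python
import re


def comment_out_sections(conf_text, names=("ServerLayout", "Screen")):
    """Prefix every line of the named xorg.conf sections with '#'.

    Hypothetical helper: assumes sections open with Section "Name" and
    close with EndSection, as in nvidia-xconfig generated files.
    """
    wanted = {name.lower() for name in names}
    out = []
    inside = False
    for line in conf_text.splitlines():
        # SubSection lines do not match here because "Section" must come
        # directly after the leading whitespace.
        start = re.match(r'\s*Section\s+"([^"]+)"', line, re.IGNORECASE)
        if start and start.group(1).lower() in wanted:
            inside = True
        out.append("#" + line if inside else line)
        # EndSubSection does not match this pattern, so nested
        # subsections stay inside the commented region.
        if inside and re.match(r"\s*EndSection\b", line, re.IGNORECASE):
            inside = False
    return "\n".join(out) + "\n"


# Trimmed version of the xorg.conf posted in this thread.
sample = '''\
Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Option         "UseDisplayDevice" "None"
    SubSection     "Display"
        Virtual     1280 1024
    EndSubSection
EndSection
'''

print(comment_out_sections(sample))
```

Because the UseDisplayDevice option lives in the Screen section, commenting out that section also disables the option named in the error. It is probably wise to keep a backup copy of /etc/X11/xorg.conf before writing any result back, since the file is owned by root and a broken configuration prevents X from starting.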

htdt commented 5 years ago

Same issue and same solution for a Tesla V100.

juge2 commented 5 years ago

For me, removing only Option "UseDisplayDevice" "none" from the "Screen" section also does the trick.
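A minimal sketch of this narrower change, assuming the option sits on its own uncommented line as in the configurations posted above. The names drop_use_display_device, USE_DISPLAY_DEVICE, and sample are hypothetical and exist only for this illustration.

```python
import re

# Matches an uncommented line such as:  Option "UseDisplayDevice" "None"
# (Hypothetical pattern; the value is not checked, only the option name.)
USE_DISPLAY_DEVICE = re.compile(r'^\s*Option\s+"UseDisplayDevice"', re.IGNORECASE)


def drop_use_display_device(conf_text):
    """Return conf_text without any uncommented UseDisplayDevice option line."""
    kept = [line for line in conf_text.splitlines()
            if not USE_DISPLAY_DEVICE.match(line)]
    return "\n".join(kept) + "\n"


# The Screen section from the xorg.conf posted in this thread.
sample = '''\
Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    Option         "UseDisplayDevice" "None"
    SubSection     "Display"
        Virtual     1280 1024
        Depth       24
    EndSubSection
EndSection
'''

print(drop_use_display_device(sample))
```

Compared with commenting out whole sections, this leaves the rest of the Screen section (depth and virtual resolution) in place, which may matter if other settings depend on it.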

zeromodule commented 3 years ago

@zhenghongzhi @juge2 guys, you've helped us so much! Thank you!