chuangg / tdw-transport-challenge-starter-code


Performance issue when running in docker with GPU #9

Open meier-johannes94 opened 3 years ago

meier-johannes94 commented 3 years ago

@abhi1092, @alters-mit

We have been running the following code in Docker, and it works, so thanks a lot for fixing the issues!

from agent import init_logs
import pkg_resources
import pickle
import gym
import numpy as np
import time

def main():
    print("main")
    # Create gym environment.
    env = gym.make("transport_challenge-v0", train=0,
                   physics=True, port=7845, launch_build=False)

    with open(pkg_resources.resource_filename("tdw_transport_challenge",
                                               "train_dataset.pkl"), 'rb') as fp:
        dataset = pickle.load(fp)

    scene_number = 0
    obs, info = env.reset(scene_info=dataset[scene_number])

    start_time = time.time()
    action = {"type": 1}
    for i in range(200):
        obs, rewards, done, info = env.step(action)
    print("--- %s seconds ---" % (time.time() - start_time))

if __name__ == "__main__":
    print("start")
    main()
    print("stop")

However, it currently takes 333 seconds to do 200 turns, despite the machine having a solid GPU and no other applications running on it. So I wonder: is TDW running only on the CPU?
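
As a quick sanity check (just a sketch, assuming nvidia-smi is available on the host), something like this could poll GPU utilization while the 200-step loop above runs; if utilization on the RTX 2080 Ti stays near 0% the whole time, the build is presumably rendering on the CPU:

import subprocess
import time

# Sketch: poll GPU utilization/memory on the host while the benchmark runs.
# Assumes nvidia-smi is on PATH; the query/format flags are standard nvidia-smi options.
for _ in range(30):
    out = subprocess.run(["nvidia-smi",
                          "--query-gpu=index,name,utilization.gpu,memory.used",
                          "--format=csv,noheader"],
                         capture_output=True, text=True, check=True)
    print(out.stdout.strip())
    time.sleep(10)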

Here are the GPUs:

xxx@xxx:/etc/X11$ nvidia-xconfig --query-gpu-info
Number of GPUs: 2

GPU #0:
  Name      : Quadro P400
  UUID      : GPU-f52d0e8c-54a9-1e48-7174-9088a3c1e3ef
  PCI BusID : PCI:101:0:0

  Number of Display Devices: 0

GPU #1:
  Name      : NVIDIA GeForce RTX 2080 Ti
  UUID      : GPU-7b7cb22c-b495-114f-be7e-0430c3c19492
  PCI BusID : PCI:179:0:0

  Number of Display Devices: 0

Here is the GPU driver version: NVIDIA-SMI 470.63.01, Driver Version: 470.63.01

Here is the config file:

xxx@xxx:/etc/X11$ cat xorg-1.conf 
# nvidia-xconfig: X configuration file generated by nvidia-xconfig
# nvidia-xconfig:  version 418.56

Section "Files"
EndSection

Section "InputDevice"
    # generated from default
    Identifier     "Mouse0"
    Driver         "mouse"
    Option         "Protocol" "auto"
    Option         "Device" "/dev/psaux"
    Option         "Emulate3Buttons" "no"
    Option         "ZAxisMapping" "4 5"
EndSection

Section "InputDevice"
    # generated from default
    Identifier     "Keyboard0"
    Driver         "kbd"
EndSection

Section "Monitor"
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync       28.0 - 33.0
    VertRefresh     43.0 - 72.0
    Option         "DPMS"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "NVIDIA GeForce RTX 2080 Ti"
    BusID          "PCI:179:0:0"
EndSection

To start the X server we use: sudo nohup Xorg :1 -config /etc/X11/xorg-1.conf

This is the nohup output:

xxx@xxx:/etc/X11$ sudo cat nohup.out 

X.Org X Server 1.19.6
Release Date: 2017-12-20
X Protocol Version 11, Revision 0
Build Operating System: Linux 4.15.0-140-generic x86_64 Ubuntu
Current Operating System: Linux xxx 4.15.0-156-generic #163-Ubuntu SMP Thu Aug 19 23:31:58 UTC 2021 x86_64
Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.15.0-156-generic root=UUID=6fa60daf-78e4-4c2d-a1a9-665cb42094e3 ro quiet splash vt.handoff=1
Build Date: 08 April 2021  01:57:21PM
xorg-server 2:1.19.6-1ubuntu4.9 (For technical support please see http://www.ubuntu.com/support) 
Current version of pixman: 0.34.0
    Before reporting problems, check http://wiki.x.org
    to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
    (++) from command line, (!!) notice, (II) informational,
    (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.1.log", Time: Wed Nov  3 16:03:12 2021
(++) Using config file: "/etc/X11/xorg-1.conf"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"

Here is a note from our IT department:

According to this, the X server should be using the RTX 2080 Ti and not care about the Quadro card at all; i.e., I didn't even create a config file for the Quadro card (since you only want to use the RTX one).
(Not sure if we are also supposed to delete the "InputDevice" and "Monitor" sections from the config file, but the instructions say to only delete the "ServerLayout" and "Screen" sections, which I did.) Maybe Device0 is supposed to be Device1?

Here is my command to build it: docker build --no-cache -t submission_image .

Here is my command to run it: nvidia-docker run --network none --env="DISPLAY=:1" --volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" --volume="/tmp/output:/results" -e NVIDIA_DRIVER_CAPABILITIES=all -e TRANSPORT_CHALLENGE=file:////model_library -e NUM_EVAL_EPISODES=1 submission_image sh run_baseline_agent2.sh 7845

Here is run_baseline_agent2.sh:

#!/bin/bash
./TDW/TDW.x86_64 -port=$1 &
conda run --no-capture-output -n transport_challenge_env python test.py --agent-class Test --port $1

This performance is not normal, right? If so, how can we fix it?

alters-mit commented 3 years ago

@meier-johannes94 That is much slower than the simulation should be running.

Try this:

  1. Run the Magnebot performance benchmarks and compare them to our benchmarks. If your results are much slower, that would imply that there is something unusual about your server setup.
  2. Check the output from the test you posted. Some actions take longer to complete than others. For example, if the Magnebot is repeatedly colliding with something as it turns, that might explain why a turn action takes a long time before terminating unsuccessfully (see the timing sketch below).
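
As a rough sketch of what to log (it reuses the gym loop, env, dataset, and action dict from your first post, so those names are assumed), timing each step individually will show whether a few actions dominate or every step is uniformly slow:

import time

# Sketch: time each step of the same 200-turn loop from the first post.
# env, dataset and scene_number are assumed to be set up as in that script.
obs, info = env.reset(scene_info=dataset[scene_number])
action = {"type": 1}
for i in range(200):
    t0 = time.time()
    obs, reward, done, info = env.step(action)
    # A few very slow steps suggest the action itself is the problem (e.g. repeated collisions);
    # uniformly slow steps suggest rendering or a CPU-bound build.
    print(f"step {i}: {time.time() - t0:.2f} s, reward={reward}, done={done}")
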
meier-johannes94 commented 3 years ago

@alters-mit

  1. Can you help me run this?

I tried running it like this and got the following error message:

jmeier@eiturtindur:~/tdw/docker_setup/tdw-transport-challenge-starter-code$ nvidia-docker run -p 443:443  --network none --env="DISPLAY=:1" --volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" --volume="/tmp/output:/results" -e NVIDIA_DRIVER_CAPABILITIES=all -e TRANSPORT_CHALLENGE=file:////model_library -e NUM_EVAL_EPISODES=1 submission_image sh run_baseline_agent2.sh 7845
Set current directory to /
Found path: /TDW/TDW.x86_64
Traceback (most recent call last):
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/urllib3/connection.py", line 175, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/urllib3/util/connection.py", line 73, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/socket.py", line 752, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 382, in _make_request
    self._validate_conn(conn)
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 1010, in _validate_conn
    conn.connect()
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/urllib3/connection.py", line 358, in connect
    conn = self._new_conn()
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/urllib3/connection.py", line 187, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f5433406810>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 756, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/urllib3/util/retry.py", line 574, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /pypi/tdw/json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5433406810>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test2.py", line 65, in <module>
    m = Benchmark()
  File "test2.py", line 13, in __init__
    super().__init__(port=port, screen_height=screen_height, screen_width=screen_width)
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/magnebot/test_controller.py", line 15, in __init__
    debug=True, skip_frames=skip_frames)
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/magnebot/magnebot_controller.py", line 336, in __init__
    super().__init__(port=port, launch_build=launch_build, check_version=check_pypi_version)
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/tdw/controller.py", line 39, in __init__
    self._check_pypi_version()
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/tdw/controller.py", line 392, in _check_pypi_version
    pypi_version = PyPi.get_pypi_version()
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/tdw/release/pypi.py", line 58, in get_pypi_version
    v = PyPi._get_pypi_releases()[-1]
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/tdw/release/pypi.py", line 43, in _get_pypi_releases
    resp = get("https://pypi.org/pypi/tdw/json")
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /pypi/tdw/json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5433406810>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))
ERROR conda.cli.main_run:execute(33): Subprocess for 'conda run ['python', 'test2.py', '--agent-class', 'Test', '--port', '7845']' command failed.  (See above for error)

Is there still some kind of port that is not open?

  2. I printed the reward, and it is always 0. To me this suggests that the Magnebot doesn't collide, because otherwise the reward should be -0.1, correct?
meier-johannes94 commented 3 years ago

@alters-mit I asked the "IT department" again, and they currently assume it doesn't work because of a TDW / Transport Challenge specific issue. So at the moment we are somewhat blocked and don't know how to continue. What would you suggest we do to proceed?

meier-johannes94 commented 3 years ago

Sorry, closing the issue was done by accident.

alters-mit commented 3 years ago

@meier-johannes94 You're getting a network error (name resolution fails because the container runs with --network none) while the controller tries to check the latest version of TDW on PyPi.

In the controller's constructor, set check_pypi_version=False (this value is automatically set to False in the Transport Challenge):

from typing import List
from time import time
from magnebot.test_controller import TestController

class Benchmark(TestController):
    """
    Run simple benchmarks for the average speed of an action.

    In an actual use-case, the action will usually be somewhat slower because of the complexity of the scene.
    """

    def __init__(self, port: int = 1071, screen_width: int = 256, screen_height: int = 256):
        super().__init__(port=port, screen_height=screen_height, screen_width=screen_width, check_pypi_version=False)
        self._debug = False

    def move_fps(self) -> float:
        """
        Benchmark the speed of `move_by()`.

        :return: The average time elapsed per action.
        """

        self.init_scene()
        return self._get_move_fps()

    def turn_fps(self) -> float:
        """
        Benchmark the speed of `turn_by()`.

        :return: The average time elapsed per action.
        """

        self.init_scene()
        times: List[float] = list()
        for i in range(20):
            t0 = time()
            self.turn_by(45)
            times.append(time() - t0)
        return sum(times) / len(times)

    def step_fps(self) -> None:
        print("| Skipped frames | Time elapsed |\n| --- | --- |")
        for frames in [0, 5, 10, 15, 20]:
            self.init_scene()
            self._skip_frames = frames
            t = self._get_move_fps()
            print(f"| {frames} | {t} |")

    def _get_move_fps(self) -> float:
        """
        Move backwards and forwards and get the average time elapsed per action.

        :return: The average time elapsed of the action.
        """

        times: List[float] = list()
        direction = 1
        for i in range(20):
            if i > 0 and i % 5 == 0:
                direction *= -1
            t0 = time()
            self.move_by(0.5 * direction)
            times.append(time() - t0)
        return sum(times) / len(times)

if __name__ == "__main__":
    m = Benchmark()
    print(f"turn_by(): {m.turn_fps()}")
    print(f"move_by(): {m.move_fps()}")
    m.step_fps()

    m.end()
meier-johannes94 commented 2 years ago

@alters-mit Sorry, I have been ill since your reply; that's why this took longer.

I noticed that the constructor of TestController doesn't even have a check_pypi_version parameter in the corresponding magnebot version. Therefore I added it when calling super().__init__():

from tdw.tdw_utils import TDWUtils
from magnebot.magnebot_controller import Magnebot
from magnebot.action_status import ActionStatus

class TestController(Magnebot):
    """
    This controller will load an empty test room instead of a highly detailed scene.

    This can be useful for testing the Magnebot.
    """

    def __init__(self, port: int = 1071, screen_width: int = 256, screen_height: int = 256, skip_frames: int = 10):
        super().__init__(port=port, launch_build=False, screen_height=screen_height, screen_width=screen_width,
                         debug=True, skip_frames=skip_frames, check_pypi_version=False)

    def init_scene(self, scene: str = None, layout: int = None, room: int = None) -> ActionStatus:
        """
        **Always call this function before any other API calls.** Initialize an empty test room with a Magnebot.

        You can safely call `init_scene()` more than once to reset the simulation.

        ```python
        from magnebot import TestController

        m = TestController()
        m.init_scene()

        # Your code here.
        ```

        Possible [return values](action_status.md):

        - `success`
        """

        self._clear_data()

        commands = [{"$type": "load_scene",
                     "scene_name": "ProcGenScene"},
                    TDWUtils.create_empty_room(12, 12)]
        commands.extend(self._get_scene_init_commands(
            magnebot_position={"x": 0, "y": 0, "z": 0}))
        resp = self.communicate(commands)
        self._cache_static_data(resp=resp)
        # Wait for the Magnebot to reset to its neutral position.
        self._do_arm_motion()
        self._end_action()
        return ActionStatus.success

Before that, I was also modifying `magnebot_controller.py`. Here is a piece of it:

    def __init__(self, port: int = 1071, launch_build: bool = False, screen_width: int = 256, screen_height: int = 256,
                 debug: bool = False, auto_save_images: bool = False, images_directory: str = "images",
                 random_seed: int = None, img_is_png: bool = False, skip_frames: int = 10,
                 check_pypi_version: bool = True):
        """
        :param port: The socket port. Read this for more information.
        :param launch_build: If True, the build will launch automatically on the default port (1071). If False, you will need to launch the build yourself (for example, from a Docker container).
        :param screen_width: The width of the screen in pixels.
        :param screen_height: The height of the screen in pixels.
        :param auto_save_images: If True, automatically save images to images_directory at the end of every action.
        :param images_directory: The output directory for images if auto_save_images == True.
        :param random_seed: The seed used for random numbers. If None, this is chosen randomly. In the Magnebot API this is used only when randomly selecting a start position for the Magnebot (see the room parameter of init_scene()). The same random seed is used in higher-level APIs such as the Transport Challenge.
        :param debug: If True, enable debug mode. This controller will output messages to the console, including any warnings or errors sent by the build. It will also create 3D plots of arm articulation IK solutions.
        :param img_is_png: If True, the img pass images will be .png files. If False, the img pass images will be .jpg files, which are smaller; the build will run approximately 2% faster.
        :param skip_frames: The build will return output data this many physics frames per simulation frame (communicate() call). This will greatly speed up the simulation, but eventually there will be a noticeable loss in physics accuracy. If you want to render every frame, set this to 0.
        :param check_pypi_version: If True, compare the locally installed version of TDW and Magnebot to the most recent versions on PyPi.
        """

        check_pypi_version = False

However, when I run the build, we don't get beyond "Found path ...":

(etc3.8) jmeier@eiturtindur:~/tdw/docker_setup/tdw-transport-challenge-starter-code$ nvidia-docker run --network none --env="DISPLAY=:1" --volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" --volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" --volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" submission_image sh run_baseline_agent2.sh 7845
Set current directory to /
Found path: /TDW/TDW.x86_64


Here is the test2.py that is called via run_baseline_agent2.sh:

from typing import List
from time import time
from magnebot.test_controller import TestController

class Benchmark(TestController):
    """
    Run simple benchmarks for the average speed of an action.

    In an actual use-case, the action will usually be somewhat slower because of the complexity of the scene.
    """

    def __init__(self, port: int = 7845, screen_width: int = 256, screen_height: int = 256):
        super().__init__(port=port, screen_height=screen_height, screen_width=screen_width)
        self._debug = False

    def move_fps(self) -> float:
        """
        Benchmark the speed of `move_by()`.

        :return: The average time elapsed per action.
        """

        self.init_scene()
        return self._get_move_fps()

    def turn_fps(self) -> float:
        """
        Benchmark the speed of `turn_by()`.

        :return: The average time elapsed per action.
        """

        self.init_scene()
        times: List[float] = list()
        for i in range(20):
            t0 = time()
            self.turn_by(45)
            times.append(time() - t0)
        return sum(times) / len(times)

    def step_fps(self) -> None:
        print("| Skipped frames | Time elapsed |\n| --- | --- |")
        for frames in [0, 5, 10, 15, 20]:
            self.init_scene()
            self._skip_frames = frames
            t = self._get_move_fps()
            print(f"| {frames} | {t} |")

    def _get_move_fps(self) -> float:
        """
        Move backwards and forwards and get the average time elapsed per action.

        :return: The average time elapsed of the action.
        """

        times: List[float] = list()
        direction = 1
        for i in range(20):
            if i > 0 and i % 5 == 0:
                direction *= -1
            t0 = time()
            self.move_by(0.5 * direction)
            times.append(time() - t0)
        return sum(times) / len(times)

if __name__ == "__main__":
    m = Benchmark()
    print(f"turn_by(): {m.turn_fps()}")
    print(f"move_by(): {m.move_fps()}")
    m.step_fps()
    m.end()

run_baseline_agent2.sh:

#!/bin/bash
./TDW/TDW.x86_64 -port=$1 &
conda run --no-capture-output -n transport_challenge_env python test2.py

meier-johannes94 commented 2 years ago

@alters-mit Is there a connection problem? If so, how can we fix it?
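
One minimal check I could run (just a sketch; it assumes the plain tdw Controller with its launch_build and check_version parameters) is to start the build the same way and then connect a bare controller to port 7845:

from tdw.controller import Controller

# Sketch: connect a bare tdw Controller to the already-running build on port 7845,
# skipping the PyPi version check, and shut it down again. If this also hangs,
# the build never opened its socket (e.g. a display / X server problem inside the
# container); if it works, the problem is further up in the Magnebot/benchmark code.
c = Controller(port=7845, launch_build=False, check_version=False)
c.communicate({"$type": "terminate"})
print("Connected to the build and terminated it.")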