Open meier-johannes94 opened 3 years ago
@meier-johannes94 That is much slower than the simulation should be running.
Try this:
@alters-mit
I tried running it like this and got the following error message:
```
jmeier@eiturtindur:~/tdw/docker_setup/tdw-transport-challenge-starter-code$ nvidia-docker run -p 443:443 --network none --env="DISPLAY=:1" --volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" --volume="/tmp/output:/results" -e NVIDIA_DRIVER_CAPABILITIES=all -e TRANSPORT_CHALLENGE=file:////model_library -e NUM_EVAL_EPISODES=1 submission_image sh run_baseline_agent2.sh 7845
Set current directory to /
Found path: /TDW/TDW.x86_64
Traceback (most recent call last):
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/urllib3/connection.py", line 175, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/urllib3/util/connection.py", line 73, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/socket.py", line 752, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 382, in _make_request
    self._validate_conn(conn)
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 1010, in _validate_conn
    conn.connect()
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/urllib3/connection.py", line 358, in connect
    conn = self._new_conn()
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/urllib3/connection.py", line 187, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f5433406810>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/urllib3/connectionpool.py", line 756, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/urllib3/util/retry.py", line 574, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /pypi/tdw/json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5433406810>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test2.py", line 65, in <module>
    m = Benchmark()
  File "test2.py", line 13, in __init__
    super().__init__(port=port, screen_height=screen_height, screen_width=screen_width)
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/magnebot/test_controller.py", line 15, in __init__
    debug=True, skip_frames=skip_frames)
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/magnebot/magnebot_controller.py", line 336, in __init__
    super().__init__(port=port, launch_build=launch_build, check_version=check_pypi_version)
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/tdw/controller.py", line 39, in __init__
    self._check_pypi_version()
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/tdw/controller.py", line 392, in _check_pypi_version
    pypi_version = PyPi.get_pypi_version()
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/tdw/release/pypi.py", line 58, in get_pypi_version
    v = PyPi._get_pypi_releases()[-1]
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/tdw/release/pypi.py", line 43, in _get_pypi_releases
    resp = get("https://pypi.org/pypi/tdw/json")
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/miniconda/envs/transport_challenge_env/lib/python3.7/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /pypi/tdw/json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5433406810>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))
ERROR conda.cli.main_run:execute(33): Subprocess for 'conda run ['python', 'test2.py', '--agent-class', 'Test', '--port', '7845']' command failed. (See above for error)
```
Is there still some kind of port not open?
@alters-mit I asked the "IT department" again, and they currently assume that it doesn't work because of an issue specific to TDW / the Transport Challenge. So at the moment we are blocked and don't know how to continue. How do you suggest we proceed?
Sorry, closing the issue was an accident.
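@meier-johannes94 You're getting a connection error while the controller tries to check the latest version of TDW on PyPi: the traceback ends in a name-resolution failure for `pypi.org`, which is expected when the container runs with `--network none`.
In the controller's constructor, set `check_pypi_version=False` (this value is automatically set to False in the Transport Challenge):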
```python
from typing import List
from time import time
from magnebot.test_controller import TestController


class Benchmark(TestController):
    """
    Run simple benchmarks for the average speed of an action.

    In an actual use-case, the action will usually be somewhat slower because of the complexity of the scene.
    """

    def __init__(self, port: int = 1071, screen_width: int = 256, screen_height: int = 256):
        super().__init__(port=port, screen_height=screen_height, screen_width=screen_width, check_pypi_version=False)
        self._debug = False

    def move_fps(self) -> float:
        """
        Benchmark the speed of `move_by()`.

        :return: The average time elapsed per action.
        """

        self.init_scene()
        return self._get_move_fps()

    def turn_fps(self) -> float:
        """
        Benchmark the speed of `turn_by()`.

        :return: The average time elapsed per action.
        """

        self.init_scene()
        times: List[float] = list()
        for i in range(20):
            t0 = time()
            self.turn_by(45)
            times.append(time() - t0)
        return sum(times) / len(times)

    def step_fps(self) -> None:
        print("| Skipped frames | Time elapsed |\n| --- | --- |")
        for frames in [0, 5, 10, 15, 20]:
            self.init_scene()
            self._skip_frames = frames
            t = self._get_move_fps()
            print(f"| {frames} | {t} |")

    def _get_move_fps(self) -> float:
        """
        Move backwards and forwards and get the average time elapsed per action.

        :return: The average time elapsed of the action.
        """

        times: List[float] = list()
        direction = 1
        for i in range(20):
            if i > 0 and i % 5 == 0:
                direction *= -1
            t0 = time()
            self.move_by(0.5 * direction)
            times.append(time() - t0)
        return sum(times) / len(times)


if __name__ == "__main__":
    m = Benchmark()
    print(f"turn_by(): {m.turn_fps()}")
    print(f"move_by(): {m.move_fps()}")
    m.step_fps()
    m.end()
```
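The traceback earlier in this thread ends in `socket.gaierror`, meaning the host name `pypi.org` could not be resolved from inside the container. As a rough sketch (not part of TDW or Magnebot), a script could probe name resolution first and only ask for the PyPi check when it is likely to succeed. The helper name `can_resolve` and its `resolver` parameter are hypothetical; only `socket.getaddrinfo` and `socket.gaierror` are standard-library names.

```python
import socket
from typing import Callable


def can_resolve(host: str = "pypi.org", port: int = 443,
                resolver: Callable = socket.getaddrinfo) -> bool:
    """
    Return True if `host` resolves, False if name resolution fails.

    `resolver` is injectable so the function can be exercised without a network.
    """
    try:
        resolver(host, port)
        return True
    except OSError:
        # socket.gaierror (the error in the traceback) is a subclass of OSError.
        return False
```

The result could then be passed as `check_pypi_version` when constructing the controller, assuming the installed version accepts that parameter. In a container started with `--network none` it should always come back False, so hard-coding `check_pypi_version=False` as suggested above is simpler there.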
@alters-mit Sorry, I have been ill since your reply, which is why this took so long.
I noticed that the `TestController` constructor doesn't accept a `check_pypi_version` parameter in the corresponding Magnebot version, so I added `check_pypi_version=False` to the `super().__init__()` call instead:
````python
from tdw.tdw_utils import TDWUtils
from magnebot.magnebot_controller import Magnebot
from magnebot.action_status import ActionStatus


class TestController(Magnebot):
    """
    This controller will load an empty test room instead of a highly detailed scene.

    This can be useful for testing the Magnebot.
    """

    def __init__(self, port: int = 1071, screen_width: int = 256, screen_height: int = 256, skip_frames: int = 10):
        super().__init__(port=port, launch_build=False, screen_height=screen_height, screen_width=screen_width,
                         debug=True, skip_frames=skip_frames, check_pypi_version=False)

    def init_scene(self, scene: str = None, layout: int = None, room: int = None) -> ActionStatus:
        """
        **Always call this function before any other API calls.** Initialize an empty test room with a Magnebot.

        You can safely call `init_scene()` more than once to reset the simulation.

        ```python
        from magnebot import TestController

        m = TestController()
        m.init_scene()

        # Your code here.
        ```

        Possible [return values](action_status.md):

        - `success`
        """

        self._clear_data()

        commands = [{"$type": "load_scene",
                     "scene_name": "ProcGenScene"},
                    TDWUtils.create_empty_room(12, 12)]
        commands.extend(self._get_scene_init_commands(
            magnebot_position={"x": 0, "y": 0, "z": 0}))
        resp = self.communicate(commands)
        self._cache_static_data(resp=resp)
        # Wait for the Magnebot to reset to its neutral position.
        self._do_arm_motion()
        self._end_action()
        return ActionStatus.success
````
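Why the keyword has to go into the `super().__init__()` call rather than into the subclass call can be shown with stand-in classes. The sketch below uses the hypothetical classes `Base`, `Middle`, and `Child` in place of the real controllers; it is not the Magnebot source, only an illustration of how Python treats a keyword argument that an intermediate constructor does not accept.

```python
class Base:
    """Stand-in for a base controller whose constructor accepts the flag."""

    def __init__(self, port: int = 1071, check_pypi_version: bool = True):
        self.port = port
        self.check_pypi_version = check_pypi_version


class Middle(Base):
    """Stand-in for an intermediate controller that does not expose the flag."""

    def __init__(self, port: int = 1071):
        # The flag is fixed here because callers cannot pass it through.
        super().__init__(port=port, check_pypi_version=False)


class Child(Middle):
    """Stand-in for a user subclass such as a benchmark script."""

    def __init__(self, port: int = 1071):
        super().__init__(port=port)


def try_unsupported_keyword() -> str:
    """Pass the flag to a constructor that does not accept it and report the error type."""
    try:
        Middle(port=1071, check_pypi_version=False)
    except TypeError as e:
        return type(e).__name__
    return "no error"
```

With this layout, `Child()` ends up with the flag disabled, while passing the keyword directly to `Middle` raises a `TypeError`.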
Before that, I had also modified `magnebot_controller.py`. Here is part of it:
```python
    def __init__(self, port: int = 1071, launch_build: bool = False, screen_width: int = 256, screen_height: int = 256,
                 debug: bool = False, auto_save_images: bool = False, images_directory: str = "images",
                 random_seed: int = None, img_is_png: bool = False, skip_frames: int = 10,
                 check_pypi_version: bool = True):
        """
        :param port: The socket port. Read this for more information.
        :param launch_build: If True, the build will launch automatically on the default port (1071). If False, you will need to launch the build yourself (for example, from a Docker container).
        :param screen_width: The width of the screen in pixels.
        :param screen_height: The height of the screen in pixels.
        :param auto_save_images: If True, automatically save images to `images_directory` at the end of every action.
        :param images_directory: The output directory for images if `auto_save_images == True`.
        :param random_seed: The seed used for random numbers. If None, this is chosen randomly. In the Magnebot API this is used only when randomly selecting a start position for the Magnebot (see the `room` parameter of `init_scene()`). The same random seed is used in higher-level APIs such as the Transport Challenge.
        :param debug: If True, enable debug mode. This controller will output messages to the console, including any warnings or errors sent by the build. It will also create 3D plots of arm articulation IK solutions.
        :param img_is_png: If True, the `img` pass images will be .png files. If False, the `img` pass images will be .jpg files, which are smaller; the build will run approximately 2% faster.
        :param skip_frames: The build will return output data this many physics frames per simulation frame (`communicate()` call). This will greatly speed up the simulation, but eventually there will be a noticeable loss in physics accuracy. If you want to render every frame, set this to 0.
        :param check_pypi_version: If True, compare the locally installed version of TDW and Magnebot to the most recent versions on PyPi.
        """

        check_pypi_version = False
```
However, when I run the build, the output doesn't get beyond "Found path ...":
```
(etc3.8) jmeier@eiturtindur:~/tdw/docker_setup/tdw-transport-challenge-starter-code$ nvidia-docker run --network none --env="DISPLAY=:1" --volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" --volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" --volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" submission_image sh run_baseline_agent2.sh 7845
Set current directory to /
Found path: /TDW/TDW.x86_64
```
Here is the `test2.py` that is called via `run_baseline_agent2.sh`:
```python
from typing import List
from time import time
from magnebot.test_controller import TestController


class Benchmark(TestController):
    """
    Run simple benchmarks for the average speed of an action.

    In an actual use-case, the action will usually be somewhat slower because of the complexity of the scene.
    """

    def __init__(self, port: int = 7845, screen_width: int = 256, screen_height: int = 256):
        super().__init__(port=port, screen_height=screen_height, screen_width=screen_width)
        self._debug = False

    def move_fps(self) -> float:
        """
        Benchmark the speed of `move_by()`.

        :return: The average time elapsed per action.
        """

        self.init_scene()
        return self._get_move_fps()

    def turn_fps(self) -> float:
        """
        Benchmark the speed of `turn_by()`.

        :return: The average time elapsed per action.
        """

        self.init_scene()
        times: List[float] = list()
        for i in range(20):
            t0 = time()
            self.turn_by(45)
            times.append(time() - t0)
        return sum(times) / len(times)

    def step_fps(self) -> None:
        print("| Skipped frames | Time elapsed |\n| --- | --- |")
        for frames in [0, 5, 10, 15, 20]:
            self.init_scene()
            self._skip_frames = frames
            t = self._get_move_fps()
            print(f"| {frames} | {t} |")

    def _get_move_fps(self) -> float:
        """
        Move backwards and forwards and get the average time elapsed per action.

        :return: The average time elapsed of the action.
        """

        times: List[float] = list()
        direction = 1
        for i in range(20):
            if i > 0 and i % 5 == 0:
                direction *= -1
            t0 = time()
            self.move_by(0.5 * direction)
            times.append(time() - t0)
        return sum(times) / len(times)


if __name__ == "__main__":
    m = Benchmark()
    print(f"turn_by(): {m.turn_fps()}")
    print(f"move_by(): {m.move_fps()}")
    m.step_fps()
    m.end()
```
`run_baseline_agent2.sh`:
```
./TDW/TDW.x86_64 -port=$1 &
conda run --no-capture-output -n transport_challenge_env python test2.py
```
@alters-mit Is there a connection problem? If so, how can we fix it?
@abhi1092, @alters-mit
We have been running the following code in Docker, and it worked. Thanks a lot for fixing the issues!
However, it currently takes 333 seconds to do 200 turns, even though the machine has a solid GPU and no other application is running on it. So I wonder whether TDW is running only on the CPU.
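For reference, the figure above works out to roughly 1.7 seconds per turn. A small sketch of the arithmetic (the function name `seconds_per_action` is made up for illustration):

```python
def seconds_per_action(total_seconds: float, num_actions: int) -> float:
    """Average wall-clock time per action."""
    if num_actions <= 0:
        raise ValueError("num_actions must be positive")
    return total_seconds / num_actions


# Figures quoted in this comment: 333 seconds for 200 turns.
per_turn = seconds_per_action(333, 200)
print(f"{per_turn:.3f} s per turn, {1 / per_turn:.2f} turns per second")
```

This is the same kind of per-action average that `turn_fps()` in the benchmark script returns, so the two numbers can be compared directly.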
Here are the GPUs:
Here is the GPU driver version:
NVIDIA-SMI 470.63.01 Driver Version: 470.63.01
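One way to check whether the GPU is doing any work while the benchmark runs is to sample `nvidia-smi` from the host. The sketch below is an illustration under stated assumptions, not something taken from the TDW documentation: the helper names are hypothetical, and it relies on the `--query-gpu` and `--format=csv,noheader,nounits` options of `nvidia-smi`. If utilization stays near zero during the turns, that would be consistent with rendering not happening on the GPU, though it is not proof either way.

```python
import subprocess
from typing import List, Tuple


def parse_gpu_usage(csv_text: str) -> List[Tuple[int, int]]:
    """
    Parse lines of 'utilization.gpu, memory.used' into (percent, MiB) tuples.

    Assumes both fields are numeric; some GPUs report [N/A], which this sketch does not handle.
    """
    rows = []
    for line in csv_text.strip().splitlines():
        util, mem = (field.strip() for field in line.split(","))
        rows.append((int(util), int(mem)))
    return rows


def query_gpu_usage() -> List[Tuple[int, int]]:
    """Run nvidia-smi once and return one (utilization %, memory MiB) tuple per GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_usage(out)
```

Calling `query_gpu_usage()` a few times while `turn_fps()` is running, and once while the machine is idle, gives a rough before/after comparison.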
Here is the conf file:
To start the X server, we use:
sudo nohup Xorg :1 -config /etc/X11/xorg-1.conf
This is the nohup output:
Here is a note from IT:
Here is my command to build it:
docker build --no-cache -t submission_image .
Here is my command to run it:
nvidia-docker run --network none --env="DISPLAY=:1" --volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" --volume="/tmp/output:/results" -e NVIDIA_DRIVER_CAPABILITIES=all -e TRANSPORT_CHALLENGE=file:////model_library -e NUM_EVAL_EPISODES=1 submission_image sh run_baseline_agent2.sh 7845
Here is `run_baseline_agent2.sh`:
This performance is not normal, right? If so, how can we fix it?