DavidPL1 / assembly_example

Example code for interaction with our assembly simulation ICRA 2023 challenge

GPU availability and difficulty level #5

Closed renyu2016 closed 1 year ago

renyu2016 commented 1 year ago

Hi! I have some questions as follows:

  1. Can we use GPU in our solution? Are there any limitations for GPU devices?
  2. The description of the "difficulty_level" parameter says "1-3 for screwing, 1-4 for plugging", but when I run the simulation environment with difficulty_level:=4 for the plugging task, it fails with "IndexError: tuple index out of range".
DavidPL1 commented 1 year ago

Hi,

  1. I assume you are asking if you can use GPUs to train deep models. As the training and evaluation processes are two completely separate things, for training you can use as many GPU resources as you want. The evaluation pipeline is not fully fleshed out at the moment, so I can't tell you if GPUs will be available for that part. In case they aren't, most common deep learning frameworks allow inference on CPU only. This might require some implementation changes in the docker image you will hand in. We will notify you (most probably via an email announcement) once the details are settled.
  2. Could you please share some more of the log output? Then @balandbal will have a better idea of where the problematic code is and get back to you on that matter.
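The CPU-only fallback mentioned in point 1 could be handled along the lines of the sketch below. It assumes PyTorch as the framework, which the thread does not specify, and `pick_device` is a hypothetical helper name. If PyTorch is not installed at all, the sketch simply reports "cpu".

```python
def pick_device() -> str:
    """Return "cuda" when a usable GPU is visible, otherwise "cpu"."""
    try:
        import torch  # assumption: the submitted solution uses PyTorch
    except ImportError:
        # No PyTorch in this environment, so there is nothing to put on a GPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    device = pick_device()
    print(f"running inference on: {device}")
```

A model trained on a GPU can then be moved to whatever `pick_device()` returns, so the same docker image works whether or not the evaluation servers expose a GPU.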
renyu2016 commented 1 year ago

Thanks for your reply.

For question 2: my running command is: docker run --rm --net=host -it s4dx/assembly_server headless:=false eval_mode:=true difficulty_level:=4 task:=plugging

The outputs in the terminal are as follows:

_XSERVTransSocketUNIXCreateListener: ...SocketCreateListener() failed
_XSERVTransMakeAllCOTSServerListeners: server already running
(EE)
Fatal server error:
(EE) Cannot establish any listening sockets - Make sure an X server isn't already running(EE)
Display exists
roslaunch assembly_manager assembly_manager.launch headless:=false eval_mode:=true difficulty_level:=4 task:=plugging
... logging to /root/.ros/log/b831f3f2-bfad-11ed-993b-2cf05daf3f1a/roslaunch-dwmsyu-1.log
Checking log directory for disk usage. This may take a while.
Press Ctrl-C to interrupt
Done checking log file disk usage. Usage is <1GB.

started roslaunch server http://dwmsyu:36581/

SUMMARY

PARAMETERS

  • /assembly_manager/difficulty_level: 4
  • /assembly_manager/eval_mode: True
  • /assembly_manager/headless: False
  • /assembly_manager/save_results_path:
  • /assembly_manager/seed: -1
  • /assembly_manager/task: plugging
  • /assembly_manager/verbose: False
  • /rosdistro: noetic
  • /rosversion: 1.15.15

NODES / assembly_manager (assembly_manager/assembly_manager)

auto-starting new master
process[master]: started with pid [35]
ROS_MASTER_URI=http://localhost:11311

setting /run_id to b831f3f2-bfad-11ed-993b-2cf05daf3f1a
process[rosout-1]: started with pid [45]
started core service [/rosout]
process[assembly_manager-2]: started with pid [52]
Traceback (most recent call last):
File "/home/assembly_server/lib/assembly_manager/assembly_manager", line 5, in main()
File "", line 34, in main
File "", line 182, in from_param_server
IndexError: tuple index out of range
[assembly_manager-2] process has died [pid 52, exit code 1, cmd /home/assembly_server/lib/assembly_manager/assembly_manager __name:=assembly_manager __log:=/root/.ros/log/b831f3f2-bfad-11ed-993b-2cf05daf3f1a/assembly_manager-2.log].
log file: /root/.ros/log/b831f3f2-bfad-11ed-993b-2cf05daf3f1a/assembly_manager-2*.log

The related information in the log file:

[rospy.client][INFO] 2023-03-11 01:43:57,760: init_node, name[/assembly_manager], pid[53]
[xmlrpc][INFO] 2023-03-11 01:43:57,760: XML-RPC server binding to 0.0.0.0:0
[xmlrpc][INFO] 2023-03-11 01:43:57,760: Started XML-RPC server [http://dwmsyu:40839/]
[rospy.impl.masterslave][INFO] 2023-03-11 01:43:57,760: _ready: http://dwmsyu:40839/
[rospy.init][INFO] 2023-03-11 01:43:57,760: ROS Slave URI: [http://dwmsyu:40839/]
[rospy.registration][INFO] 2023-03-11 01:43:57,761: Registering with master node http://localhost:11311
[xmlrpc][INFO] 2023-03-11 01:43:57,761: xml rpc node: starting XML-RPC server
[rospy.init][INFO] 2023-03-11 01:43:57,861: registered with master
[rospy.rosout][INFO] 2023-03-11 01:43:57,861: initializing /rosout core topic
[rospy.rosout][INFO] 2023-03-11 01:43:57,871: connected to core topic /rosout
[rospy.simtime][INFO] 2023-03-11 01:43:57,874: /use_sim_time is not set, will not subscribe to simulated time [/clock] topic
[rospy.core][INFO] 2023-03-11 01:43:57,970: signal_shutdown [atexit]
[rospy.impl.masterslave][INFO] 2023-03-11 01:43:57,971: atexit

balandbal commented 1 year ago

So the argument difficulty_level is an integer that is used directly to index a list of difficulty-level configurations, at least for the plugging task. Therefore, per version 1.1 of the task docker image,

Hence the "out of range" error with index 4.

It would make sense to make this more "human-friendly" (and to harmonize it with the screwing task) and use the level as a 1-based numbering. The change might be part of the next release.
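The difference between indexing directly with the level and treating it as a 1-based number can be sketched as follows. This is an illustration only: `PLUGGING_LEVELS`, its four placeholder entries, and both function names are hypothetical, since the actual configuration tuple is not shown in this thread.

```python
# Hypothetical stand-in for the plugging configurations. The real
# contents and length of this tuple are not shown in the thread.
PLUGGING_LEVELS = ("config_a", "config_b", "config_c", "config_d")


def config_zero_based(level: int) -> str:
    """Use the level directly as an index, as described for version 1.1."""
    return PLUGGING_LEVELS[level]


def config_one_based(level: int) -> str:
    """Treat the level as a human-friendly number starting at 1."""
    if not 1 <= level <= len(PLUGGING_LEVELS):
        raise ValueError(
            f"difficulty_level must be in 1..{len(PLUGGING_LEVELS)}, got {level}"
        )
    return PLUGGING_LEVELS[level - 1]


if __name__ == "__main__":
    print(config_one_based(4))  # last entry of the tuple
    try:
        config_zero_based(4)
    except IndexError as err:
        print(f"direct indexing fails: {err}")
```

The explicit range check in `config_one_based` also guards against level 0, which would otherwise silently map to index -1 and select the last entry.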

balandbal commented 1 year ago

@DavidPL1 I have tried to launch the screwing task with arbitrary difficulty levels, e.g., 12, and there was no error/warning message of it being out of scope. Is this the intended behavior?

DavidPL1 commented 1 year ago

I've looked at the code and it seems the configuration is invalid if the level is set above 4. We probably should print a detailed error message and abort the launch if an undefined level is configured.
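A launch-time check of that kind might look like the sketch below. The bounds in `MAX_LEVEL` are placeholders copied from the parameter description quoted at the top of this issue ("1-3 for screwing, 1-4 for plugging") and may not match the real configuration. `check_level` and `launch` are hypothetical names, not part of the assembly_manager code.

```python
import sys

# Assumption: upper bounds taken from the parameter description quoted
# in this issue; the real limits may differ.
MAX_LEVEL = {"screwing": 3, "plugging": 4}


def check_level(task: str, level: int) -> None:
    """Raise ValueError with a detailed message for an undefined level."""
    if task not in MAX_LEVEL:
        raise ValueError(f"unknown task {task!r}, expected one of {sorted(MAX_LEVEL)}")
    if not 1 <= level <= MAX_LEVEL[task]:
        raise ValueError(
            f"difficulty_level {level} is not defined for task {task!r}; "
            f"valid levels are 1..{MAX_LEVEL[task]}"
        )


def launch(task: str, level: int) -> int:
    """Return a process exit code: 0 if the launch may proceed, 1 otherwise."""
    try:
        check_level(task, level)
    except ValueError as err:
        print(f"error: {err}", file=sys.stderr)
        return 1
    # The actual simulation start-up would go here.
    return 0


if __name__ == "__main__":
    sys.exit(launch("screwing", 12))
```

Returning a non-zero exit code lets roslaunch report the node as died with a readable message instead of continuing with an invalid configuration.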

Let's aim for a new release by the middle of next week that includes improvements for both task launchers. It will possibly also include the solution to #4 if I manage to come up with a fix by then.

renyu2016 commented 1 year ago

Got it! Thanks again.

Looking forward to the new release, thanks!

DavidPL1 commented 1 year ago

The difficulty level discrepancy was fixed in the 2.0 release, which is now available.

As for the GPU availability: I haven't gotten a definitive OK from the guys in charge of the servers, but I had an unofficial confirmation that a GPU most probably should be available in our evaluation stage. You will get updated once it is made official.

Director-of-G commented 1 year ago

@DavidPL1 @renyu2016 Hi, we tried to run the server with --net=host but encountered an error. The terminal output included lines similar to those in this issue:

Fatal server error: (EE) Cannot establish any listening sockets - Make sure an X server isn't already running(EE) Display exists

We got the above error on an Ubuntu 20.04 workstation with the NVIDIA driver installed. We also tested on another Ubuntu 20.04 workstation without the NVIDIA driver; there the error disappeared and the server launched successfully.

I wonder if @renyu2016 faced the same issue. If so, have you solved the problem? Also, can we run the docker image with --net=host when the NVIDIA driver is installed on the host machine?

DavidPL1 commented 1 year ago

When the server docker container starts, it launches a virtual X display with number :1. It seems that on your machine this display id is already in use by nvidia.

I'm not sure if it is possible to change the display number nvidia is using, could you look that up? Otherwise I would have to patch the server in order for it to use a different one.
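Whether a display number is already taken can usually be checked on the host by looking for the X server's Unix socket, which by convention lives at /tmp/.X11-unix/X<number>. The sketch below shows one way to find the first unused number. The helper names and the search limit are hypothetical, and the check does not cover servers that listen only on abstract sockets or TCP.

```python
import os


def display_in_use(number: int, socket_dir: str = "/tmp/.X11-unix") -> bool:
    """Return True if a socket file for display :<number> exists."""
    return os.path.exists(os.path.join(socket_dir, f"X{number}"))


def first_free_display(
    start: int = 1, limit: int = 100, socket_dir: str = "/tmp/.X11-unix"
) -> int:
    """Return the lowest display number in [start, limit) with no socket file."""
    for number in range(start, limit):
        if not display_in_use(number, socket_dir):
            return number
    raise RuntimeError(f"no free display number found below {limit}")


if __name__ == "__main__":
    print(f"display :1 in use: {display_in_use(1)}")
    print(f"first free display: :{first_free_display()}")
```

Because the container is started with --net=host, a display already claimed on the host can conflict with the one the container tries to create, which would be consistent with the "server already running" message above.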

DavidPL1 commented 1 year ago

@Director-of-G I've patched the 2.1 server image accordingly. Please refer to this wiki section.

DavidPL1 commented 1 year ago
  1. Can we use GPU in our solution? Are there any limitations for GPU devices?

We now have the official confirmation for GPU support. For specs see #22