Daffan / ros_jackal

ROS-Jackal environment for RL
MIT License

Is it possible to train in singularity container? #8

Closed J1dan closed 5 months ago

J1dan commented 6 months ago

I built the image from the Singularityfile.def file, producing nav_competition_image.sif. When I run ./singularity_run.sh ./local_buffer/nav_competition_image.sif python train.py --config configs/e2e_default_TD3.yaml, the output is as follows:

(base) jidan@jidan:~/AVWorkSpace/jackal_ws/src/ros_jackal$ ./singularity_run.sh ./nav_competition_image.sif python train.py --config configs/e2e_default_TD3.yaml
/venv/lib/python3.6/site-packages/gym/core.py:27: UserWarning: WARN: Gym minimally supports python 3.6 as the python foundation not longer supports the version, please update your version to 3.7+
  "Gym minimally supports python 3.6 as the python foundation not longer supports the version, please update your version to 3.7+"

Loading the configuration from configs/e2e_default_TD3.yaml
Creating the environments
/venv/lib/python3.6/site-packages/gym/spaces/box.py:127: UserWarning: WARN: Box bound precision lowered by casting to float32
  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
Initializing the policy
Running on device cuda:0
/venv/lib/python3.6/site-packages/torch/cuda/__init__.py:143: UserWarning:
NVIDIA GeForce RTX 3050 Ti Laptop GPU with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA GeForce RTX 3050 Ti Laptop GPU GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))

Start training
Saving to logging/motion_control_continuous_laser-v0/TD3/2024_03_19_12_40/9933
initialized logging
Pre-collect experience
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/jackal_ws/src/ros_jackal/rl_algos/collector.py", line 27, in run_actor_in_container
    options=["-i", "-n", "--network=none", "-p"], nv=True
  File "/venv/lib/python3.6/site-packages/spython/main/execute.py", line 116, in execute
    environ=environ,
  File "/venv/lib/python3.6/site-packages/spython/main/base/command.py", line 142, in run_command
    background=background,
  File "/venv/lib/python3.6/site-packages/spython/utils/terminal.py", line 197, in run_command
    process = subprocess.Popen(cmd, stderr=subprocess.PIPE, stdout=stdout, env=environ)
  File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'singularity': 'singularity'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "train.py", line 311, in <module>
    train(env, policy, replay_buffer, config)
  File "train.py", line 221, in train
    collector.collect(n_steps=training_config['pre_collect'])
  File "/jackal_ws/src/ros_jackal/rl_algos/collector.py", line 178, in collect
    output = p.map(run_actor_in_container, self.ids)
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
FileNotFoundError: [Errno 2] No such file or directory: 'singularity': 'singularity'

I am new to Singularity containers. Can someone give me a hint? I would also like to know the difference between the image built from Singularityfile.def and the image pulled using the command, and how the two work together. Thanks a lot.

Daffan commented 5 months ago

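Hi, you need to run python train.py --config configs/e2e_default_TD3.yaml locally, outside the Singularity container. The training script itself invokes Singularity containers, which only load the latest policy and collect rollout trajectories from Gazebo simulated inside the container. Because the singularity executable is not available inside the image, launching train.py from within the container fails with the FileNotFoundError shown above.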
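For anyone who hits the same traceback, a small pre-flight check can make the cause obvious before any worker process is started. The sketch below is not part of ros_jackal; the helper names find_singularity and require_singularity are hypothetical, and it only assumes that the collector launches an executable called singularity, as the traceback indicates.

```python
import shutil


def find_singularity(executable="singularity"):
    """Return the path of `executable` if it is found on PATH, else None."""
    return shutil.which(executable)


def require_singularity(executable="singularity"):
    """Fail early with a readable message instead of inside a pool worker.

    Hypothetical helper: call it on the host before training starts.
    """
    path = find_singularity(executable)
    if path is None:
        raise RuntimeError(
            f"'{executable}' was not found on PATH. If this script is running "
            "inside the container, start it from the host shell instead."
        )
    return path
```

Calling require_singularity() at the top of the training entry point, on the host, would surface the problem as a single clear error rather than a RemoteTraceback from multiprocessing.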

J1dan commented 5 months ago

I see, thank you