Su-study opened this issue 5 months ago (status: Open)
If you do the training on a single machine without a cluster, it can take 2-3 days. The experiments demonstrated in the paper were conducted on a cluster with 100 CPU cores; each run takes 4-5 hours for the MLP architecture.
Thank you so much for your reply! I'm really eager to learn more about two things: Is the stopping criterion for your training determined by the `max_step` parameter in the config, or by other factors? Additionally, did you observe any training instability throughout the training process?
Hi! I am attempting to reproduce your work with a Singularity setup and I have a few questions about your comment:

> If you do the training on a single machine without a cluster, it can take 2-3 days. The experiments demonstrated in the paper were conducted on a cluster with 100 CPU cores; each run takes 4-5 hours for the MLP architecture.
- Does "a single machine without a cluster" mean training using only the CPU? I.e., does "cluster" imply GPU?
- My server is equipped with a 48-core CPU and two A6000 GPUs (48 GB each). When I run the RL process, it runs on the CPU, not the GPU (only one Python process uses about 1 GB of GPU memory, but it doesn't seem related to the core training process).
- Is it correct that the entire RL process runs on the CPU without using the GPU? Or do I have to set other configs to run with the GPU?
Hi, even though we collected data on 100 CPU cores, the bottleneck was still the data collection part. That being said, although training on GPU would make the model updates faster, the overall speed is still limited by data collection. A Transformer-based policy might benefit from GPU training more than an MLP, though.
I used GPUtil to automatically find the available device, so if you run `train.py` on a machine with GPUs, it should already use them. See here: https://github.com/Daffan/ros_jackal/blob/6935ee821fed2e23a6f440d07ea82137f180d168/train.py#L108
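For reference, the GPUtil-based selection pattern described above can be sketched roughly like this (hedged: this is not the repo's exact code, and the load/memory thresholds are illustrative):

```python
def select_device():
    """Return 'cuda:<id>' for an available GPU, else fall back to 'cpu'.

    Sketch of GPUtil-based device selection; not the repo's exact code.
    """
    try:
        import GPUtil
        import torch
        # Ask GPUtil for the least memory-loaded GPU that is not too busy.
        ids = GPUtil.getAvailable(order="memory", limit=1,
                                  maxLoad=0.9, maxMemory=0.9)
        if ids and torch.cuda.is_available():
            return f"cuda:{ids[0]}"  # e.g. "cuda:0"
    except ImportError:
        pass  # GPUtil/torch not installed -> CPU-only fallback
    return "cpu"

print(select_device())
```

On a machine with no usable GPU (or without GPUtil installed) this simply returns `"cpu"`, which matches the CPU-only behavior described in the thread.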
Thank you for your immediate response, and sorry to bother you, but I have one more question.
In `actor.py`, on line 110, the `initialize_policy` function receives `'cpu'` as the device argument:
https://github.com/Daffan/ros_jackal/blob/6935ee821fed2e23a6f440d07ea82137f180d168/actor.py#L110
Why did you explicitly set this argument to `'cpu'`? Is this code snippet the data collection part you mentioned? When I change it to `'cuda:0'`, it causes the following error:
>>>>>>>> Initializing the policy
>>>> Running on device cuda:0
>>>>>>>> Start training
>>>> Saving to logging/motion_control_continuous_laser-v0/TD3/2024_05_10_14_19/085e
>>>> initialized logging
>>>> Pre-collect experience
Traceback (most recent call last):
File "/home/msj/jackal_ws/src/ros_jackal/train.py", line 311, in <module>
train(env, policy, replay_buffer, config)
File "/home/msj/jackal_ws/src/ros_jackal/train.py", line 221, in train
collector.collect(n_steps=training_config['pre_collect'])
File "/home/msj/jackal_ws/src/ros_jackal/rl_algos/collector.py", line 180, in collect
print(o[0])
KeyError: 0
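(For context: a `KeyError: 0` at `o[0]` means `o` is being integer-indexed but is not a sequence. One common cause, sketched below with a hypothetical dict observation, is that the observation is a dict, so `o[0]` looks up the key `0` instead of the first element.)

```python
# Minimal reproduction of the KeyError above. If the observation `o` is
# a dict (hypothetical example, keyed by sensor name) rather than an
# array, integer indexing looks up the key 0 and fails.
o = {"laser": [0.1, 0.2], "goal": [1.0, 0.0]}  # hypothetical observation

try:
    print(o[0])  # same expression as collector.py line 180
except KeyError as err:
    print("KeyError:", err)
```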
Hi, I was talking about the training script. You can set the data collection part (`actor.py`) to use the GPU. There is no special reason to use CPU here; I set it to CPU only because I used a cluster that only has CPU cores.
Hi, I'd like to know how long it takes to achieve the results mentioned in your article with the `yaml` configs under the `configs` folder?