Daffan / ros_jackal

ROS-Jackal environment for RL
MIT License

Training duration #9

Open Su-study opened 5 months ago

Su-study commented 5 months ago

Hi, I'd like to know how long it takes to reproduce the results reported in your paper when training with the yaml files under the configs folder.

Daffan commented 5 months ago

If you do the training on a single machine without a cluster, it can take 2-3 days. The experiments in the paper were run on a cluster with 100 CPU cores; there, each run takes 4-5 hours for the MLP architecture.

Su-study commented 5 months ago

Thank you so much for your reply! I'm eager to learn more about two things: is the stopping criterion for training determined by the 'max_step' parameter in the config, or by other factors? Also, did you observe any training instability during the process?

msjun23 commented 4 months ago

Hi! I am attempting to reproduce your work with a Singularity setup, and I have a few questions about your comment above (2-3 days on a single machine vs. the 100-CPU-core cluster):

  • Does "a single machine without a cluster" mean training using only the CPU? I.e., does "cluster" imply GPU?
  • My server has a 48-core CPU and two A6000 GPUs (48 GB each). When I run the RL process, it runs on the CPU, not the GPU (only one Python process uses about 1 GB of GPU memory, and it doesn't seem related to the core training process).
  • Is it correct that the whole RL process runs on the CPU without using the GPU, or do I need to change other configs to run on the GPU?

Daffan commented 4 months ago

Hi, even though we collected data on 100 CPU cores, the bottleneck was still the data collection. That said, while training on a GPU would make the model updates faster, the overall speed is still limited by data collection. A Transformer-based policy might benefit from GPU training more than an MLP, though.

I used GPUtil to automatically find the available device, so if you run train.py on a machine with GPUs, it should already use them. See here: https://github.com/Daffan/ros_jackal/blob/6935ee821fed2e23a6f440d07ea82137f180d168/train.py#L108
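
For readers reproducing this, here is a minimal sketch (not the repository's exact code) of how GPUtil can pick an idle GPU and fall back to the CPU, in the spirit of the device selection in train.py; the helper name select_device is illustrative:

```python
# Illustrative GPUtil-based device selection; GPUtil and torch must be installed.
import GPUtil
import torch

def select_device():
    # Ask GPUtil for a GPU that is mostly idle; fall back to the CPU if none is free.
    available = GPUtil.getAvailable(order="first", limit=1, maxLoad=0.5, maxMemory=0.5)
    if torch.cuda.is_available() and available:
        return "cuda:%d" % available[0]
    return "cpu"

device = select_device()
print(">>>> Running on device", device)
```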

msjun23 commented 4 months ago

Thank you for your immediate response, and sorry to bother you, but I have one more question.

In actor.py, on line 110, the initialize_policy function receives 'cpu' as the device argument.

https://github.com/Daffan/ros_jackal/blob/6935ee821fed2e23a6f440d07ea82137f180d168/actor.py#L110

[Screenshot of the actor.py snippet, 2024-05-10]

Why did you explicitly set this argument to 'cpu'? Is this code snippet referring to the data collection part you mentioned? When I change it to 'cuda:0', it causes the following error.

>>>>>>>> Initializing the policy
    >>>> Running on device cuda:0
>>>>>>>> Start training
    >>>> Saving to logging/motion_control_continuous_laser-v0/TD3/2024_05_10_14_19/085e
    >>>> initialized logging
    >>>> Pre-collect experience
Traceback (most recent call last):
  File "/home/msj/jackal_ws/src/ros_jackal/train.py", line 311, in <module>
    train(env, policy, replay_buffer, config)
  File "/home/msj/jackal_ws/src/ros_jackal/train.py", line 221, in train
    collector.collect(n_steps=training_config['pre_collect'])
  File "/home/msj/jackal_ws/src/ros_jackal/rl_algos/collector.py", line 180, in collect
    print(o[0])
KeyError: 0

Daffan commented 4 months ago

Hi, I was talking about the training script. You can set the data collection part (actor.py) to use the GPU; there is no special reason to use the CPU there. I set it to CPU only because I used a cluster that has only CPU cores.
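
For anyone attempting that change, a hedged sketch of choosing the actor's device at runtime instead of hard-coding 'cpu'. ACTOR_DEVICE is an illustrative environment variable (not an option the repo provides), and the commented-out initialize_policy call only assumes the device keyword discussed above, with placeholder arguments:

```python
# Hypothetical sketch: pick the data-collection device at runtime instead of hard-coding 'cpu'.
import os
import torch

requested = os.environ.get("ACTOR_DEVICE", "cpu")  # e.g. export ACTOR_DEVICE=cuda:0
device = requested if (requested == "cpu" or torch.cuda.is_available()) else "cpu"

# Placeholder call: only the device keyword is taken from this thread; the real
# arguments are whatever actor.py actually passes to initialize_policy.
# policy = initialize_policy(config, env, device=device)
print(">>>> Actor running on device", device)
```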