Improbable-AI / pql

Parallel Q-Learning: Scaling Off-policy Reinforcement Learning under Massively Parallel Simulation
MIT License

When do critic and actor updates take place? #1

Open kbkartik opened 8 months ago

kbkartik commented 8 months ago


I came across your paper and have some questions. My goal is to use your results and analysis to train discrete SAC on parallel minigrid environments.

In your code, you have variables like critic_unit_time, critic_update_times, sim_unit_time, and counter[0]['critic']. How do these variables relate to the beta_a_v and beta_p_v ratios?

Suppose you have 128 envs with a replay buffer size of 1e6, beta_a_v = 8, and beta_p_v = 2. Do you do beta_a_v updates to the critic and beta_p_v updates to the policy at every iteration step (i.e. are all 128 envs stepped in one iteration step)?

Thanks, kb

supersglzc commented 8 months ago


In the config file pql/cfg/algo/pql_algo.yaml, there is an option called critic_sample_ratio. Setting critic_sample_ratio = 8 corresponds to beta_a_v = 1:8, which means that, within a unit of time, we update the critic 8 times for every environment step.

To achieve this ratio, we want (wall-clock time per critic update) : (wall-clock time per data-collection step) = 1:8. The wall-clock time per critic update is critic_unit_time = (time interval) / (number of critic updates within the interval). The time interval is computed as time.time() - counter[0]['time'], and the number of critic updates within the interval is computed as critic_update_times - counter[0]['critic']. The wall-clock time per data-collection step and per policy update are computed in the same way.
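The bookkeeping described above can be sketched roughly as follows. This is only an illustration, not the repository's actual code: the helper names unit_time and critic_is_ahead, the 'sim' key in the snapshot, and the explicit now argument are assumptions made here so the example is deterministic. Only the 'time' and 'critic' keys, critic_update_times, and critic_sample_ratio come from the discussion above.

```python
def unit_time(now, start_time, count_now, count_start):
    """Average wall-clock seconds per event over the measurement interval.

    Mirrors: (time.time() - counter[0]['time'])
             / (critic_update_times - counter[0]['critic'])
    """
    n_events = count_now - count_start
    if n_events <= 0:
        return float("inf")  # no events recorded in this interval yet
    return (now - start_time) / n_events


def critic_is_ahead(sim_unit_time, critic_unit_time, critic_sample_ratio=8):
    """True if the critic updates faster than the target 1:critic_sample_ratio.

    The target is sim_unit_time == critic_sample_ratio * critic_unit_time,
    i.e. critic_sample_ratio critic updates per environment step.
    Written as a product so that a zero unit time cannot divide by zero.
    """
    return sim_unit_time > critic_sample_ratio * critic_unit_time


# Hypothetical snapshot taken at the start of the interval
# (the 'sim' key is an assumption; 'time' and 'critic' follow the thread).
snapshot = {"time": 0.0, "critic": 0, "sim": 0}

now = 10.0                  # in real code: time.time()
sim_steps = 100             # environment steps so far (each steps all envs)
critic_update_times = 900   # critic updates so far

sim_ut = unit_time(now, snapshot["time"], sim_steps, snapshot["sim"])
critic_ut = unit_time(now, snapshot["time"], critic_update_times, snapshot["critic"])

# 900 updates per 100 env steps is 9 per step, above the target of 8.
print(critic_is_ahead(sim_ut, critic_ut, critic_sample_ratio=8))
```

What happens once one side is found to be ahead (for example, whether it waits for the other) is not covered by this sketch and should be checked against the repository.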

Yes, at every iteration step (i.e. environment step), all 128 envs are executed.