Improbable-AI / pql

Parallel Q-Learning: Scaling Off-policy Reinforcement Learning under Massively Parallel Simulation
MIT License

When do critic and actor updates take place? #1

Open kbkartik opened 8 months ago

kbkartik commented 8 months ago

Hi,

I came across your paper and had a few questions. My goal is to use your results and analysis to train discrete SAC on parallel MiniGrid environments.

In train_pql.py, you have variables like critic_unit_time, critic_update_times, sim_unit_time, and counter[0]['critic']. How do these variables relate to the beta_a_v and beta_p_v ratios?

Suppose you have 128 envs, a replay buffer of size 1e6, beta_a_v = 8, and beta_p_v = 2. Do you perform beta_a_v critic updates and beta_p_v policy updates every iteration step (i.e., in one iteration step, all 128 envs are executed)?

Thanks, kb

supersglzc commented 8 months ago

Hi,

In the config file pql/cfg/algo/pql_algo.yaml, there is an option called critic_sample_ratio. Setting critic_sample_ratio = 8 corresponds to beta_a_v = 1:8, which means that, within a unit of time, for every environment step we update the critic 8 times.
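
As a quick arithmetic check (the numbers below are illustrative, not taken from the repo), the ratio simply fixes how many critic updates should accompany each environment step:

```python
# Illustrative only: critic_sample_ratio = 8 means 8 critic updates
# per environment step, so over 1,000 env steps the critic should have
# been updated roughly 8,000 times.
critic_sample_ratio = 8
env_steps = 1_000
critic_updates = critic_sample_ratio * env_steps
print(critic_updates)  # 8000
```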

To achieve the above ratio, we want (wall-clock time of each critic update) : (wall-clock time of each data-collection step) = 1:8. The wall-clock time of each critic update is critic_unit_time = (time interval) / (number of critic updates within that interval), where the time interval is computed as time.time() - counter[0]['time'] and the number of critic updates within the interval is computed as critic_update_times - counter[0]['critic']. The wall-clock time of each data-collection step and of each policy update is computed analogously.
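
Here is a minimal sketch of that bookkeeping, assuming the counter layout and helper below (only counter[0]['time'] and counter[0]['critic'] appear in train_pql.py as cited above; the 'sim' key, sim_update_times, and the helper function are illustrative, not the actual PQL code):

```python
import time

# Running counts and the snapshot taken at the start of the current interval.
counter = [{'time': time.time(), 'critic': 0, 'sim': 0}]
critic_update_times = 0   # total critic updates so far
sim_update_times = 0      # total environment (data-collection) steps so far
critic_sample_ratio = 8   # beta_a_v = 1:8

def unit_times():
    """Average wall-clock time per critic update and per env step
    since the counter snapshot was last taken."""
    interval = time.time() - counter[0]['time']
    critic_unit_time = interval / max(critic_update_times - counter[0]['critic'], 1)
    sim_unit_time = interval / max(sim_update_times - counter[0]['sim'], 1)
    return critic_unit_time, sim_unit_time

# The target relation is critic_unit_time : sim_unit_time = 1 : critic_sample_ratio,
# i.e. critic_unit_time * critic_sample_ratio ~= sim_unit_time; the learner can
# compare the two measured unit times and throttle whichever process is running
# ahead of this target.
print(unit_times())
```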

Yes, in every iteration step (i.e., every environment step), all 128 envs are executed.
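
For completeness, a toy sketch (not PQL code; the shapes and dynamics are made up) of what one vectorized step over 128 envs yields, namely 128 transitions added to the replay buffer per iteration:

```python
import numpy as np

num_envs = 128
obs_dim = 4

obs_batch = np.zeros((num_envs, obs_dim))        # current observations for all envs
actions = np.random.randn(num_envs)              # one action per env
# A vectorized simulator returns all next states and rewards in one call:
next_obs_batch = obs_batch + 0.01 * actions[:, None]
rewards = -np.abs(actions)

print(next_obs_batch.shape)  # (128, 4): every env advanced in this iteration
```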