Open kbkartik opened 8 months ago
Hi,
In the config file `pql/cfg/algo/pql_algo.yaml`, there is a config called `critic_sample_ratio`. Setting `critic_sample_ratio = 8` corresponds to beta_a_v = 1:8, which means that, within a unit of time, for every environment step we update the critic 8 times.

To achieve this ratio, we want (wall-clock time of every critic update) : (wall-clock time of every data collection step) = 1:8, where the wall-clock time of every critic update is `critic_unit_time = time_interval / number_of_critic_updates_within_the_interval`. The time interval is computed as `time.time() - counter[0]['time']`, and the number of critic updates within the interval as `critic_update_times - counter[0]['critic']`. The wall-clock time of every data collection step and of every policy update are computed in the same way.
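To make the bookkeeping concrete, here is a hedged sketch of the unit-time computation described above. The names `counter` and `critic_update_times` follow this answer's description of `train_pql.py`; the exact code in the repo may differ, and the `now` parameter is added here only to make the example deterministic.

```python
import time

def critic_unit_time(counter, critic_update_times, now=None):
    """Wall-clock time per critic update over the interval since counter[0].

    Sketch only: `counter` is assumed to be a list whose first entry records
    the timestamp and critic-update count at the start of the interval.
    """
    now = time.time() if now is None else now
    interval = now - counter[0]['time']                      # time interval
    n_updates = critic_update_times - counter[0]['critic']   # updates in interval
    return interval / max(n_updates, 1)                      # guard divide-by-zero
```

For example, if 4 critic updates happened over an 8-second interval, the critic's unit time is 2 seconds per update; comparing this against the simulation's unit time tells the scheduler whether the 1:8 wall-clock ratio is being maintained.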
Yes — at every iteration step (i.e., environment step), all 128 envs are executed.
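The per-iteration structure can be sketched as a toy, synchronous loop. This is an illustration only: the real PQL pipeline is asynchronous and balances these ratios by wall-clock time, and the numbers below (128 envs, beta_a_v = 8, beta_p_v = 2) come from the question, not from measured behavior.

```python
# Toy synchronous approximation of one training run's iteration structure.
NUM_ENVS = 128   # parallel environments stepped together each iteration
BETA_A_V = 8     # critic updates per environment step
BETA_P_V = 2     # policy updates per environment step

env_steps = critic_updates = policy_updates = 0
for _ in range(10):              # 10 iterations of the training loop
    env_steps += NUM_ENVS        # all 128 envs take one step each
    critic_updates += BETA_A_V   # then 8 critic updates
    policy_updates += BETA_P_V   # then 2 policy updates

# After 10 iterations: 1280 env transitions, 80 critic updates, 20 policy updates.
```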
Hi,
I came across your paper and had some doubts. My goal is to use your results and analysis to train discrete SAC for parallel minigrid environments.
In `train_pql.py`, you have variables like `critic_unit_time`, `critic_update_times`, `sim_unit_time`, and `counter[0]['critic']`. How do these variables relate to the beta_a_v and beta_p_v ratios?

Suppose you have 128 envs with a replay buffer of size 1e6, beta_a_v = 8, and beta_p_v = 2. Do you do beta_a_v updates to the critic and beta_p_v updates to the policy at every iteration step (i.e., in one iteration step, all 128 envs will be executed)?
Thanks, kb