Closed anilkurkcu closed 2 years ago
Hello,
What you are looking for is called learning_starts
in the doc (for SAC/TD3/DDPG/TQC/DQN).
DQN has additional parameters for the epsilon-greedy exploration; it is best to look at the doc in that case.
Thank you for your answer. How about for A2C? I could not come across something similar to this in the docs.
You should read more about A2C (we have some links in the doc). A2C is on-policy, it must use its current policy to collect the data, so it cannot have a purely random exploration phase (and it doesn't have a replay buffer).
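To make the on-policy point concrete, here is a minimal standalone sketch (illustrative names, not SB3 internals): an A2C-style discrete policy explores by sampling from the softmax distribution over its current logits, so exploration comes from the stochasticity of the policy itself rather than from a separate random phase.

```python
# Sketch of on-policy exploration: actions are sampled from the current
# policy's categorical distribution. Higher entropy = more exploration;
# A2C adds an entropy bonus to the loss (ent_coef in SB3) to keep the
# policy from collapsing to a deterministic one too early.
import math
import random

def softmax(logits):
    # Numerically stable softmax over action logits.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng=random):
    # Draw one action index according to the policy probabilities.
    r = rng.random()
    cum = 0.0
    for a, p in enumerate(probs):
        cum += p
        if r <= cum:
            return a
    return len(probs) - 1

def entropy(probs):
    # Entropy of the action distribution, in nats.
    return -sum(p * math.log(p) for p in probs if p > 0)

probs = softmax([2.0, 0.5, 0.1])  # current policy output for one state
action = sample_action(probs)     # stochastic: usually 0, sometimes 1 or 2
```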
I see. Then how does this algorithm decide upon the exploration/exploitation tradeoff?
How could I determine the number of exploration steps for a certain algorithm? I guess there is a default number of timesteps for the random exploration phase, and I would like to increase that number.