Closed: wayunderfoot closed this issue 4 years ago
I am very glad you are interested in my solution! In n-step SAC, the entropy term in the critic loss is treated as if it were a regular reward component. For example, suppose that from state s the actions (a_0, a_1, ..., a_n) were sampled during exploration, with corresponding log-probabilities (p_0, p_1, ..., p_n). We compute the scalar product R_sum of the log-probs (starting from the second term) with the discounting vector (g, g^2, ..., g^n), and the target for Q(s, a_0) is R_sum + Q(s_n, a_n).
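A minimal sketch of that target computation is below. It assumes the standard SAC conventions that the comment above leaves implicit (an entropy temperature `alpha`, an entropy bonus of `-alpha * log_prob`, environment rewards added with the same discounts, and a discounted bootstrap term); the function and argument names are just for illustration, not taken from the repository.

```python
import numpy as np

def n_step_sac_target(rewards, log_probs, q_next, gamma, alpha):
    """
    Sketch of an n-step SAC critic target with entropy treated as a reward.

    rewards   : [r_0, ..., r_{n-1}]  rewards along the n-step segment
    log_probs : [p_1, ..., p_n]      log-probs of the sampled actions,
                                     starting from the second action a_1
    q_next    : Q(s_n, a_n)          bootstrap value at the end of the segment
    gamma     : discount factor g
    alpha     : entropy temperature (assumed; not stated in the comment above)
    """
    n = len(rewards)

    # Ordinary discounted reward sum with discounts (1, g, ..., g^{n-1}).
    reward_sum = float(np.dot(gamma ** np.arange(n), rewards))

    # Entropy handled like a reward: scalar product of the log-probs
    # (from the second term onward) with the vector (g, g^2, ..., g^n).
    entropy_sum = float(np.dot(gamma ** np.arange(1, n + 1), log_probs))

    # Target for Q(s_0, a_0): rewards, entropy bonus, bootstrapped Q-value.
    return reward_sum - alpha * entropy_sum + (gamma ** n) * q_next
```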
Hello, I am learning model-free reinforcement learning. A few days ago I implemented it using single experience tuples, but the results were not very good. I would appreciate it if you could tell me how to implement n-step SAC, specifically how to store experience in the replay buffer and how to sample from it. Also, which variant do you use in your PER, rank-based or proportional?
Sorry for the delayed response. The repository will be updated soon, and all code will be available within one or two weeks. We used the same prioritization as in the R2D2 paper, i.e. proportional.
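In case it helps until the code is up, here is a rough sketch of proportional prioritized sampling; this is not our actual implementation (which uses a sum-tree for efficiency, as is usual for PER), and the names and default exponents are just illustrative.

```python
import numpy as np

def sample_proportional(priorities, batch_size, alpha=0.6, beta=0.4):
    """
    Sketch of proportional prioritized sampling: transitions are drawn with
    probability proportional to priority^alpha, and importance-sampling
    weights (exponent beta) correct for the non-uniform sampling.
    """
    scaled = np.asarray(priorities, dtype=np.float64) ** alpha
    probs = scaled / scaled.sum()

    # Draw a batch of indices according to the priority distribution.
    indices = np.random.choice(len(priorities), size=batch_size, p=probs)

    # Importance-sampling weights, normalized by the maximum for stability.
    weights = (len(priorities) * probs[indices]) ** (-beta)
    weights /= weights.max()
    return indices, weights
```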
Thanks for your help
I am quite interested in your algorithm. Could you tell me how to handle entropy in n-step SAC? Thanks.