Open zhihanyang2022 opened 3 years ago
I think they only released an implementation of CQL(H). In their codebase, `min_q_version` is always set to 3, which corresponds to CQL(H). The equation with the log-sum-exp appears in Appendix F (Additional Experimental Setup and Implementation Details).
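To make the log-sum-exp point concrete, here is a minimal sketch of how a log-sum-exp over a continuous action space can be estimated by importance sampling from two proposals (a uniform one and a policy-like one). It is not taken from the repository: the 1-D `q_value`, the truncated-Gaussian stand-in for the policy, and all function names are hypothetical and chosen only so that the estimate can be checked against a closed form.

```python
import math
import random

def q_value(a):
    # Hypothetical Q(s, a) for one fixed state, with actions in [-1, 1].
    return -2.0 * (a - 0.3) ** 2

def logsumexp(xs):
    # Numerically stable log(sum(exp(x))) for a list of floats.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Assumed stand-in for the policy: a Gaussian truncated to [-1, 1].
MU, SIGMA = 0.0, 0.7
Z = normal_cdf((1.0 - MU) / SIGMA) - normal_cdf((-1.0 - MU) / SIGMA)

def sample_policy(rng):
    # Rejection sampling from the truncated Gaussian.
    while True:
        a = rng.gauss(MU, SIGMA)
        if -1.0 <= a <= 1.0:
            return a

def log_prob_policy(a):
    log_normal = (-0.5 * ((a - MU) / SIGMA) ** 2
                  - math.log(SIGMA * math.sqrt(2.0 * math.pi)))
    return log_normal - math.log(Z)

def estimate_log_integral(n, rng):
    # Estimates log of the integral of exp(Q(a)) over [-1, 1].
    # Each sample contributes Q(a) - log p(a), where p is the density of
    # the proposal that produced it; this is the importance weight in log space.
    terms = []
    for _ in range(n):
        a = rng.uniform(-1.0, 1.0)          # uniform proposal, density 1/2
        terms.append(q_value(a) - math.log(0.5))
    for _ in range(n):
        a = sample_policy(rng)              # policy-like proposal
        terms.append(q_value(a) - log_prob_policy(a))
    # Average the 2n importance-weighted terms, then take the log.
    return logsumexp(terms) - math.log(2 * n)

def exact_log_integral():
    # exp(Q) is an unnormalized Gaussian with mean 0.3 and std 0.5.
    s = 0.5
    mass = normal_cdf((1.0 - 0.3) / s) - normal_cdf((-1.0 - 0.3) / s)
    return math.log(s * math.sqrt(2.0 * math.pi) * mass)

if __name__ == "__main__":
    rng = random.Random(0)
    print(estimate_log_integral(5000, rng), exact_log_integral())
```

The point of the sketch is that samples from different distributions can share one log-sum-exp as long as each Q-value is corrected by the log-density of the distribution it was drawn from.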
Equation 7 is missing a term, namely the KL divergence. Once you add this term, you can derive the log-sum-exp.
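The step from a KL term to a log-sum-exp can be sketched with the standard variational identity for a KL-regularized maximization. The notation below is an assumption for illustration (discrete actions, with \rho denoting the prior distribution); it is not copied from the paper.

```latex
\begin{align*}
\max_{\mu}\; & \mathbb{E}_{a \sim \mu(\cdot \mid s)}\big[Q(s,a)\big]
  - D_{\mathrm{KL}}\big(\mu(\cdot \mid s)\,\|\,\rho(\cdot \mid s)\big)
  \quad \text{s.t. } \sum_a \mu(a \mid s) = 1 \\
\mu^*(a \mid s) &= \frac{\rho(a \mid s)\,\exp Q(s,a)}
  {\sum_{a'} \rho(a' \mid s)\,\exp Q(s,a')} \\
\text{optimal value} &= \log \sum_a \rho(a \mid s)\,\exp Q(s,a) \\
\rho = \mathrm{Unif}(\mathcal{A}):\quad & \log \sum_a \exp Q(s,a) - \log |\mathcal{A}|
\end{align*}
```

With a uniform prior, the result is the log-sum-exp of the Q-values up to the constant \log |\mathcal{A}|, which does not affect the gradient with respect to Q.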
This is a question about how CQL(rho) works at the code level 😊.
In the CQL section (starting from line 235) of `/CQL/d4rl/rlkit/torch/sac/cql.py`, several sets of Q-values are first computed and then used together to compute a single quantity.
I'm a bit confused about why the Q values of actions drawn from three distinct distributions can be used to compute this quantity:
- `q1_rand`: uniform distribution
- `q1_pred`: dataset distribution
- `q1_curr_actions` and `q1_next_actions`: last-iteration policy

Here are my questions:
I'm able to completely understand how CQL(H) works in the codebase though.