Open MisakiTaro0414 opened 5 months ago
Thank you for your response!
In all our experiments, we did not use the option is_double=True, so it was not reported or described in the paper. In other words, only the first branch of the following if clause was used.
https://github.com/d-tiapkin/gflownet-rl/blob/434732044ffbadc7d4b585a2e04a1a047297d42c/hypergrid/algorithms/soft_dqn.py#L121-L135
Regarding the option is_double=True: it controls whether the Double DQN heuristic (see e.g. https://arxiv.org/abs/1509.06461), adapted to the entropy-regularized setting, is used. As I have already mentioned, we did not use it in our final experiments, so it was not described in our paper. In essence, instead of computing log-sum-exp, it takes the current policy (policy_sn) and evaluates the value associated with this policy under the target Q-value; multiplying by policy_sn and then applying torch.sum computes the expectation over actions.
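To make the difference between the two branches concrete, here is a minimal sketch in plain Python. It is not the code from soft_dqn.py: the function names, the example Q-values, and the entropy temperature of 1 are assumptions made for illustration, and action masking is left out. It relies on the standard identity that, for a softmax policy pi built from Q, log-sum-exp of Q equals the sum over actions of pi(a) * (Q(a) - log pi(a)). A Double-DQN-style variant would take pi from the current (online) network and Q from the target network, which is where a product with the policy followed by a sum would come from.

```python
import math


def log_softmax(q):
    """Numerically stable log-softmax over a list of Q-values."""
    m = max(q)
    lse = m + math.log(sum(math.exp(x - m) for x in q))
    return [x - lse for x in q]


def soft_value_logsumexp(q_target):
    """Soft value as log-sum-exp of the target Q-values (temperature 1 assumed)."""
    m = max(q_target)
    return m + math.log(sum(math.exp(x - m) for x in q_target))


def soft_value_double(q_online, q_target):
    """Double-DQN-style soft value (illustrative, hypothetical helper).

    The policy is the softmax of the online Q-values; the expectation of
    (target Q - log policy) under that policy is a product followed by a sum.
    """
    log_pi = log_softmax(q_online)
    return sum(math.exp(lp) * (qt - lp) for lp, qt in zip(log_pi, q_target))


# Hypothetical Q-values for one next state with three valid actions.
q_online = [1.0, 0.5, -0.2]
q_target = [0.8, 0.9, -0.1]

v_lse = soft_value_logsumexp(q_target)
v_double = soft_value_double(q_online, q_target)

# When the online and target Q-values coincide, both formulas agree.
v_same = soft_value_double(q_target, q_target)
```

Under these assumptions the policy-weighted sum never exceeds the log-sum-exp value, because log-sum-exp is the maximum of the entropy-regularized expectation over all policies. The two agree exactly when the policy is the softmax of the same Q-values being evaluated.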
I hope this explanation helps! If you have any other questions, please don't hesitate to ask.
Dear Authors,
First of all, thank you very much for your repository. I am unsure about the correctness of one part of the implementation. In soft_dqn.py, the variable valid_v_target_next is multiplied by policy_sn inside the torch.sum call. According to my derivation, there should be no such multiplication. Could you please explain how policy_sn enters the equation? Thanks.