Implementation correctness

Thank you for your response! None of our experiments used the option is_double=True, so it is neither reported nor described in the paper. In other words, only the first branch of the following if clause was used: https://github.com/d-tiapkin/gflownet-rl/blob/434732044ffbadc7d4b585a2e04a1a047297d42c/hypergrid/algorithms/soft_dqn.py#L121-L135

Regarding is_double=True: this option controls the use of the Double DQN heuristic (see e.g. https://arxiv.org/abs/1509.06461), adapted to the entropy-regularized setting. Since we did not use it in our final experiments, it is not described in the paper. In essence, instead of computing log-sum-exp, it takes the current policy (policy_sn) and computes the value associated with this policy and the target Q-values; the product with policy_sn followed by torch.sum computes the expectation over actions.
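To make the difference between the two targets concrete, here is a minimal plain-Python sketch. It is not the code from soft_dqn.py. It assumes temperature 1, assumes the current policy is the softmax of the online Q-values, and assumes the policy value includes the entropy term, i.e. V(s') = sum_a pi(a|s') * (Q_target(s', a) - log pi(a|s')). The function names and the example numbers are hypothetical; only policy_sn, log-sum-exp, and the sum over actions come from the description above.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))), the analogue of torch.logsumexp."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def soft_value_logsumexp(target_q):
    """Standard soft target: V(s') = log sum_a exp(Q_target(s', a))."""
    return logsumexp(target_q)


def soft_value_double(online_q, target_q):
    """Double-DQN-style soft target (assumed form, see the text above).

    The policy comes from the online Q-values, while the values being
    averaged come from the target Q-values.
    """
    lse = logsumexp(online_q)
    log_policy_sn = [q - lse for q in online_q]        # log-softmax of online Q
    policy_sn = [math.exp(lp) for lp in log_policy_sn]  # current policy at s'
    # Product with policy_sn, then a sum over actions: the expectation
    # of (Q_target - log pi) under the current policy.
    return sum(p * (qt - lp)
               for p, lp, qt in zip(policy_sn, log_policy_sn, target_q))


# Hypothetical Q-values for one next state with three actions.
online_q = [0.5, -1.0, 2.0]
target_q = [0.3, -0.8, 1.7]

v_lse = soft_value_logsumexp(target_q)
v_double = soft_value_double(online_q, target_q)
# When the policy is the softmax of the target Q-values themselves,
# the expectation reduces to log-sum-exp.
v_same = soft_value_double(target_q, target_q)
```

Under these assumptions the two targets coincide when the online and target networks agree, and the double variant is never larger than the log-sum-exp target, which is the usual motivation for Double DQN-style estimators.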

I hope this explanation helps! If you have any other questions, please don't hesitate to ask.

d-tiapkin / gflownet-rl

Implementation correctness #1