Fix off policy

This fixes https://github.com/GFNOrg/torchgfn/issues/168. The idea is to remove the previous off_policy and sample_off_policy arguments and instead be explicit about what we evaluate and store when sampling. When sampling on-policy, which is the default, we should store the logprobs. When sampling off-policy with a tempered/modified PF, we should only store estimator_outputs. When we use a replay buffer, we don't need to store anything, because the logprobs should be recalculated.
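
The three storage cases can be illustrated with a small standalone sketch. This is not the torchgfn implementation: `sample_step`, `StepRecord`, `recompute_log_prob`, and the `save_logprobs` / `save_estimator_outputs` flags are hypothetical names chosen for illustration, and a plain list of logits stands in for the estimator output.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Optional


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


@dataclass
class StepRecord:
    """What one sampling step keeps (hypothetical container)."""
    action: int
    log_prob: Optional[float] = None
    estimator_output: Optional[List[float]] = None


def sample_step(logits, *, temperature=1.0, save_logprobs=True,
                save_estimator_outputs=False, rng=None):
    """Sample an action; be explicit about what is stored.

    temperature != 1.0 stands in for a tempered/modified PF (assumption).
    """
    rng = rng or random.Random()
    # Behaviour policy actually used to draw the action.
    behaviour = log_softmax([x / temperature for x in logits])
    weights = [math.exp(lp) for lp in behaviour]
    action = rng.choices(range(len(logits)), weights=weights)[0]

    record = StepRecord(action=action)
    if save_logprobs:
        # On-policy: the behaviour log-prob is the policy log-prob.
        record.log_prob = behaviour[action]
    if save_estimator_outputs:
        # Off-policy: keep the raw outputs so the untempered
        # log-prob can be derived later.
        record.estimator_output = list(logits)
    return record


def recompute_log_prob(record, logits=None):
    """Log-prob of the stored action under the untempered policy.

    Uses stored estimator outputs if present, otherwise fresh logits
    (the replay-buffer case, where nothing was stored).
    """
    source = record.estimator_output
    if source is None:
        source = logits
    if source is None:
        raise ValueError("need stored estimator outputs or fresh logits")
    return log_softmax(source)[record.action]
```

In this sketch, on-policy sampling uses the defaults, off-policy sampling passes `save_logprobs=False, save_estimator_outputs=True`, and the replay-buffer case passes `save_logprobs=False` alone and later calls `recompute_log_prob(record, fresh_logits)`.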

Additionally, this fixes FM + ReplayBuffer, which was broken before because extending states didn't take the _log_probs attribute into account.


Fix off policy #174