Prioritized replay buffer

I've added a prioritized replay buffer. It:

Only adds examples if their reward is larger than the min reward found in the buffer.
Only adds examples if they are unique (by default).

In general, uniqueness is determined by the p-norm distance between each state in the candidate batch and the states already in the buffer; both the norm and the distance threshold are configurable by the user. The default settings use a p-norm of 1 and a distance threshold of 0, i.e., an added state must not be identical to any state already in the buffer.
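
To make the two admission rules concrete, here is a minimal pure-Python sketch. It is not the code in this PR: the names `p_norm_distance` and `should_add` are hypothetical, states are assumed to be plain coordinate tuples, and the reward gate is assumed to apply whenever the buffer is non-empty.

```python
from typing import List, Sequence

State = Sequence[float]


def p_norm_distance(a: State, b: State, p: float = 1.0) -> float:
    """Minkowski (p-norm) distance between two state vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def should_add(
    state: State,
    reward: float,
    buffer_states: List[State],
    buffer_rewards: List[float],
    p: float = 1.0,
    cutoff_distance: float = 0.0,
) -> bool:
    """Hypothetical admission check mirroring the two rules described above."""
    # Reward gate (assumption: applied whenever the buffer is non-empty):
    # the candidate must beat the smallest reward currently stored.
    if buffer_rewards and reward <= min(buffer_rewards):
        return False
    # Uniqueness gate: the candidate must be strictly farther than the
    # cutoff from every stored state. With cutoff 0 this rejects exact
    # duplicates only.
    return all(
        p_norm_distance(state, stored, p) > cutoff_distance
        for stored in buffer_states
    )


buffer_states = [(0, 0), (1, 2)]
buffer_rewards = [0.1, 0.5]

duplicate = should_add((1, 2), 0.9, buffer_states, buffer_rewards)
novel = should_add((2, 2), 0.9, buffer_states, buffer_rewards)
low_reward = should_add((2, 2), 0.05, buffer_states, buffer_rewards)
```

With the defaults (p = 1, cutoff 0), `duplicate` is rejected because its distance to `(1, 2)` is exactly 0, `novel` is accepted, and `low_reward` is rejected by the reward gate. Raising `cutoff_distance` makes the uniqueness test stricter.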

Importantly, you can test this using:

tutorials/examples/train_hypergrid.py --replay_buffer_size 1000 --replay_buffer_prioritized

Note: currently, the standard buffer outperforms the prioritized buffer on loss and logZ_diff (though not on l1_dist) using these default settings!

Standard: 'loss': 9.107343066716567e-05, 'states_visited': 998416, 'l1_dist': 0.00023296871222555637, 'logZ_diff': 0.001130819320678711
Prioritized: 'loss': 0.0003514138516038656, 'states_visited': 998416, 'l1_dist': 0.00017267849761992693, 'logZ_diff': 0.0020639896392822266

In the debugger, I found that no samples were ever added to the buffer after it was initially filled, because none of the candidate states were found to be unique. That is, in replay_buffer.py, the following logic always produced an idx_batch_buffer consisting entirely of False:

    # Remove non-diverse examples according to the above distances.
    idx_batch_batch = batch_batch_dist > self.cutoff_distance
    idx_batch_buffer = batch_buffer_dist > self.cutoff_distance
    idx_diverse = idx_batch_batch & idx_batch_buffer
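
As an illustration of how this mask can end up all False, the sketch below reproduces only the batch-versus-buffer half of the check in pure Python. Everything in it is an assumption for illustration: the `diverse_mask` helper, the 4x4 grid, and the idea that the buffer already covers every state the sampler proposes. It has not been checked against the actual run.

```python
from itertools import product
from typing import List, Sequence, Tuple

State = Tuple[int, ...]


def l1(a: Sequence[int], b: Sequence[int]) -> int:
    """L1 (p = 1) distance between two grid states."""
    return sum(abs(x - y) for x, y in zip(a, b))


def diverse_mask(
    batch: List[State], buffer: List[State], cutoff_distance: float = 0.0
) -> List[bool]:
    """True where a batch state is farther than the cutoff from every buffer state.

    Hypothetical stand-in for `batch_buffer_dist > self.cutoff_distance`,
    using the minimum distance from each batch state to the buffer.
    """
    return [min(l1(s, b) for b in buffer) > cutoff_distance for s in batch]


# Hypothetical 2-D grid with side length 4: only 16 distinct states exist.
all_states: List[State] = list(product(range(4), repeat=2))
batch: List[State] = [(0, 0), (3, 1), (2, 2)]

# Case 1: the buffer already holds every state, so each candidate sits at
# distance 0 from some stored state and nothing passes the filter.
mask_full = diverse_mask(batch, all_states)

# Case 2: the buffer covers only part of the space, so an unseen state passes.
partial_buffer: List[State] = [(0, 0), (3, 1)]
mask_partial = diverse_mask(batch, partial_buffer)
```

Here `mask_full` is all False while `mask_partial` admits only `(2, 2)`. If the sampler mostly revisits states that are already stored, a cutoff of 0 would freeze the buffer in the same way, which is at least consistent with the behaviour seen in the debugger.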

@saleml I'd be curious to get your opinion on this. Perhaps we can tweak the implementation of the prioritized replay buffer, or perhaps this should be expected behaviour for this relatively simple example. I am not sure.

GFNOrg / torchgfn

Prioritized replay buffer #175