Update: A test on Frostbite has only achieved DQN/DDQN scores, despite rank-based prioritised experience replay achieving much higher scores in the paper. Therefore the current implementation is either wrong or suboptimal. Proportional prioritised replay still needs to be implemented.
Hi, what's the current progress with the prioritized experience replay implementation? Has it been successfully tested yet? I noticed it was struck through in the readme. Thanks!
As above, I attempted an implementation of rank-based prioritised experience replay, but it got nowhere near the scores it should have when I ran a test on Frostbite. The paper helpfully provides complete training graphs for comparison, so clearly there's a bug somewhere. Proportional prioritised experience replay has not been attempted yet since it is trickier to get right.
Ah, I see. Any ideas on where the bug could be? I could help take a look as well. I've been trying to find a prioritized experience replay implementation, so if you know of any others available publicly, that'd be great! Thanks again.
For rank-based, first you need a priority queue, which I've implemented as a max binary heap. I did quick tests halfway through the implementation, but checking that would be the first thing to do.
Apart from the priority (the absolute TD error) being the key, the value is actually an index which indicates the location of the sample in the experience replay buffer. So the first things to check are the priority update for a new sample and the priority update after learning. The actual algorithm/maths to compare against the paper is here. The logic for picking the partitions for sampling is here.
There are a few more bits, but those are the important ones to try first. I've looked a lot, but this does seem to be the only public attempt - so no chance of porting code from others!
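To make that layout concrete, here's a minimal sketch with illustrative names (not the actual BinaryHeap.lua code): each heap element is a {priority, replayIndex} pair, plus a hash table mapping replay indices back to heap positions so priorities can be located and updated after learning.

```lua
-- Illustrative sketch of the key/value layout described above (not the repo's code).
local PriorityQueue = {}
PriorityQueue.__index = PriorityQueue

function PriorityQueue.new()
  -- array[i] = {priority, replayIndex}, ordered as a max heap on priority
  -- hash[replayIndex] = position of that sample in the array
  return setmetatable({array = {}, hash = {}, size = 0}, PriorityQueue)
end

-- Insert a new transition: key = |TD error|, value = index into the replay buffer.
function PriorityQueue:insert(priority, replayIndex)
  self.size = self.size + 1
  self.array[self.size] = {priority, replayIndex}
  self.hash[replayIndex] = self.size
  -- A sift-up from position self.size would then restore the max-heap ordering.
end
```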
Just a quick glance to see if anything jumps out at my uneducated fresh eyes. Haven't used BinaryHeap structures enough to know if this is something or not...
https://github.com/Kaixhin/Atari/blob/master/structures/BinaryHeap.lua#L116
Would having the recursion inside those conditions mean it only continues rebalancing the heap if it is currently working on an item which is in the wrong position? It stops as soon as it hits an item which doesn't need moving; is that intended? Are they already in order up to this point?
@mryellow Yes that is the case - one of the ideas behind this data structure is that if an element is inserted (or replaced - the less straightforward case), it will either travel up or down (or stay) along its path in the tree (a heap is a specialised tree). So going from an empty or ordered tree, each operation only needs to deal with the element in question.
Tests would be best, but bar that referring to pseudocode/code in other languages would be one way of checking this. One error I might have made is in converting from the usual 0-based indexing to 1-based indexing.
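For reference, here's what a sift-up looks like with Lua's 1-based indexing (a generic sketch, not the repo's BinaryHeap.lua): with the root at index 1, the parent of element i is floor(i/2) and its children are 2i and 2i+1, whereas 0-based pseudocode uses floor((i-1)/2), 2i+1 and 2i+2, so this is exactly the kind of place an off-by-one can creep in.

```lua
-- Generic max-heap sift-up with Lua's 1-based indexing (illustrative, not the repo's code).
-- array[i] = {priority, value}; the root lives at index 1.
local function upHeap(array, i)
  if i > 1 then
    local parent = math.floor(i / 2)          -- 0-based code would use floor((i - 1) / 2)
    if array[parent][1] < array[i][1] then    -- max heap: parent must be >= child
      array[parent], array[i] = array[i], array[parent]
      upHeap(array, parent)                   -- recurse only while still out of order
    end
  end
end
```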
each operation only needs to deal with the element in question
Seemed that was the idea, my formal CS is weak.
One error I might have made is in converting from the usual 0-based indexing to 1-based indexing.
That was one area which stood out, the bumping up/down of size. Could easily be a missing one somewhere, either in an assignment or a condition. I'll do some logging sometime through the week and see if anything jumps out.
In BinaryHeap, should the parent priority always be less than both children?
https://github.com/mryellow/Atari/blob/test-prior/spec/structures.lua#L140-L151
I'm failing those after insert calls, however after calls to updateByVal everything is sorted.
@mryellow In a max heap (pictured in Wikipedia) the parent should always be greater than both children. Inserting or replacing will almost certainly mess up the order, which is why calling upHeap on an insert, and both downHeap and upHeap on a replace, should fix the heap ordering.
Yeah that's checking out, with "greater than or equal to".
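A quick invariant check along those lines (an illustrative test helper assuming the {priority, value} array layout; not the repo's spec code): every parent should be greater than or equal to both of its children.

```lua
-- Verify the max-heap property over the whole backing array (illustrative helper).
local function isMaxHeap(array, size)
  for i = 2, size do
    local parent = math.floor(i / 2)
    if array[parent][1] < array[i][1] then
      return false, i  -- child i outranks its parent: heap property violated
    end
  end
  return true
end
```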
With indices, ISWeights = experience:sample(1), the last of the indices corresponds to the highest ISWeights, while this index should also point to the highest priority in the sample?
Experience:sample() doesn't take any arguments (the number of samples returned is hard-coded as the minibatch size). If I remember correctly, rankIndices returns indices within the heap, descending in priority. The hash table is then used to retrieve the actual index in the replay memory. The ISWeights should be inversely proportional to the priority (to counteract high-priority samples being reused a lot).
In the paper they mention that they use the order in the underlying array as a proxy for the priority (noted in line 218), so the final elements might not necessarily be in priority order.
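For reference, the relationship the paper defines (a hedged sketch using illustrative function names, not this repo's exact code) is p_i = 1/rank(i), P(i) = p_i^alpha / sum_j p_j^alpha, and w_i = (N * P(i))^-beta, normalised by the maximum weight so updates are only ever scaled down.

```lua
-- Hedged sketch of rank-based sampling probabilities and IS weights (paper's formulas).
local function rankProbabilities(N, alpha)
  local P, sum = {}, 0
  for rank = 1, N do
    P[rank] = (1 / rank) ^ alpha
    sum = sum + P[rank]
  end
  for rank = 1, N do
    P[rank] = P[rank] / sum  -- P(i) = p_i^alpha / sum_j p_j^alpha
  end
  return P
end

local function isWeights(P, sampledRanks, N, beta)
  local w, maxW = {}, 0
  for i, rank in ipairs(sampledRanks) do
    w[i] = (N * P[rank]) ^ (-beta)  -- rarely sampled transitions get larger weights
    if w[i] > maxW then maxW = w[i] end
  end
  for i = 1, #w do
    w[i] = w[i] / maxW  -- normalise so weights only ever scale updates down
  end
  return w
end
```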
The ISWeights should be inversely proportional to the priority
Will have to check further but that part might be failing in the end.
If I fill the heap via Experience:store, then update all the priorities using updateByVal, I end up with this:
```lua
local indices, ISWeights = experience:sample()
for i = 1, indices:size(1) do
  print("i: " .. i)
  print("indices[i]: " .. indices[i])
  print("ISWeight: " .. ISWeights[i])
  print("priority: " .. experience.priorityQueue.array[indices[i]][1])
  print("val: " .. experience.priorityQueue.array[indices[i]][2])
end
```
i: 1
indices[i]: 5136
ISWeight: 0.10489251370625
priority: 1.09
val: 4
i: 2
indices[i]: 650
ISWeight: 0.25466125260447
priority: 1.467
val: 83
i: 3
indices[i]: 5395
ISWeight: 0.3869674701564
priority: 1.272
val: 347
i: 4
indices[i]: 1517
ISWeight: 0.48500964026323
priority: 1.294
val: 751
i: 5
indices[i]: 5400
ISWeight: 0.5745158898887
priority: 1.311
val: 1340
i: 6
indices[i]: 3590
ISWeight: 0.65013024797926
priority: 1.204
val: 2045
i: 7
indices[i]: 176
ISWeight: 0.71405114882925
priority: 1.487
val: 2818
i: 8
indices[i]: 4633
ISWeight: 0.82566234942261
priority: 1.226
val: 4630
i: 9
indices[i]: 2934
ISWeight: 0.88447589334189
priority: 1.141
val: 5858
i: 10
indices[i]: 2231
ISWeight: 1
priority: 1.437
val: 8913
It looks fine to me - I just printed rankIndices, self.indices and w on line 244. rankIndices is what indexes into experience.priorityQueue.array - indices is the set of indices for the experience replay buffer itself (e.g. for self.states).
Ahh, I was using the wrong index. Priorities mostly check out when using rankIndex inside sample.
Although items are out of order until updateByVal is called, and even then a few items towards the end are still out of order:
After store, which sets the priority to findMax:
priority: 1.0076759999994
priority: 1.0079299999993
priority: 1.0062629999995
priority: 1.0053179999996
priority: 1.0081579999993
priority: 1.0070929999994
priority: 1.0050279999996
priority: 1.0080359999993
priority: 1.0034719999997
priority: 1.000243
After updateByVal, set to 1 + math.random(500)/1000:
priority: 1.5
priority: 1.498
priority: 1.494
priority: 1.452
priority: 1.469
priority: 1.396
priority: 1.321
priority: 1.172
priority: 1.24
priority: 1.151
Guess there is no guarantee that the items towards the end will be the lowest magnitude.
So a check would be to make sure updateByVal is called when needed, but it seems that is the case. And yes, items towards the end are generally lower priority but won't be ordered properly. Thanks a lot for running these tests - it looks like the low-level mechanics are working fine so far.
Although it's unlikely to show anything, it might be worth halving smallConst and testing to see if that has an impact on Catch? I'll see if I can also try that on Frostbite soon.
Thanks a lot for running these tests
Idiots make the best testers, hopefully my rubber-duck ignorance hits some gold sooner or later ;-)
I'm maxing out my RAM and running on CPU rather than GPU, otherwise I'd kick off some agents; maybe it's all actually working.
called when needed, but it seems that is the case
Could there be potential for it getting stuck visiting the same experiences, thus only updating those parts of the heap when updatePriorities is called in Agent and never calling updateByVal for some (or many) of them? That would leave parts of the heap broken, enough that they don't fit into a stratum in a way which sees them sampled in future (never having both upHeap and downHeap called).
If only idiots made the best developers too... But this really could benefit from a fresh pair of eyes.
That problem should only happen if the heap is broken in the first place - only inserted or updated elements need to be fixed (the calls to upHeap and downHeap should fix anything else that breaks as a result of them being called). So does using priorityQueue:insert to build the heap fail to preserve the heap property in your tests?
So does using priorityQueue:insert to build the heap fail to preserve the heap property in your tests?
Well it only gets to looking at the parent for an upHeap, which will push the greatest all the way to the top, but may leave other children of that now-swapped parent waiting for a downHeap. Guess as long as the sampling hits indexes from everywhere in the heap, they'll all eventually get an updateByVal and hence a downHeap.
which will push the greatest all the way to the top
Testing the index after each insert shows the new max going to the first position, and findMax works after a bunch of random inserts or updates.
Ah that's not an issue - check out how insert works in a binary heap. I'm fairly confident that the priority queue is fine now, so I've started going over the paper and my code more thoroughly. Made some changes, have a few more to go, and will hopefully get to start a test on Frostbite by the weekend.
Missing implementation detail to do - the heap-based array "is infrequently sorted once every 10^6 steps to prevent the heap becoming too unbalanced".
Also, in line 10 of algorithm 1 it is unclear whether the max IS weight is calculated over all samples or just the ones that were chosen.
Once again my math is weak, but noob eyes might spot something, and I can read it well enough to fall over trying.
Where does subscript i come from? Could that be indicating "all samples"? However, w is defined right there, only for experiences in the batch, so surely it's not already full of all possible weights.
Finding the lowest ISWeights would be easy enough via findMax(), but you'd want the opposite. They don't mention pre-computing the weights and storing them; if it were the overall max, I'd imagine they'd be stored much like the priorities, with the ability to recall the max without having to calculate it by looking at the whole list.
It's a bit ambiguous, but yes it could well be over just the batch. I printed out the difference between the full sample max and the batch max for a while, with the max difference being ~0.05. I don't think that should make much of a difference, so I'll switch back to batch max for now. Worked out a simple trick for getting the full sample max IS weight here - it comes from the smallest probability, which is the last element in the probability distribution.
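Since w_i = (N * P(i))^-beta decreases as P(i) grows, the full-sample max weight can be read straight off the smallest probability rather than scanning every weight (a minimal sketch, reusing the illustrative rank-based distribution above, where the last element has the smallest probability):

```lua
-- The largest IS weight corresponds to the smallest sampling probability,
-- which for the rank-based distribution is the last (lowest-ranked) element.
local function maxISWeight(P, N, beta)
  local minP = P[#P]          -- smallest probability in the distribution
  return (N * minP) ^ (-beta)
end
```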
I'm hoping that rebalancing the heap is the last part of the puzzle. A quick guess is that sorting the heap's underlying array and reconstructing the heap from that might just do the trick.
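A hedged sketch of that idea (illustrative names, assuming the {priority, value} array plus hash layout): sort the backing array by priority in descending order, which automatically satisfies the max-heap property because a parent at floor(i/2) always precedes its children at 2i and 2i+1, then rebuild the value-to-position hash.

```lua
-- Illustrative full rebalance for a {priority, value} max heap (not the repo's code).
-- A non-increasing array is already a valid max heap, since the parent at floor(i/2)
-- always comes before (and so outranks) its children at 2i and 2i + 1.
local function rebalance(queue)
  table.sort(queue.array, function(a, b) return a[1] > b[1] end)
  queue.hash = {}
  for i = 1, #queue.array do
    queue.hash[queue.array[i][2]] = i  -- value (replay index) -> new heap position
  end
end
```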
I was just made aware of your effort to reproduce our work and combine all the best bits, very cool, I'm very happy to see that!
We tried to put as much information into the paper as possible, but if we missed something, feel free to ask.
Concretely:
@schaul Hi, thanks for your great work!
I have a few questions about priority experience replay as well. If you have time, could you answer to all/some of them? Thank you!
1) Does the constant assigned to new transitions matter? I use 1000, for example, and all my errors are clipped to [-1, 1] before being squared inside the error.
2) I see jumps in my error values after each max-heap sort; is this normal? // Please see edit.
3) How often do I need to check whether the heap is unbalanced?
4) How do you remove elements when the heap is full? Currently I just remove the last element in the heap array.
5) To update the heap in a batch, do you need to keep a dictionary from element ID to its current position, or did you use some other method?
6) How bad is it for beta to reach 1.0 too soon, before the end of learning? Alpha stays the same.
7) I train my network every 4 added transitions and my minibatch size is 32, so there are 4 new and unvisited transitions in the heap for every batch update. Should I increase the minibatch size, decrease 4 to 2 or 1, or leave it be?
I don't participate in this repository, but I hope my questions will be helpful to anyone who was trying to find a priority experience replay implementation and ended up here :)
Edit: the jumps occur not after sorting, but after recomputing the boundaries while the heap is not full.
@avasilkov I'll attempt to answer a few questions based on what I've read in the paper/my implementation in this repo.
@Kaixhin Thank you for your answers! I really like your idea of uniting all these amazing papers into one crazy network!
What do you think about persistent advantage learning combined with double DQN? I'm not sure how to combine them; do they play along well?
2) About my second question, I asked it wrong. The jumps happen while the heap is not full and I call sort + recompute segment boundaries. So it doesn't happen after the memory is full, but after each boundary recompute there are jumps: error picture, imgur. Maybe it's because my replay memory doesn't serve transitions that were added between boundary computations and the network sees a lot of new errors? But then it should get back to normal pretty quickly...
3)
According to the appendix the heap is rebalanced every 1m steps for the Atari domain.
Yes, I know, but @schaul wrote
the heap can become rather imbalanced, check this and rebalance if necessary
Currently I'm rebalancing it every heap_size (800k) transitions.
4) So in your implementation it's FIFO order, and even if the oldest transition still has a large error, you replace it with the new one?
6) But what if I don't know when the end will be? I don't train it on Atari. How do you anneal it without knowing for how many steps you are going to anneal it?
7) I have one question that is not about experience replay: should I change the target network update from 10000 steps to 40000 steps because I train my network only every 4 steps?
Thanks again!
Edit: the error pic is averaged batch error taken every 500 steps.
@avasilkov According to one of the authors, PAL does not work well with double Q-learning (see notes in code).
Thank you, @Kaixhin ! I guess I will go and try FIFO order for removing. Maybe because of my way of removing elements I have stale transitions that only do harm. And thanks for the suggestion about not annealing beta parameter.
how about https://docs.python.org/2/library/queue.html#Queue.PriorityQueue ? :-)
and how about Python? at least the array index there is 0-based :-)
I read this thread, and it seems you are struggling to get the low-level code working correctly.
From a language perspective, Python is better than Lua; on top of that, Python has much bigger communities and (standard) libraries. So you would not need to worry about this kind of low-level code, or bugs that hinder your main work on machine learning.
As you mentioned, this is probably the only public attempt at prioritised experience replay. It's a pity to see your progress blocked by such low-level issues.
I have a suggestion: how about joining forces with others? e.g.
https://github.com/tambetm/simple_dqn
to "stand on the shoulders of giants" :-)
That Python code is well written, and easy to understand / extend.
I'd really like to see the relatively small ML community work together to make a joint effort and build one good & complete implementation, instead of many fragmented ones.
Just my 2 cents.
@mingwugmail I don't want to start a debate about the pros and cons of Torch vs Theano, but some points relevant to this repository:
I just want to provide more info:
simple_dqn is based on Neon, not Theano, and from its doc
http://www.nervanasys.com/deep-reinforcement-learning-with-neon/
"I chose to base it on Neon, because it has
- the fastest convolutional kernels,
- all the required algorithms implemented (e.g. RMSProp),
- a reasonable Python API."
and the convnet benchmarks
https://github.com/soumith/convnet-benchmarks
A mature library + the fastest implementation will all speed up your experiments.
I have a funny non-stationary domain which gives prioritised replay a bit of a workout. No real results to compare against, but I have noticed the behaviour change.
There are 3 actions: left, right, noop. During training without prioritised replay, noop will sit very close to random (33%), decreasing as epsilon progresses. However, with prioritised replay it will jump all over the place, for example per 10k actions: 17.1%, 33.2%, 8.8%, 17.8%, 65.2% etc. During evaluation it settles down a bit and ends up just above 0% (dropout doubling seems to be pushing other actions up above the noop reward).
@mryellow Actually @lake4790k spotted an issue to do with indexing from the priority queue to the experience replay buffer - I've only had time to throw together a quick patch on a new per branch and set off one test, but best to look at that branch for now.
I've also now spotted another issue, which is that terminal states shouldn't be stored in the queue at all. In per I've given them a priority of 0, which I don't think should make too much of a difference, but I'll need to correct this eventually too.
Edit 2016-06-04: First test on the per branch with Frostbite looks to be successful (see below). A working version of prioritised experience replay will be merged in after refactor is finished.
Requires a "sum tree" binary heap for efficient execution.
Edit 2016-06-02: Please keep to the contributing guidelines.