zsunberg opened 4 years ago
I should be able to, depending on the specific wishlist. Is there one?
Cool. Here is what I can think of:
Ok. I may have spoken a bit too soon about my cycles. I sprained my right arm a couple of days ago and it's been annoying me longer than I expected. So I may not be able to type at a high frequency right now. I am happy to consult and make suggestions if there is someone who can type. This also might go away in a couple of days. However, because of the lockdown I can't get a scan to test how serious it is right now.
Ah, man, I never thought we would have to deal with injuries in programming, haha. I may also have a bit of time to take a look at it.
I thought of one more thing:
Hehe you need a wider observation space ;) It's been improving rapidly so I may be able to start typing over the weekend. I'll spend some time familiarizing myself with the repo meanwhile.
Does Andrew have some documentation about the use case?
Sorry to hear about that @Shushman, get well soon :( and make sure to rest it!
For the model we're working on I don't necessarily need the package to be all fixed up; as long as it has the right functionality I'm happy to make it work (and I can try to speed it up to the best of my abilities).
The details of the use case are the following; let me know if I can provide any others:
Also, I would appreciate advice on what the right way to implement the bounds is. In the tests they are written like:
```julia
function lower_bound(lb::LightDark1DLowerBound, p::LightDark1D, s::LightDark1DState)
    _, _, r = gen(DDNOut(:sp, :o, :r), p, s, init_lower_action(p), lb.rng)
    return r * discount(p)
end

function upper_bound(ub::LightDark1DUpperBound, p::LightDark1D, s::LightDark1DState)
    steps = abs(s.y) / p.step_size + 1
    return p.correct_r * discount(p)^steps
end
```
Can someone provide intuition for what these two functions are doing? Also, is using something naive like a random policy for lower bound, and R_max = 1 / (1 - gamma) for upper bound going to work?
Thanks Andrew! It is feeling better so I can take a look later today or tomorrow. I'll have to reskim the MCVI paper to familiarize myself with some of the nuts and bolts. As for the bounds, I think the specific functions are somewhat domain-dependent, so @zsunberg may be best suited to explain them. In general, yes, it is quite common to use a random policy for a lower bound, while for the upper bound, we typically use QMDP or FIB to get a somewhat informed upper bound (they assume a simpler POMDP so their optimal solution is an upper bound). I think the upper bound you are suggesting should actually be R_max / (1 - gamma) to be an upper one (albeit less informed).
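To spell out why that correction works: the discounted return of any policy is bounded by collecting the maximum one-step reward at every step, so

```latex
V^{*}(b) \;=\; \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_t\Big]
\;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
\;=\; \frac{R_{\max}}{1-\gamma}.
```

Hence R_max / (1 - gamma) is always a valid (if loose) upper bound, and it reduces to 1 / (1 - gamma) only in the special case R_max = 1.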
Good to hear! And thanks so much.
Yep you're right, I have an = instead of a * in my above comment :)
@zsunberg is this the "assume beliefs are states and solve the MDP" thing you were explaining to me earlier?
I'm not as familiar with this algorithm, so I will need to dig deeper before I'm sure exactly what the bounds mean (maybe tonight).
The QMDP bound that Shushman mentions is indeed the solve the MDP upper bound that I mentioned.
However, I think these bounds are different. They appear to be only functions of the state.
Here are my thoughts:
Here is what I would recommend to start with:

- Lower Bound: the best analytical lower bound that you can come up with for using the "always lock down" policy from the current state.
- Upper Bound: imagine that there is a magical action that has the economic cost of the most permissive action, but the health effectiveness of the complete lockdown. Use an analytical estimate of the value of always taking this action from the current state.
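Both recommendations amount to evaluating a fixed-action policy analytically. For a fixed action a, the value of always taking it from state s is

```latex
V_{a}(s) \;=\; \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, R(s_t, a)\Big].
```

The lock-down policy's value is achievable by some policy, so it lower-bounds V*(s); the magical action weakly dominates every real action in both reward components, so always taking it upper-bounds V*(s).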
Ok so I'm feeling better, and maybe, for now, we should prioritize point 3 from the initial checklist of @zsunberg ? In that case, @andrewilyas could you tell me the specific issue you've been facing to get MCVI to work on your problem? I see that you have used POMCPOW to some extent, so you do have the POMDP model defined already. What's the issue running MCVI then? Is it not giving you a useful solution?
A brief commentary on the bounds:
From Sec 3.3 of the original paper, the lower bound is computed from the policy graph by "...always performing a single fixed action". Furthermore, to obtain the initial upper bound, they use "...standard relaxation techniques...assuming fully observable, we can solve a corresponding MDP, whose value function provides an upper bound." So they do use the same bounds that we talked about (MCVI is largely derived from SARSOP).

The functions in MCVI.jl are state-dependent because (also from the paper) "the upper and lower bounds in our algorithm are obtained via sampling and MC simulations, and are thus approximate". Therefore, the bounds are defined in terms of the particle states that comprise the belief. We can see from @andrewilyas's post above that the `lower_bound` function samples the effect of `init_lower_action` for `LightDark1D` on a single state (although I'm not sure why it is multiplied by the discount; that might be domain-dependent logic). The `upper_bound` also appears to be domain-dependent in its logic, but you can see how it computes `steps` based on information in the `s` state particle, and then the bound is some function of `steps` and `correct_r`, whatever that means.

So, in sum, this might just be a long-winded way of seconding what Zack said in his most recent post 😄
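For concreteness, here is a rough Julia sketch of the two naive bounds discussed earlier in the thread (a single-rollout lower bound and the R_max/(1-γ) upper bound). The names and the `step!` helper are illustrative assumptions, not MCVI.jl's actual interface, which additionally dispatches on bound and problem types as in the snippet above.

```julia
# Hypothetical sketch (not MCVI.jl's actual API): the two naive bounds
# discussed in this thread, written generically.

# Upper bound: no policy can do better than collecting the maximum
# one-step reward r_max forever, so V* <= r_max / (1 - γ).
naive_upper_bound(r_max, γ) = r_max / (1 - γ)

# Lower bound: the discounted return of a single rollout of some
# (e.g. random) policy. `step!` is an assumed helper that samples
# (s′, r) from the model; any achievable return lower-bounds V*
# in expectation.
function rollout_lower_bound(step!, s0; γ=0.95, horizon=100)
    total, disc, s = 0.0, 1.0, s0
    for _ in 1:horizon
        s, r = step!(s)
        total += disc * r
        disc *= γ
    end
    return total
end
```

In MCVI.jl these would be evaluated per state particle and averaged over the belief, which is exactly why the test functions above take a single `s`.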
Right now it takes 76 seconds to solve the light-dark problem on my machine - I bet we could speed it up a lot!
Hi all, I just talked to @andrewilyas who is working on a coronavirus decision support tool. He has gotten some traction with POMCPOW, but I think it would be even better if we could use MCVI to get a finite state machine.
Unfortunately this package is one of our dustiest. Do any ace Julia programmers have some time to dust it off and maybe add multi-threading? @rejuvyesh @MaximeBouton @Shushman ? @mykelk is there anyone else who might want to do it?
The model has essentially continuous states, but I think it would be fine to discretize the action space, so this might be a very good solver.