JuliaPOMDP / MCVI.jl


Dust this off for coronavirus? #14

Open · zsunberg opened this issue 4 years ago

zsunberg commented 4 years ago

Hi all, I just talked to @andrewilyas who is working on a coronavirus decision support tool. He has gotten some traction with POMCPOW, but I think it would be even better if we could use MCVI to get a finite state machine.

Unfortunately this package is one of our dustiest. Do any ace Julia programmers have some time to dust it off and maybe add multi-threading? @rejuvyesh @MaximeBouton @Shushman? @mykelk, is there anyone else who might want to do it?

The model has essentially continuous states, but I think it would be fine to discretize the action space, so this might be a very good solver.
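
Purely illustrative (CovidPOMDP and the action names below are made up): discretizing the action space in POMDPs.jl can be as simple as having actions return a small finite set.

using POMDPs

# Hypothetical model: Float64 state, a few discrete intervention levels as
# actions, Float64 observation
struct CovidPOMDP <: POMDP{Float64, Symbol, Float64} end

POMDPs.actions(::CovidPOMDP) = (:no_restrictions, :partial_restrictions, :full_lockdown)
POMDPs.discount(::CovidPOMDP) = 0.95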

Shushman commented 4 years ago

I should be able to, depending on the specific wishlist. Is there one?

zsunberg commented 4 years ago

Cool. Here is what I can think of:

  1. From the tests, it appears that bounds are involved, but they are not documented anywhere. Ideally you should be able to give bounds to the solver constructor as functions, so that you can experiment with different bounds.
  2. Add a minimal working example to the docs.
  3. Make sure it runs as expected with the actual problem. @andrewilyas, would you be willing to share your model code with us so that we can test it?
  4. Run a profiler to see if there are any major bottlenecks (see the rough sketch after this list).
  5. If it is too slow, think about multi-threading.
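
For item 4, a rough first pass could just use Julia's built-in Profile standard library (the solver and problem construction here are placeholders, not the actual setup):

using Profile
using POMDPs, MCVI

# pomdp = ...                 # the actual problem definition
# solver = MCVISolver(...)    # constructed with whatever options are needed

@time policy = solve(solver, pomdp)   # first call includes compilation time
Profile.clear()
@profile solve(solver, pomdp)
Profile.print(mindepth=5)             # inspect the hottest call paths
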
Shushman commented 4 years ago

Ok. I may have spoken a bit too soon about my cycles. I sprained my right arm a couple of days ago and it's been bothering me for longer than I expected, so I may not be able to type at a high frequency right now. I am happy to consult and make suggestions if there is someone who can type. This might also go away in a couple of days; however, because of the lockdown I can't get a scan to check how serious it is right now.

zsunberg commented 4 years ago

Ah, man, I never thought we would have to deal with injuries in programming, haha. I may also have a bit of time to take a look at it.

I thought of one more thing:

  6. Add a way to enable or disable the observation classifier
Shushman commented 4 years ago

Hehe, you need a wider observation space ;) It's been improving rapidly, so I may be able to start typing over the weekend. I'll spend some time familiarizing myself with the repo in the meantime.

Does Andrew have some documentation about the use case?

andrewilyas commented 4 years ago

Sorry to hear about that @Shushman, get well soon :( and make sure to rest it!

For the model we're working on, I don't necessarily need the package to be all fixed up; as long as it has the right functionality, I'm happy to make it work (and I can try to speed it up to the best of my abilities).

The details of the use case are the following; let me know if I can provide any others:

andrewilyas commented 4 years ago

Also, I would appreciate advice on what the right way to implement the bounds is. In the tests they are written like:

function lower_bound(lb::LightDark1DLowerBound, p::LightDark1D, s::LightDark1DState)
    _, _, r = gen(DDNOut(:sp,:o,:r), p, s, init_lower_action(p), lb.rng)
    return r * discount(p)
end

function upper_bound(ub::LightDark1DUpperBound, p::LightDark1D, s::LightDark1DState)
    steps = abs(s.y)/p.step_size + 1
    return p.correct_r*(discount(p)^steps)
end

Can someone provide intuition for what these two functions are doing? Also, is using something naive like a random policy for lower bound, and R_max = 1 / (1 - gamma) for upper bound going to work?

Shushman commented 4 years ago

Thanks Andrew! It's feeling better, so I can take a look later today or tomorrow. I'll have to reskim the MCVI paper to familiarize myself with some of the nuts and bolts. As for the bounds, I think the specific functions are somewhat domain-dependent, so @zsunberg may be best suited to explain them. In general, yes, it is quite common to use a random policy for the lower bound, while for the upper bound we typically use QMDP or FIB to get a somewhat informed one (they solve a relaxed version of the POMDP, so their optimal value is an upper bound). I think the upper bound you are suggesting should actually be R_max / (1 - gamma) for it to be a valid (albeit less informed) upper bound.
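
For reference, here is a rough, untested sketch of those naive bounds, following the state-dependent lower_bound/upper_bound signature from the tests quoted above; MyPOMDP, MyState, and R_MAX are placeholders for the actual model:

using POMDPs, Random

struct RandomRolloutLB
    rng::MersenneTwister
end

function lower_bound(lb::RandomRolloutLB, p::MyPOMDP, s::MyState)
    # discounted return of one short random rollout from s (a noisy lower estimate)
    total, disc = 0.0, 1.0
    for _ in 1:50
        a = rand(lb.rng, collect(actions(p)))
        s, _, r = gen(DDNOut(:sp, :o, :r), p, s, a, lb.rng)
        total += disc * r
        disc *= discount(p)
        isterminal(p, s) && break
    end
    return total
end

struct RmaxUB end

# R_max / (1 - gamma): what you would get if the best possible reward
# were received at every step forever
upper_bound(::RmaxUB, p::MyPOMDP, s::MyState) = R_MAX / (1 - discount(p))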

andrewilyas commented 4 years ago

Good to hear! And thanks so much.

Yep you're right, I have an = instead of a * in my above comment :)

@zsunberg is this the "assume beliefs are states and solve the MDP" thing you were explaining to me earlier?

zsunberg commented 4 years ago

I'm not as familiar with this algorithm, so I will need to dig deeper before I'm sure exactly what the bounds mean (maybe tonight).

The QMDP bound that Shushman mentions is indeed the solve the MDP upper bound that I mentioned.

However, I think these bounds are different. They appear to be only functions of the state.

Here are my thoughts:

Here is what I would recommend to start with:

Lower Bound: the best analytical lower bound that you can come up with for the value of using the "always lock down" policy from the current state.

Upper Bound: imagine that there is a magical action that has the economic cost of the most permissive action, but the health effectiveness of the complete lockdown. Use an analytical estimate of the value of always taking this action from the current state.
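
To make that concrete, here is a hedged sketch of what such analytical bounds could look like; pessimistic_lockdown_reward and optimistic_magic_reward are hypothetical per-step reward estimates that would have to come from the actual model (MyPOMDP and MyState are placeholders as before):

struct LockdownLB end
struct MagicActionUB end

# Lower bound: value of "always lock down" from s, approximated as a
# pessimistic per-step reward under that policy received forever
lower_bound(::LockdownLB, p::MyPOMDP, s::MyState) =
    pessimistic_lockdown_reward(p, s) / (1 - discount(p))

# Upper bound: a fictitious action with the economic cost of the most
# permissive action but the health effectiveness of a full lockdown,
# taken forever (optimistic per-step reward)
upper_bound(::MagicActionUB, p::MyPOMDP, s::MyState) =
    optimistic_magic_reward(p, s) / (1 - discount(p))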

Shushman commented 4 years ago

Ok, so I'm feeling better, and maybe, for now, we should prioritize point 3 from @zsunberg's initial checklist? In that case, @andrewilyas, could you tell me the specific issue you've been facing in getting MCVI to work on your problem? I see that you have used POMCPOW to some extent, so you do have the POMDP model defined already. What's the issue running MCVI, then? Is it not giving you a useful solution?

Shushman commented 4 years ago

A brief commentary on the bounds:

From Sec 3.3 of the original paper, the lower bound is computed from the policy graph "...always performing a single fixed action." Furthermore, to obtain the initial upper bound, they use "standard relaxation techniques": "...assuming fully observable, we can solve a corresponding MDP, whose value function provides an upper bound..." So they do use the same bounds that we talked about (MCVI is largely derived from SARSOP).

The functions in MCVI.jl are state-dependent because (also from the paper) "the upper and lower bounds in our algorithm are obtained via sampling and MC simulations, and are thus approximate." Therefore, the bounds are defined in terms of the particle states that comprise the belief. We can see from @andrewilyas's post above that the lower_bound function samples the effect of init_lower_action for LightDark1D on a single state (although I'm not sure why it is multiplied by the discount; that might be domain-dependent logic). The upper_bound also appears to be domain-dependent in its logic, but you can see how it computes steps based on information in the s state particle, and then the bound is some function of steps and correct_r, whatever that means.
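
(Purely illustrative: with a particle-based belief, a belief-level bound could then be approximated by averaging the state-dependent bounds over the particles; particles(b) is assumed to return the belief's state particles.)

using Statistics

belief_lower_bound(lb, p, b) = mean(lower_bound(lb, p, s) for s in particles(b))
belief_upper_bound(ub, p, b) = mean(upper_bound(ub, p, s) for s in particles(b))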

So, in summation, this might just be a long-winded way of seconding what Zack said in his most recent post 😄

zsunberg commented 4 years ago

Right now it takes 76 seconds to solve the light-dark problem on my machine - I bet we could speed it up a lot!