SciML / JumpProcesses.jl

Build and simulate jump equations like Gillespie simulations and jump diffusions, with constant and state-dependent rates, and mix them with differential equations and scientific machine learning (SciML)
https://docs.sciml.ai/JumpProcesses/stable/

Interface for Jump simulations seems non-ideally messy? #360

Open TorkelE opened 9 months ago

TorkelE commented 9 months ago

Currently, to make a Gillespie simulation, I first make a DiscreteProblem, then a JumpProblem, and then simulate it. For almost all other types of simulations, I just make a single problem. I think the reason was that (a long time ago) the idea was to neatly combine Gillespie simulations with ODEs? I cannot speak for everyone, but my impression is that this Jump/ODE combination is very rare compared to people making pure Gillespie simulations. Another oddity is that what I would think of as the main solver (e.g. Direct()) is supplied to the JumpProblem and not, as in almost all other cases, to the solve command (there are some technical reasons for this though, to do with what information needs to be created?). Finally, I have never used anything but SSAStepper() in `solve`. Is there a reason that this is not a default argument, given how common it is?
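For reference, the current workflow is roughly the following (a minimal sketch using Catalyst; the toy birth-death network and all of the values are placeholders):

using Catalyst, JumpProcesses

# a placeholder birth-death model
rn = @reaction_network begin
    b, 0 --> X
    d, X --> 0
end
u0 = [:X => 10]
tspan = (0.0, 10.0)
p = [:b => 1.0, :d => 0.1]

# the current two-step setup: a DiscreteProblem wrapped in a JumpProblem,
# with the aggregator (Direct) given to JumpProblem and the stepper to solve
dprob = DiscreteProblem(rn, u0, tspan, p)
jprob = JumpProblem(rn, dprob, Direct())
sol = solve(jprob, SSAStepper())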

It seems like the ideal workflow would be (here using Catalyst):

rn = @reaction_network begin ... end
u0 = [...]
tspan = (...)
p = [...]

jprob = JumpProblem(rn, u0, tspan, p)
sol = solve(jprob, Direct())

(possibly with some automatic algorithm selection, making the Direct() argument optional)

I imagine that all of this might not be directly possible (and as a change, it would be very breaking). However, is there really a good reason for all of these peculiarities in the Gillespie simulation set-up?

(This is something that comes up whenever I show people how to do systems bio in Julia. "This is how to do ODE", "This is how to do SDE", and "Now we are going to do Gillespie, but we have to make these couple of modifications. No, I have no idea why we do this, it just needs to be done")

ChrisRackauckas commented 9 months ago

I think the reason was that (long time ago) the idea was to neatly combine Gillespie simulations with ODEs? I cannot speak for everyone, but my impression is that this Jump/ODE combination is very rare, compared to people making pure Gillespie simulations.

In the future it should be the most common way to do things. Purely jump forms are expensive without too much of a gain. Hybrid methods should be the future. We just don't have the right tools to make that the case right now. But it's our next step now that we've gotten the foundation.

Another oddity is that what I would think is the main solver (e.g. Direct()) is supplied to the JumpProblem, and not like in almost all other cases, to the solve command (there are some technical reasons for this though, with what information needs to be created or something?).

That's not a solver. It's how jumps are aggregated into a single stochastic process. It's not required that jumps actually do that (for example, there are tau-leaping methods).
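To make that separation of concerns concrete (a minimal sketch, reusing the usual DiscreteProblem wrapper from above):

# the aggregator tells JumpProblem how to combine the jumps into one
# stochastic process, i.e. how to sample which jump fires next and when
jprob = JumpProblem(rn, dprob, Direct())
# the time stepper passed to solve merely advances that process in time
sol = solve(jprob, SSAStepper())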

Yes, this should have a default algorithm on it. JumpProcesses.jl is just a bit behind on that. We have all of the data from the benchmarks to put together a nice algorithm defaulter; it just hasn't been done yet.
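As a hypothetical sketch of the shape such a defaulter could take (made up for illustration, not the actual implementation; the real heuristic would be fit to the SciMLBenchmarks data):

# made-up heuristic: Direct for small systems, a dependency-graph
# aggregator (here RSSA) once a dependency graph is available and the
# system is large enough for the graph bookkeeping to pay off
function default_aggregator(num_jumps, has_dependency_graph)
    if num_jumps < 10 || !has_dependency_graph
        return Direct()
    else
        return RSSA()
    end
end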

ChrisRackauckas commented 9 months ago

Finally, I have never used anything but SSAStepper() in `solve`. Is there a reason that this is not a default argument, given how common it is?

That should be the default algorithm when it's safe. Safety here requires that the solution is truly discrete, meaning that you have a DiscreteProblem with the identity map and no continuous callbacks. Double check that there are no other requirements? In that case it should default to using the SSAStepper, yes.
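Written as a predicate, that safety condition might look something like this (a sketch; the two accessor functions are hypothetical, not actual JumpProcesses.jl API):

# hypothetical check: SSAStepper is only safe for purely discrete problems
function safe_for_ssastepper(jprob)
    jprob.prob isa DiscreteProblem &&       # no ODE/SDE dynamics underneath
        !has_variable_rate_jumps(jprob) &&  # hypothetical accessor
        !has_continuous_callbacks(jprob)    # hypothetical accessor
end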

It seems like the ideal workflow would be (here using Catalyst):

No, that doesn't make sense. Direct() isn't a method that solves a generalized jump problem, and the suggestion seems to misunderstand the reasoning behind jump aggregation and the separation of concerns there. But:

rn = @reaction_network begin ... end
u0 = [...]
tspan = (...)
p = [...]

jprob = JumpProblem(rn, u0, tspan, p)
sol = solve(jprob)

is possible too, and that's what it should be instead. I don't think anyone would have qualms with that. Besides, people should almost never choose Direct if we have a DSL creating the dependency graph for them; it's only for super small problems.
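For example, with a Catalyst model the dependency graph comes for free, so a dependency-graph aggregator is usually the better pick for anything beyond small networks (a sketch; RSSA is one such aggregator, SortingDirect is another):

# Catalyst builds the dependency graph, so graph-based aggregators just work
jprob = JumpProblem(rn, dprob, RSSA())
sol = solve(jprob, SSAStepper())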

isaacsas commented 9 months ago

@Vilin97 has an open PR that I haven’t had a chance to review yet which addresses default aggregators.

ChrisRackauckas commented 9 months ago

(This is something that comes up whenever I show people how to do systems bio in Julia. "This is how to do ODE", "This is how to do SDE", and "Now we are going to do Gillespie, but we have to make these couple of modifications. No, I have no idea why we do this, it just needs to be done")

That vision isn't far enough. I want people to just write down models and get a hybrid ODE/SDE + tau-leaping + jump super algorithm that would take years to explain, but have it solve entire cell simulations in a second. That's what this should do:

rn = @reaction_network begin ... end
u0 = [...]
tspan = (...)
p = [...]

jprob = JumpProblem(rn, u0, tspan, p)
sol = solve(jprob)

using whatever the best simulation algorithm is. Right now we cannot do this. We are designed in such a way that the fundamentals all exist and are fast, but the next step is beyond the fundamentals. We will always keep "make the ODE version" and "make the Gillespie version", but I don't want that to be the norm. Those won't be the best algorithms; that's just what we have today, and I don't want to tunnel-vision the API to the current limitations.

isaacsas commented 9 months ago

One related thing that needs to be added at the MTK level is setting up a simpler workflow when there are VariableRateJumps.
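For reference, the current VariableRateJump workflow has to go through a continuous-time integrator rather than SSAStepper (a sketch, assuming a network whose rates depend explicitly on time or on continuously changing state):

using OrdinaryDiffEq

# VariableRateJumps need a continuous integrator, so the base problem is an
# ODEProblem and solve is given an ODE solver instead of SSAStepper
oprob = ODEProblem(rn, u0, tspan, p)
jprob = JumpProblem(rn, oprob, Direct())
sol = solve(jprob, Tsit5())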

TorkelE commented 9 months ago

So, I agree that having hybrid simulations as the default should be the norm and possible (and have a good interface). I still think that for teaching people about CRNs it is nice to start with Gillespie.

For cases where simulations take <0.1 seconds and we only really want one (which I'd say covers many, many cases), it doesn't really matter anyway, and I prefer using Gillespie to keep things simple. That said, I am all for full steam ahead on proper hybrid solvers and support for them.

ChrisRackauckas commented 9 months ago

So, I agree that having hybrid simulations as the default should be the norm and possible (and have a good interface). I still think that for teaching people about CRNs it is nice to start with Gillespie.

I very much disagree because of experience here. I'll say I used to think that was the case, but I've learned from many, many issues that there is a major Dunning-Kruger effect when it comes to solvers. People generally learn a very simple heuristic about solvers ("Direct is better than NRM") and then make a very simplistic decision about solvers ("Therefore I always use Direct!"). They will tag along with whatever the first algorithm choice in the documentation is. This is why, over the years, I have aggressively made more and more of the DifferentialEquations.jl (and other) solver documentation use the default algorithms. I used to see a heck of a lot more issues like "Why is Tsit5 so bad on this problem?", where the problem is obviously stiff or needs lower tolerances or something else outside of the small unique range where Tsit5 should actually be chosen, but the user has hardcoded the algorithm anyway. When I started to recommend the default algorithm, I started receiving orders of magnitude fewer issues and Discourse posts whose solution was "you're using the wrong algorithm... why...?".

It's very easy to think "the benchmarks are there, go to the SciMLBenchmarks", but in practice the people who have the time to study the results of the hundreds of different runs are the developers. Everyone else is going to look at one picture and make a snap judgement, but solvers are more difficult than that so the snap judgement is wrong.

Anyways, what I can guarantee you is that if you start someone by saying "here's how you use Gillespie's method", they will come back to you a year later and say "why is Catalyst so bad at scaling to 10,000 reactions?", and we'll say "why are you still using Direct?". People will use exactly what you show them. The majority of people will not change the solver from what you gave them in the tutorial, they will copy that verbatim and recite the holy bible of the tutorial in every case they see from that point. If you point them to Direct in the first example, they will use Direct for everything. Even if the second example is "now Direct is only for small equations, so let's see ...". No, they will go to the first example, see that's how "it's supposed to be done", copy paste, and then ask about why the performance isn't what they wanted. I have seen this a million times and we have the analytics to show this.

This is the reason that the first tutorial should just be completely algorithm agnostic.

rn = @reaction_network begin ... end
u0 = [...]
tspan = (...)
p = [...]

dprob = JumpProblem(rn, u0, tspan, p)
sol = solve(jprob)

How things are solved is a detail. It's cool to us because we are mathematicians and developers, but it's unnecessary. It's taught to systems biologists today because it's a necessary detail of how to effectively do simulation work, but that is because of a lack of tooling. It's similar to how people used to be taught the physics of centrifuges (I know someone whose entire PhD thesis was on this topic): today everyone's lab has a centrifuge, you click the button, and it works reliably enough that nobody gets a two-week module on how centrifuges work in biophysics class. Anything that can be cut from the curriculum will be; there's too much to teach. Right now people learn Gillespie's method because the software was so shitty that most systems biologists would just code up Gillespie's method by themselves, so it's lore that's passed on to the next generation. When we break this cycle by having not only a nice front end for all of this simulation but also better algorithms that should be used in the majority of cases, the exact details of what's going on will become the topic of just a niche group, just like how nobody today studies the control algorithms and tweaking details of centrifuges.

I think we are well on the right path to obsoleting the teaching of our own subject to the general scientist community, the same way the true inner workings of stiff ODE solvers are a subject that almost no one ever learns because people just say "ode15s does something complex and good, so use it". In the next generation, hopefully most biologists won't even need to know what a stiff ODE is; the solver can take care of that. Our graduate students who are interested in how the simulation works can open up the breadth of papers that describe the underlying algorithms, but I would be happy if we ever get to the point where the rest of the scientists just don't have to care, because it all "just works".

SciML's goal is to do this for all of its solver domains, along with all of the details like the connection to automatic differentiation, inverse problems, and the model-discovery aspects. Nobody should have to know what a UDE is; that's just how Catalyst autocompletes the model for you when you give it data. It's a 10-20 year plan, but it's what I want for the future: declare the scientific model and let a fully automated system simulate it at 100% efficiency with the best algorithms, automatically handling all of the connections to data. All of its APIs are geared towards allowing this reality.