Sparse identification proper workflow!

aelmokadem commented 3 years ago

Hi,

I have a general question about the proper workflow to use when recovering unknown dynamics.

My understanding is that if we want to recover Michelis Menten dynamics we would use the implicit sparse identification! But what if we don't know if the missing dynamics would follow Michelis Menten or not? What would be the workflow then? Do we start with explicit sindy then switch? What if the explicit result was close enough to the observed data? What sort of diagnostic metrics would we use to make the decision to switch to implicit sindy? Is implicit sindy a global algorithm that we can just always start with for any problem or do we use both explicit and implicit approaches on any problem and compare?

Thanks.

AlCap23 commented 3 years ago

Hi! Thanks for your interest and welcome!

This is a difficult question that has no answer in the sense of a programmable approach but relies heavily on heuristics.

If you already know you are dealing with a specific model, e.g. Michaelis Menten, then you are dealing with Parameter Identification, suitable for packages like DiffEqFlux.

If not, then you have to leverage intuition and domain-specific knowledge about the process at hand. For example, Michaelis-Menten uses a feedback loop that effectively limits the state. This is quite common in domains like chemistry, pharmacy, biology, and even mechanics. Hence, we might recover the model via the implicit method. So anytime a DAE system pops up, I would switch to these methods. If I do not know anything about these relations in advance or you assume the effect to be neglectable, just use explicit approaches and check the AICC and L2 Norm error. As a rule of thumb, you should use both and check which model fits your data the best.

Try to capture everything you know inside the basis. Your system ( like a pendulum ) lives on a Torus -> use trigonometric functions inside the basis. My domain is strictly positive, maybe a log transform makes sense. These are semi-automatic approaches, which have not been implemented but are planned to.

Everything relies on the basis of functions you choose. Choosing these is not trivial, highly domain-specific, and in general NP-hard.

If the model is identifiable over the collected data, then SINDy should be able to identify it if the right basis functions are present. But this is quite a big if :).

I hope this is at least somewhat helpful.

ChrisRackauckas commented 3 years ago

And some of the methods don't require a basis.

I think we need to get a lot more practice with the whole stack before recommending "try optimizer X then Y, but use Z if you have W many equations".

aelmokadem commented 3 years ago

Thanks @AlCap23 and @ChrisRackauckas for the detailed reply and input. This was really helpful and totally agrees with my initial intuition that there is no simple answer here. I also totally agree that a lot more practice is needed before making recommendations.

@AlCap23 It is interesting you brought up the basis here because I was thinking that one way to use explicit sindy and still capture saturable kinetics as MM would be to include a time-dependent control signal in the basis. So if the system is a simple MM like: Vm*u[1] / (Km + u[1]) Then in theory including a control signal in the form of Vm/(Km + u[1]) could potentially be able to retrieve the dynamics !! If the saturable dynamics are not really that influential in capturing the data, then sindy would simply recover linear dynamics !! Is this a correct conclusion or am I off here?

AlCap23 commented 3 years ago

Since we are using a general linear model in SINDy, you would have to provide something which adds up to the true term. If you would include a control signal as you proposed, e.g. c(t) = V / (K + u(t)) you still wouldn't recover du(t) = V*u(t) / (K+u(t)). Adding u(t)*c(t)would work, but than you have your system modeled already. But maybe I am misunderstanding what you're trying to do.

aelmokadem commented 3 years ago

Yes that's what I meant. To add c(t) = V / (K + u(t)) with u(t)*c(t). This would add to the basis the MM dynamics in addition to the linear terms so the algorithm would pick the best one to describe the data which does not require the user to know before hand if she is dealing with linear or MM dynamics !!

SciML / DataDrivenDiffEq.jl

Sparse identification proper workflow! #244