LRydin / KramersMoyal

kramersmoyal: Kramers-Moyal coefficients for stochastic data of any dimension, to any desired order
MIT License

Bandwidth? #13

Closed: rodrigomp84 closed 2 years ago

rodrigomp84 commented 3 years ago

Dear authors,

I'm sorry to pollute the GitHub project, but could you comment a little on the bandwidth parameter? It's the only free parameter and it seems to have an important influence on the results, but it is discussed neither in the docs nor in the paper.

Thank you for this great work, best

LRydin commented 3 years ago

Hey @rodrigomp84, thanks for the question :).

Yeah, the bandwidth (bw in the code) is the only free parameter. It is comparable to the choice of bins in a histogram. In fact, this is something widely discussed in statistics, see e.g. Kernel density estimation (kde). The idea behind it is to forgo classical disjoint bins, i.e., placing things in separate boxes, in favour of a far smoother version: place a kernel (whichever type you prefer) on each data point and sum them up to get the final density function. This is called non-parametric estimation, as it does not depend on the parameters of the model/process you are using.
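To make the histogram analogy concrete, here is a minimal sketch in plain NumPy. It is only an illustration of the idea, not the code used inside kramersmoyal: the helper `gaussian_kde_1d`, the sample data, and the two bandwidth values are all made up for this example.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bw):
    """Place one Gaussian kernel of width bw on each sample and sum them."""
    # z[i, j] = scaled distance between grid point i and sample j
    z = (grid[:, None] - samples[None, :]) / bw
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    # Normalise by the number of samples and the bandwidth so the
    # result integrates to one, like a density histogram does.
    return kernels.sum(axis=1) / (samples.size * bw)


rng = np.random.default_rng(0)
samples = rng.normal(size=500)          # hypothetical data
grid = np.linspace(-6.0, 6.0, 601)
dx = grid[1] - grid[0]

# Classical disjoint bins for comparison.
hist_density, edges = np.histogram(samples, bins=20, range=(-6.0, 6.0), density=True)

# A small bandwidth gives a noisy estimate, a large one a smooth estimate,
# much like choosing many narrow bins versus a few wide ones.
kde_narrow = gaussian_kde_1d(samples, grid, bw=0.05)
kde_wide = gaussian_kde_1d(samples, grid, bw=0.5)

print("integral (narrow):", kde_narrow.sum() * dx)
print("integral (wide):  ", kde_wide.sum() * dx)
```

Plotting `kde_narrow` and `kde_wide` against `grid` next to the histogram shows the trade-off directly: both are valid densities, but the bandwidth decides how much of the sampling noise survives.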

If you check out SciPy's gaussian_kde you will also find it there, as bw_method. Its documentation even points to some particular methods for choosing the bandwidth size, namely Scott's or Silverman's. Naturally, gaussian_kde implies using a Gaussian kernel. seaborn's kdeplot also uses kde with some bandwidth selection.
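A short sketch of how this looks with SciPy, assuming SciPy is installed. The data here are synthetic and the fixed value 0.5 is arbitrary; `bw_method` and the `factor` attribute are part of `scipy.stats.gaussian_kde`.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
samples = rng.normal(size=1000)  # hypothetical 1D data

# The same data with three different bandwidth choices.
kde_scott = gaussian_kde(samples, bw_method="scott")
kde_silverman = gaussian_kde(samples, bw_method="silverman")
kde_fixed = gaussian_kde(samples, bw_method=0.5)  # scalar: used directly as the factor

# `factor` scales the sample standard deviation to give the kernel width.
sigma = samples.std(ddof=1)
for name, kde in [("scott", kde_scott), ("silverman", kde_silverman), ("fixed", kde_fixed)]:
    print(f"{name:10s} factor={kde.factor:.4f} kernel width={kde.factor * sigma:.4f}")

# Evaluate one of the estimates on a grid.
grid = np.linspace(-4.0, 4.0, 201)
density = kde_silverman(grid)
```

Note that in SciPy the bandwidth is expressed as a factor relative to the spread of the data, so a scalar `bw_method` is not the kernel width itself.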

Now for the hard question, the choice of the bandwidth bw: this is the lifelong question of "how many bins are good for a histogram?". There you have a few options. Similarly, for kde you have some rules of thumb. kramersmoyal implements the simple Silverman's rule-of-thumb ¹, which relates the data's variance to the selection of the bandwidth bw. It is admittedly very simple, and it is one of the things I'd like to work on next, i.e., add other bandwidth selection methods (and something that at least tells the user which bandwidth was selected :) ).
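For reference, a sketch of the textbook normal-reference form of this rule for one-dimensional data and a Gaussian kernel, h = σ (4 / (3n))^(1/5) ≈ 1.06 σ n^(-1/5). This is an assumption about which variant is meant: the exact expression and any kernel-dependent prefactor used inside kramersmoyal may differ, and `silverman_bandwidth` is a name made up for this example.

```python
import numpy as np


def silverman_bandwidth(samples):
    """Normal-reference rule of thumb for a 1D Gaussian kernel (textbook form)."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    sigma = samples.std(ddof=1)  # spread of the data
    # The bandwidth grows with the spread and shrinks slowly with sample size.
    return sigma * (4.0 / (3.0 * n)) ** 0.2


rng = np.random.default_rng(2)
small = rng.normal(scale=2.0, size=100)      # hypothetical short series
large = rng.normal(scale=2.0, size=10_000)   # hypothetical long series

h_small = silverman_bandwidth(small)
h_large = silverman_bandwidth(large)
print("n = 100:    bw =", h_small)
print("n = 10000:  bw =", h_large)
```

The rule assumes roughly Gaussian data, so for strongly skewed or multimodal series it tends to oversmooth, and trying a few values of bw around it is a sensible check.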

Advantages over histograms: The paper Kernel-based regression of drift and diffusion coefficients of stochastic processes discusses the advantages of kde over histograms for stochastic processes. It even proposes a new bandwidth selection method (which, again, I would like to implement as well... but I haven't :P )

Feel free to ask more if you need; I prefer these things be clear rather than somewhat hidden and unexplained. It is true that I explain this neither in the docs nor in the paper; somehow this is always relegated to "Statistician's problems, not mine", but I will definitely include some explanation in the docs in the following update.

¹ Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. London: Chapman & Hall/CRC. p. 45. ISBN 978-0-412-24620-3.

rodrigomp84 commented 3 years ago

Dear Leonardo,

this was extremely helpful, thank you very, very much for the quick and detailed answer. I should have gone through kde before trying to use the code; I wasn't familiar with it. Now I have a clearer idea and know where to look. This Physics Letters paper is very helpful by the way, at least for a physicist like myself!

Thank you very much! :)