JuliaStats / KernelDensity.jl

Kernel density estimators for Julia

Sensible Default bin size #28

Closed OmriTreidel closed 8 years ago

OmriTreidel commented 8 years ago

Currently the kde methods either require the user to provide the number of bins or the midpoints, or they default to 2048. This can be a problem for small datasets. It would be nice to have a sensible default like the one in http://stats.stackexchange.com/questions/798/calculating-optimal-number-of-bins-in-a-histogram or some other rule of thumb.

`bin_size = 2 * iqr(data) * length(data)^(-1/3)  # iqr as provided by StatsBase`

`midpoints = minimum(data):bin_size:maximum(data)`
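For reference, the rule of thumb proposed here (the Freedman-Diaconis rule from the linked answer) can be sketched with only the `Statistics` standard library. The names `fd_bin_width` and `fd_midpoints` are hypothetical helpers for illustration, not part of KernelDensity.jl.

```julia
using Statistics  # stdlib: provides quantile

# Freedman-Diaconis rule: bin width = 2 * IQR * n^(-1/3).
# Hypothetical helper, not a KernelDensity.jl function.
function fd_bin_width(data::AbstractVector{<:Real})
    q1, q3 = quantile(data, [0.25, 0.75])  # first and third quartiles
    return 2 * (q3 - q1) * length(data)^(-1/3)
end

# Evenly spaced points from the smallest to the largest observation,
# one bin width apart (also a hypothetical helper).
function fd_midpoints(data::AbstractVector{<:Real})
    lo, hi = extrema(data)
    return lo:fd_bin_width(data):hi
end

data = collect(1.0:100.0)
width = fd_bin_width(data)
mids = fd_midpoints(data)
println("bin width = ", width, ", number of midpoints = ", length(mids))
```

Note that the range runs from the minimum to the maximum; a range written from the maximum to the minimum with a positive step would be empty.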

ararslan commented 8 years ago

That seems like a good idea to me. Would you be interested in putting together a PR for this?

OmriTreidel commented 8 years ago

Yep, I'm already working on it.

simonbyrne commented 8 years ago

Note that the choice of the number of bins here should be different from the choice for a histogram.

In a histogram, you choose the number of bins as a method of avoiding overfitting (i.e. regularization).

For a KDE, the number of bins just affects the numerical resolution of the resulting function, so you want to choose as many as your computational budget allows (up to the resolution of your screen, or whatever needs you have). Ideally it should also be a power of 2 to gain the most advantage from the FFTs. The regularization is handled by the kernel function.

The 2048 was admittedly a pretty arbitrary pick, based on scaling up R's choice (512) by a bit.
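A minimal sketch of the point above, assuming a plain Gaussian kernel evaluated directly rather than the package's FFT-based method: the bandwidth `h` sets the smoothing, while the grid only decides where the smooth curve is sampled. The helper `gaussian_kde` and the rule-of-thumb bandwidth are assumptions made for illustration, not KernelDensity.jl internals.

```julia
using Statistics  # stdlib: provides std

# Direct (non-FFT) Gaussian kernel density estimate at each point of `grid`.
# Hypothetical helper; the bandwidth `h` controls smoothing, not the grid.
function gaussian_kde(data, grid, h)
    n = length(data)
    return [sum(exp(-0.5 * ((x - xi) / h)^2) for xi in data) / (n * h * sqrt(2π))
            for x in grid]
end

data = [-1.3, -0.4, 0.1, 0.2, 0.9, 1.7]
h = 1.06 * std(data) * length(data)^(-1/5)  # rule-of-thumb bandwidth (assumed here)

coarse = range(-4, 4; length = 9)      # 9 evaluation points, step 1
fine   = range(-4, 4; length = 2049)   # 2049 evaluation points, step 1/256

d_coarse = gaussian_kde(data, coarse, h)
d_fine   = gaussian_kde(data, fine, h)

# Both grids contain x = 0 (index 5 and index 1025); the estimate there
# is the same, so the finer grid changes resolution only.
same_at_zero = isapprox(d_coarse[5], d_fine[1025])
println("estimates agree at x = 0: ", same_at_zero)

# Rounding a desired grid size up to a power of 2, as suggested for FFTs.
npoints = nextpow(2, 1500)
println("npoints = ", npoints)
```

With a fixed bandwidth, moving from 9 to 2049 points leaves the estimate at shared locations unchanged, which is why a histogram-style bin rule does not carry over.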

OmriTreidel commented 8 years ago

Thanks for the comment, I hadn't noticed that. This ticket seems rather pointless then. Unless there is another reason to do it?

ararslan commented 8 years ago

We could implement a different, more data-aware default than 2048. Perhaps there's some literature that recommends something along those lines for kernel density estimation rather than histograms?

OmriTreidel commented 8 years ago

I think Simon is right, it doesn't seem to make any difference for the resulting density other than sampling.

ararslan commented 8 years ago

> I think Simon is right

Agreed. After all, when isn't he right? 😄

OmriTreidel commented 8 years ago

Whenever he is talking to his wife/girlfriend ;)