ZELLMECHANIK-DRESDEN / ShapeOut

Shape-Out has been superseded by Shape-Out 2.
https://shapeout.readthedocs.io
GNU General Public License v2.0

Mode extraction #72

Closed. SteGol4 closed this issue 8 years ago

SteGol4 commented 8 years ago

Hey everyone,

The way the mode is currently extracted from the population does not seem to work correctly: only the most frequent number is returned. That is not necessarily the mode of the distribution, because a value can occur several times by chance far away from the population maximum, while values near the maximum may be duplicated less often or not at all (this is the case for some datasets I tested).

One could avoid this issue by:

a) carefully rounding the values (e.g. keeping only the first 3 decimal places instead of 4) so that the most frequent number is highly likely to represent the mode of the distribution (at the cost of accuracy)

b) creating a histogram of the population, choosing the bin size e.g. by the Freedman-Diaconis rule, and taking the center of the highest bin as the mode (at the risk of ambiguity, since the mode then depends on the bin size)

c) plotting the cumulative frequency, which requires no binning or rounding, and finding the inflection point of the curve (which is the mode), e.g. by numerically differentiating it (at the risk that the inflection point is wrong or hard to find if the data set is widely spread)
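
To illustrate the problem and option (a), here is a minimal sketch on synthetic data. The log-normal sample, the seed and the helper name `most_frequent` are assumptions made for illustration only; none of this is taken from the Shape-Out code.

```python
import numpy as np


def most_frequent(values):
    """Return the value that occurs most often (ties: the smallest one)."""
    uniq, counts = np.unique(values, return_counts=True)
    return uniq[np.argmax(counts)]


# Hypothetical skewed sample, stored with 4 decimal places
rng = np.random.default_rng(0)
data = np.round(rng.lognormal(mean=-3.0, sigma=0.5, size=2000), 4)

# Most frequent stored value: driven by chance repeats of individual numbers
raw_mode = most_frequent(data)

# Option (a): round to 3 decimal places first, then count
rounded_mode = most_frequent(np.round(data, 3))

print("most frequent raw value:    ", raw_mode)
print("most frequent rounded value:", rounded_mode)
```

With only a handful of repeats per 4-decimal value, the first result can jump around between samples, whereas the rounded variant pools many more events per value and should land closer to the peak of the distribution.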

These are my naive ideas about the mode, which in my eyes is a good representation of a skewed distribution (the median can be far off!).

Best, Stefan

paulmueller commented 8 years ago

I am marking this as a bug, because the mode in the current implementation does not represent a useful physical quantity.

I would vote for (a), because it is probably the fastest. Computing the histogram first (b) might be a little slower. Freedman-Diaconis could still be used in (a) for rounding, though.

@phidahl Is there already an existing solution that is fast?

phidahl commented 8 years ago

The mode depends on the distribution function you assume. For deformation values we have so far assumed a log-normal distribution. As far as I remember, I obtained the parameters of the log-normal pdf by transforming the deformation data into log-space, calculating the mean and standard deviation with numpy, and using these as the parameters of the log-normal pdf. The formulas for mean, mode and so on as functions of µ and σ are given here: https://en.wikipedia.org/wiki/Log-normal_distribution
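
The procedure described above can be sketched as follows. This is a reconstruction from the description, not the original code; the function name `lognormal_mode` is hypothetical. It uses the standard log-normal relations mode = exp(µ - σ²), median = exp(µ) and mean = exp(µ + σ²/2), where µ and σ are the mean and standard deviation of the log-transformed data.

```python
import numpy as np


def lognormal_mode(values):
    """Mode of a log-normal pdf whose parameters are estimated in log-space.

    Assumes all values are strictly positive (as deformation values are).
    """
    logs = np.log(np.asarray(values, dtype=float))
    mu = logs.mean()     # mean of the log-transformed data
    sigma = logs.std()   # standard deviation of the log-transformed data
    return float(np.exp(mu - sigma**2))


# Usage on a hypothetical sample drawn with mu = -3 and sigma = 0.5
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=-3.0, sigma=0.5, size=100000)
print(lognormal_mode(sample), "expected about", np.exp(-3.0 - 0.5**2))
```

Note that the result is only meaningful if the data really are close to log-normal; for other shapes the formula returns a number that need not coincide with the peak of the distribution.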

I would discourage fitting to a histogram, because it is unclear how to bin properly in an automated way.

For other quantities one could assume a Gaussian pdf, in which case mode and mean coincide.

SteGol4 commented 8 years ago

I agree that fitting to a histogram is very sensitive to the binning of the data. Consequently, I like the idea of just returning the most common value of the distribution after intelligent rounding of the data. However, this has to be done with caution. What do you think?

phidahl commented 8 years ago

Rounding and counting would be the same as binning.
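
A small sketch of why the two are the same: rounding to d decimal places assigns every value to a bin of width 10^-d centered on the rounded value, so counting rounded values yields exactly the counts of such a histogram. The synthetic data and variable names below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=0.05, scale=0.01, size=5000)  # hypothetical sample

decimals = 3

# "Rounding and counting"
rounded_values, rounded_counts = np.unique(
    np.round(data, decimals), return_counts=True)

# "Binning": integer index of the bin of width 10**-decimals
# (centered on multiples of that width) into which each value falls
bin_index = np.rint(data * 10**decimals).astype(int)
occupied_bins, bin_counts = np.unique(bin_index, return_counts=True)

# Both approaches yield identical counts for identical bin centers
same_counts = np.array_equal(rounded_counts, bin_counts)
same_centers = np.allclose(rounded_values, occupied_bins / 10**decimals)
print(same_counts, same_centers)
```

The only difference to a general histogram is that rounding fixes the bin width to a power of ten instead of adapting it to the data.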

paulmueller commented 8 years ago

Wouldn't it be unintuitive for the user if different distributions were assumed for different data columns (log-normal for deformation, Gaussian for the others)?

I think that the mode column should be the same for all data columns. The bin size could be set with the Freedman-Diaconis rule. No fitting or assumption for a distribution is required. For deformation data, a log-normal-mode value could be displayed in addition. Is this acceptable?

phidahl commented 8 years ago

That would be fine for me. The log-transformation approach is simply what has been used so far. I don't know whether anyone wants to keep it.

paulmueller commented 8 years ago

I have implemented the Freedman-Diaconis rule. I am not implementing the mode of the lognormal distribution. If that is required, please open a new issue.
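
For readers who want to see what a histogram mode with the Freedman-Diaconis rule can look like, here is a minimal sketch. It is not the code that went into Shape-Out; the function names, the handling of non-finite values and the fallback to the median for a zero interquartile range are assumptions. The rule sets the bin width to h = 2 * IQR / n^(1/3).

```python
import numpy as np


def freedman_diaconis_width(values):
    """Bin width h = 2 * IQR / n**(1/3)."""
    q75, q25 = np.percentile(values, [75, 25])
    return 2.0 * (q75 - q25) / len(values) ** (1.0 / 3.0)


def histogram_mode(values):
    """Center of the most populated bin; expects at least one finite value."""
    values = np.asarray(values, dtype=float)
    values = values[np.isfinite(values)]  # drop nan and inf (assumption)
    width = freedman_diaconis_width(values)
    if width == 0:
        # Degenerate case (zero IQR): fall back to the median (assumption)
        return float(np.median(values))
    n_bins = max(1, int(np.ceil((values.max() - values.min()) / width)))
    counts, edges = np.histogram(values, bins=n_bins)
    k = np.argmax(counts)  # first bin wins if several share the maximum
    return float(0.5 * (edges[k] + edges[k + 1]))


# Usage on a hypothetical skewed sample
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=-3.0, sigma=0.5, size=20000)
print("histogram mode:", histogram_mode(sample))
print("median:        ", np.median(sample))
```

NumPy also accepts `bins="fd"` in `np.histogram`, which applies the same rule internally; the explicit version above just makes the bin width visible.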