biolab / orange3

🍊 Orange: Interactive data analysis
https://orangedatamining.com

Normalize Features with Preprocess widget #4925

Closed Pino-BQ closed 4 years ago

Pino-BQ commented 4 years ago

I am trying to normalize data using the Normalize Features option of the Preprocess widget. I select the first option: Standardize to µ=0, σ²=1. The results are not exactly the same as those I obtain with other software such as Excel or Knime. Apparently the difference arises because Orange uses the population standard deviation rather than the sample standard deviation to perform the normalization. Is this correct? Would it be possible to use the sample standard deviation instead?

I am using Orange 3.26.0 on macOS Mojave (10.14.6).

janezd commented 4 years ago

What do you mean by the deviation of the sample? Which sample? What does your workflow look like?

Pino-BQ commented 4 years ago

The sample is the column data. The difference is between using the population standard deviation, sqrt(Σ(x-µ)^2/N), and the sample standard deviation, sqrt(Σ(x-µ)^2/(N-1)), where µ is the column mean.
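
For readers who want to check the two formulas side by side, here is a minimal sketch using only the Python standard library. The `column` values are made up for illustration, and this is not Orange's own code; it only shows that dividing by N matches `statistics.pstdev` and dividing by N - 1 matches `statistics.stdev`.

```python
import math
import statistics

# Hypothetical column of values; any numeric column would do.
column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(column)
mean = statistics.mean(column)
# Sum of squared deviations from the column mean.
ss = sum((x - mean) ** 2 for x in column)

pop_sd = math.sqrt(ss / n)           # population formula: divide by N
sample_sd = math.sqrt(ss / (n - 1))  # sample formula: divide by N - 1

# The standard library exposes both variants under different names.
assert math.isclose(pop_sd, statistics.pstdev(column))
assert math.isclose(sample_sd, statistics.stdev(column))

print(pop_sd, sample_sd)
```

The sample value is always the larger of the two, by a factor of sqrt(N/(N-1)), which is why tools that use different divisors give slightly different standardized columns.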

Pino-BQ commented 4 years ago

Hello, I have a simple data table and I want to normalize the columns. The question is which equation you use to compute the standard deviation: the one for the sample, sqrt(Σ(x-µ)^2/(N-1)), or the one for the population, sqrt(Σ(x-µ)^2/N). I thought that the correct one is the sample standard deviation, but I am not completely sure. Thanks for your help.

Best Regards,

Pino

janezd commented 4 years ago

Oh, I'm sorry. I thought you meant different data. Now I see.

I think the answer to your suggestion would be "no". This "sample standard deviation" is an "unbiased estimate of the population standard deviation from a sample". The problem with the biased one is that it also estimates the mean value from the sample, so the degrees of freedom are N - 1, not N.

  1. The correction (dividing by N - 1) makes additional assumptions, in particular that the sample was drawn with replacement. This is usually not the case.
  2. Division by N - 1 corrects the variance, but not the deviation. Correcting the deviation requires knowing the underlying distribution (see, e.g. https://en.wikipedia.org/wiki/Standard_deviation#Unbiased_sample_standard_deviation).
  3. We might call the above two points nitpicking. They are, but then dividing by N - 1 is also nitpicking for any reasonably large N. For small N's, the estimates will be bad (probably more or less useless) anyway. :)
  4. I would embrace the suggestion if the purpose of the widget was to estimate the population variance. The purpose of this widget is to put the data (most often a number of variables with very different scales, like age, height, number of bicycles owned, and average weekly miles that this person runs in the Jardín del Turia) on the same scale. Here the divisor doesn't matter much. N is simply simpler, although N - 1 would be "less incorrect".
  5. How do we know that the data does not in fact represent the entire population? If we wanted to be totally correct, we'd need to add a checkbox to the widget to let the user decide. But then -- you wrote "I thought that the correct one [is] the sample standard deviation, but I am not completely sure." A great majority of users (including me, most of the time!) wouldn't know which option to use and would either use the default or lose time pondering an inconsequential setting. We try to avoid overwhelming the user with too many parameters, in particular when they do not make any practical difference.
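
To make points 3 and 4 concrete, here is a rough sketch (not Orange's implementation) that standardizes one column with both divisors. The `standardize` helper, its `ddof` parameter (a name borrowed from NumPy's convention) and the `ages` data are all hypothetical. It shows that the two results differ only by the constant factor sqrt((N-1)/N), so the choice rescales every value in a column identically.

```python
import math

def standardize(values, ddof):
    """Return z-scores, dividing the sum of squares by N - ddof."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - ddof))
    return [(x - mean) / sd for x in values]

# Hypothetical column: 100 made-up ages between 20 and 64.
ages = [20 + (i * 7) % 45 for i in range(100)]

z_pop = standardize(ages, ddof=0)     # divide by N
z_sample = standardize(ages, ddof=1)  # divide by N - 1

n = len(ages)
factor = math.sqrt((n - 1) / n)  # about 0.995 for N = 100

# Every sample-based z-score equals the population-based one times `factor`.
max_gap = max(abs(zs - zp * factor) for zs, zp in zip(z_sample, z_pop))
print(factor, max_gap)
```

With N = 100 the two scalings differ by about half a percent, and the gap shrinks as N grows.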

Still, thanks for your suggestion -- if for nothing else, it forced me to think about this, which is good. :)

Pino-BQ commented 4 years ago

It is clear enough. Thanks a lot for your explanation.

Pino
