BlueBrain / NeuroM

Neuronal Morphology Analysis Tool
https://neurom.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Summary of kstest scores from multiple distributions #225

Closed: lidakanari closed this issue 8 years ago

lidakanari commented 8 years ago

I need to get a summarizing score from a set of statistical tests between pairs of distributions that represent different features:

score1 = score(f1_dataset1, f1_dataset2)
score2 = score(f2_dataset1, f2_dataset2)
...

total_score = norm(scores, p) # scores: all previous || p: p-norm
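
For concreteness, a minimal sketch of that idea in Python, assuming SciPy for the two-sample KS statistic and NumPy for the p-norm; the dict-of-arrays layout and the helper names are illustrative only, not NeuroM API:

import numpy as np
from scipy import stats

def feature_scores(dataset1, dataset2, features):
    # Two-sample KS statistic per feature; each dataset maps feature name -> array of values.
    return [stats.ks_2samp(dataset1[f], dataset2[f]).statistic for f in features]

def total_score(scores, p=2):
    # Combine the per-feature scores with a p-norm.
    return np.linalg.norm(np.asarray(scores, dtype=float), ord=p)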
chalimou commented 8 years ago

Different features have different distributions, and it is not possible to average their p-values or even their scores (known as effect sizes in statistics). How can you average apples and oranges? At some point you can only decide whether apples or oranges are more important to you.

You could only summarise scores gained in nearly identical experimental settings, for example when you test a drug in several groups under the same conditions, but that does not seem to be the case here.


lidakanari commented 8 years ago

I completely agree with you. However, we need a way to summarize the data because we need a ranking function. The function is generic over any set of distributions; the user has to be careful to choose a set of distributions that makes sense for them.

So, for example, a normalization of the features is required if you really need to compare (or add) scores.
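
One concrete (if simplistic) reading of "normalization of the features" is to put every feature on a common scale before comparing; a minimal sketch, assuming min-max rescaling (minmax_normalize is a made-up helper for this example). Note that rank-based scores such as the KS statistic are already scale-free, so normalization mainly matters for scale-dependent scores:

import numpy as np

def minmax_normalize(values, lo=None, hi=None):
    # Rescale a feature's values to [0, 1]; lo/hi default to the sample range.
    values = np.asarray(values, dtype=float)
    lo = values.min() if lo is None else lo
    hi = values.max() if hi is None else hi
    if hi == lo:
        return np.zeros_like(values)  # constant feature: nothing to rescale
    return (values - lo) / (hi - lo)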

chalimou commented 8 years ago

If I understand correctly, the question is how to summarise scores across features in order to get a global score for the cell. Averaging across potentially different distributions and underlying processes is not possible by statistical means. I strongly discourage providing functionality like this, because people tend to "press buttons" without thinking and then blame the stupid results on the button providers.

One can always arrive at a final decision by common sense, making up sensible rules. That decision lacks statistical support, though. Maybe you have a more detailed situation in mind which I do not know. But if the goal is to summarise scores such as a section-length comparison (exponential, let's say) together with a bifurcation-angle comparison (modelled by a gamma), sorry, NO.


eleftherioszisis commented 8 years ago

@chalimou consider a more detailed example:

We have one manually reconstructed neuron which is the ground truth cell g

Then we also have N algorithms (a1, a2, ..., aN) that automatically reconstruct this neuron.

So, in order to sort these algorithms with respect to how good a job they have done, we thought we could extract the available features for each algorithm and combine the KS scores against the ground truth:

g (f1,f2,f3,f4)
a1 (f1,f2,f3,f4)
a2 (f1,f2,f3,f4)

score_a1 = p_norm(ks(a1.f1, g.f1), ks(a1.f2, g.f2)...)
score_a2 = p_norm(ks(a2.f1, g.f1), ks(a2.f2, g.f2)...)
.
.
.

and then sort with respect to the scores.

Is this a wrong approach?
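
For concreteness, a sketch of that ranking in Python, assuming each reconstruction (and the ground truth) is a dict mapping feature names to value arrays; algorithm_score and rank_algorithms are made-up names for this example, not NeuroM API:

import numpy as np
from scipy import stats

def algorithm_score(recon, ground_truth, features, p=2):
    # p-norm of the per-feature KS distances to the ground-truth cell.
    ks = [stats.ks_2samp(recon[f], ground_truth[f]).statistic for f in features]
    return np.linalg.norm(ks, ord=p)

def rank_algorithms(reconstructions, ground_truth, features, p=2):
    # Sort reconstructions by ascending overall score (smaller = closer to ground truth).
    scores = {name: algorithm_score(recon, ground_truth, features, p)
              for name, recon in reconstructions.items()}
    return sorted(scores.items(), key=lambda item: item[1])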

lidakanari commented 8 years ago

First of all, we are not fitting distributions ;)

You have dataset1 vs dataset2, which is, for example, the section length of neuron1_version1 vs the section length of neuron1_version2 (you expect the differences to be minimal). That is not a fitted distribution but a set of numbers representing a feature. From each such pair you get a score (KS, Mann-Whitney, any score). We need a way to combine the scores (not the fitted distributions or the data themselves). Any suggestions are more than welcome, keeping in mind that we need a result by the end of the Hackathon.
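
For illustration, a small sketch of swapping the underlying two-sample score, assuming SciPy; pair_score and the method names are made up here. Since the Mann-Whitney U statistic grows with sample size, its p-value is the quantity that is comparable across features, so the sketch reports 1 - p as a rough distance-like score:

from scipy import stats

def pair_score(sample1, sample2, method="ks"):
    # Distance-like score between two samples of a single feature.
    if method == "ks":
        return stats.ks_2samp(sample1, sample2).statistic
    if method == "mannwhitneyu":
        # U itself depends on the sample sizes, so use 1 - p as a rough score instead.
        return 1.0 - stats.mannwhitneyu(sample1, sample2, alternative="two-sided").pvalue
    raise ValueError("unknown method: %s" % method)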

berthelm commented 8 years ago

Hi Eleftherios and Nancy:

It looks like one is not averaging over the features themselves (which may have strange and different distributions) but over the KS scores, which are distances bounded below by 0. So one is just adding up a series of non-negative numbers for each algorithm, and the sums are ranked according to their magnitude.

This seems ok to me, or am I missing something?

Julian

chalimou commented 8 years ago

Thanks very much, Eleftherios, for the detail. This decision making falls in the category of "common sense". You average the scores with a norm, in a kind of loss function (see optimization), in order to decide which reconstruction procedure is better. You might even weight the scores if you have favourite features. Then you say: the reconstruction procedure with the best overall score wins the race. There is nothing wrong with that; you want to arrive at a decision somehow.

Only take care not to perceive, or sell, this "overall score" as an "average ks-distance", since there is no distribution associated with it. The ks-distance is the maximum distance between two empirical cumulative distribution functions; the "overall score" has no such distributions behind it.

Summarising: what you propose is a sensible heuristic to help you decide between reconstruction procedures and systems. It is like evaluating a loss functional in optimization. It lacks statistical meaning, since it does not have an associated distribution the way the ks-distance does, but there is nothing wrong with that as long as you are clear about it yourself and make it clear to your clients. Use a formulation like "we build an overall score for the reconstruction process using the ks-distances of the feature distributions".

I hope I made myself clear; just ask if you have questions. This reminds me a bit of the difference between statistical and practical significance.
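
For what it is worth, a minimal sketch of that weighted "overall score" reading, assuming the per-feature ks-distances are already computed; overall_score and the weights are illustrative only:

import numpy as np

def overall_score(ks_distances, weights=None, p=2):
    # Weighted p-norm of per-feature KS distances; a heuristic loss, not itself a KS distance.
    d = np.asarray(ks_distances, dtype=float)
    w = np.ones_like(d) if weights is None else np.asarray(weights, dtype=float)
    return np.linalg.norm(w * d, ord=p)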
