Update the way we use K-S tests to assist breakpoint selection

ATimHewson commented 3 years ago

We need a more complete way of using K-S tests to advise the user on potential breakpoints.

The key outputs we would like to send to the screen, to advise the user, are the standard K-S test outputs:

the D-statistic (max separation in y of the two CDFs), and
the P-value (likelihood that the CDF separation seen did not occur by chance)

A. We would like to see 1 and 2, from the outset, across a range of possible breakpoints for all the data (this is different to what is currently done, where the code assumes successive removal of left-of-breakpoint data prior to each re-application of the K-S test). So the first set of K-S test runs like this might result in the user selecting one breakpoint.

B. After A the user will likely want to repeat the above for all data on one side of the selected breakpoint, to see if another breakpoint could be utilised there. And then they might look at the portion of data that lies on the other side of the first breakpoint in the same way. And then they might want to further divide up in the same way.

To start A we need a way of deciding how many independent K-S tests to run, and a way to output the results. It is difficult to advise on how many tests would be optimal - it is a compromise between run time, and dataset size. Perhaps 20 would be good value to start out with (this could perhaps be a user-defined value?).

Then how do we decide which 20 potential breakpoints to test? There are two main options. One would be to evenly divide up the governing variable range, between max and min, into 21 subsets and use those 20 breakpoints. A second would be to rank all governing variable values and divide into 21 equally sized subsets, and accordingly use the dividing points that correspond; these would of course be less evenly spread than in the first option. Probably the second case is preferable to give a general overview.

As regards conveying the output of the 20(?) tests, we could have i) tabular text, and/or ii) a graph. The graph is visually more appealing but would be more work. In each case we need to convey what the breakpoint is (in governing variable units), and what the values of 1 and 2 above are. If we went for the graphical option we would need two y-axes, a linear one for the D-statistics and non-linear for the P-value (some form of logarithmic?). Values for the P-value may be limited to a maximum of 99.99 (2 decimal places) if what I remember from my limited python experience is correct. This may prove a bit of a limitation, because as I understand it very high P values which could be useful to us strictly need a supercomputer to derive (!), but let's see.

With the above implemented the process of semi-subjective but informed selection of breakpoints by the user should be much improved (with clear graphical output as supporting evidence, if we can achieve that).

Priority for this issue should be to get something working with tabular output; the graphical output may take quite a bit longer and so is lower priority unless it can be done quickly.

FatimaPillosu commented 3 years ago

Hi @onyb ,

on the basis of what Tim mentioned, I have created the Matlab code to show you what we might need to have in Python. The Matlab considers all the cases in the node that you are analysing. Then, the code runs the K-S for a pre-determined number of breakpoints determining the K-S's D statistic and the pvalue of the test (i.e. significance levels of the K-S test). Notice that the smaller the pvalue, the more significant the test result is. Finally, the code plots the Dstat and the (-1*log(pvalue)), in the same diagram for each of the considered breakpoints (see figure below). We have implemented a diagram with two different scales. The code provides you with the layout for the diagram that we might want to have, but if you have better ideas, please feel free to comment on them.

I have included in the code two different cases (i.e. case1: node at the top of the decision tree; case2: a node at the bottom of the decision tree). I did that to highlight the differences that you will encounter in these two situations. For case1, the pvalues are so so so so small that, even applying the log transformation, you cannot see anything, they are just set to 0. However, when you are at the bottom of the tree (where the node has a smaller number of cases), the pvalues start increasing and then you can see something in the diagram. Matlab applying a super high precision, so you can see that the pvalues, although very very very very small, they are not 0. I don't know how that might behold by Python. We need to see.

I'm putting this in priority 1 as, from what @ATimHewson told me, @EstiGascon and @AugustinVintzileos need this implementation as soon as possible. Thus, if there is something that it is not clear from Tim's explanation or my code, please, let's have a meeting as soon as you can. I think it would be quicker to discuss any problems directly rather than via email or via issues.

I add directly in the issue the Matlab code and the diagram for case2. I have also added in google drive the PDT.csv file that I used to create the plots, so we both start from the same data, and if results are different we can discuss them.

Cheers,

Fatima

Matlab Code

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%                                                               %
% TEST ON NEW WAY ON CREATING THE BREAKPOINTS FOR DECISION TREE %
%                                                               %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% The used K-S function is described here:
% https://uk.mathworks.com/help/stats/kstest2.html#btno0gd-ks2stat

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% INPUTS
InputData = 'PDT.mat';
NumBP = 20;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

load(InputData)

% Selection of the decision tree node to analyse and the predictor to consider

% % CASE 1
% predictor = "cpr";
% node = "Top of the tree";
% units = "-";
% VAR = cpr;

% CASE 2
predictor = "sr24h";
node = "(cpr>0.25 & cpr<0.75) & (tp>10 & tp<=inf) & (wspd700>20 & wspd700<=inf) & (cape>-inf & cape<=50)";
units = "J/kg";
VAR = sr24h(cpr>0.25 & cpr<0.75 & tp_acc>10 & tp_acc<=9999 & wspd700>20 & wspd700<=9999 & cape_wa>-9999 & cape_wa<=50);

[VAR, ind] = sort(VAR);
SizeVAR = length(VAR);
Len_SplitsVAR = round(SizeVAR/(NumBP+1));
BPs_ind = (1:NumBP)' * Len_SplitsVAR;
BPs = VAR(BPs_ind);
BPs = unique (BPs);

indBP_1 = min(VAR);
indBP_3 = max(VAR);
pvalue_vec = zeros(length(BPs),1);
Dstat_vec = zeros(length(BPs),1);

for i = 1 : length(BPs)

    disp(i)

    indBP_2 = BPs(i);

    data2test_1 = FER(VAR>=indBP_1 & VAR<indBP_2);
    data2test_2 = FER(VAR>=indBP_2 & VAR<=indBP_3); 

    if ~isempty(data2test_1) && ~isempty(data2test_2)
        [h,pvalue,Dstat] = kstest2(data2test_1,data2test_2, 'Alpha', 0.05);

        pvalue_vec(i) = pvalue;
        Dstat_vec(i) = Dstat;
    end

end

figure
yyaxis left
plot(BPs,Dstat_vec, "o-", "LineWidth", 1.5)
title({["Node"], node})
xlabel(strcat("Breakpoints for ", predictor, " [", units, "]"))
ylabel("K-S Dstat")

yyaxis right
plot(BPs, (-1 * log(pvalue_vec)),  "o-", "LineWidth", 1.5)
ylabel("log(pvalue)")

Example

FatimaPillosu commented 3 years ago

Here there is an idea on how the GUI for the new way of using the K-S test might look like. Scan 28 Jan 2021.pdf

FatimaPillosu commented 3 years ago

Added the mockup for the new way of running the K-S test. Mockup.pptx

AugustinVintzileos commented 3 years ago

Thanks Fatima,

This looks very helpful! Question: Slide 7 why are you choosing 167? Combination of Dstat and pvalue?

Best,

Augustin

From: Fatima Pillosu notifications@github.com Sent: Friday, February 26, 2021 12:18 PM To: esowc/ecPoint-Calibrate ecPoint-Calibrate@noreply.github.com Cc: Augustin Vintzileos Augustin.Vintzileos@ecmwf.int; Mention mention@noreply.github.com Subject: Re: [esowc/ecPoint-Calibrate] Update the way we use K-S tests to assist breakpoint selection (#145)

Added the mockup for the new way of running the K-S test. Mockup.pptxhttps://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fesowc%2FecPoint-Calibrate%2Ffiles%2F6051313%2FMockup.pptx&data=04%7C01%7CAugustin.Vintzileos%40ecmwf.int%7C8dd9de64b41942795ec808d8da7a7d3b%7C21b711c6aab74d369ffbac0357bc20ba%7C0%7C1%7C637499566904974782%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=EI8K51Ni%2FhBd6PM1nOp87nkLYRQLq%2FbQMMWLtYINbak%3D&reserved=0

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fesowc%2FecPoint-Calibrate%2Fissues%2F145%23issuecomment-786779547&data=04%7C01%7CAugustin.Vintzileos%40ecmwf.int%7C8dd9de64b41942795ec808d8da7a7d3b%7C21b711c6aab74d369ffbac0357bc20ba%7C0%7C1%7C637499566904974782%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=0i6Q%2Bd%2BruGa6GQj2OCSIhPz19JqNDfA5aw3bVLXNGOk%3D&reserved=0, or unsubscribehttps://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FARHK63II6OTYCXYYT6KEKR3TA7JVBANCNFSM4WI4Q7KA&data=04%7C01%7CAugustin.Vintzileos%40ecmwf.int%7C8dd9de64b41942795ec808d8da7a7d3b%7C21b711c6aab74d369ffbac0357bc20ba%7C0%7C1%7C637499566904984786%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=7p4o2EHUOloBuRAIpilqF28ZcN7dyTc9YXX1MpTz%2BrE%3D&reserved=0.

FatimaPillosu commented 3 years ago

Hi Augustin, those numbers are totally pulled out from the air. Please, don't give them any physical meaning. They were just there to help Anirudha visualize the GUI for the K-S test. Cheers, Fatima

FatimaPillosu commented 3 years ago

Hi @onyb ,

I'm suggesting the following changes to the GUI for the K-S test:

onyb commented 3 years ago

@FatimaPillosu I have a question regarding the bug that's causing spikes in the graph. If I recall correctly, our hypothesis was that we were considering the first and last item in the (sorted) predictor, and agreed to not consider them as breakpoints.

Just wanted to check if my understanding is correct with the following example:

Imagine we have 100 values in our predictor, and want to 5 breakpoints. Which of the following breakpoint indexes should be chosen? Note that the indexes range from 0 to 99.

A. [ 0, 25, 50, 74, 99]
Remark: this is the current version

B. [ 1, 25, 50, 74, 98]
Remark: pick 5 breakpoints, excluding first index (0) and last index (99)

C. [16, 33, 50, 66, 82]
Remark: pick 7 breakpoints ([0, 16, 33, 50, 66, 82, 99]), then exclude first index (0) and last index (99)

FatimaPillosu commented 3 years ago

Hi @onyb ,

the right answer is C. Indeed, you will have: n. of elements in each class = n. of tot points / (n. of breakpoints + 1) = 100 / (5 + 1) = 16.66666 If you round this number (to the lowest or to the highest value, it does not make any difference) you will get the indexes in C. There is no need at all to include the min and max value of your dataset in the list of indexes.

Cheers,

Fatima

FatimaPillosu commented 3 years ago

Hi Anirudha, The release 0.23.0 looks really good. Just a small point. In the diagram below, it would be really good to have also the units of the variable in the x axis.

onyb commented 3 years ago

@FatimaPillosu I was going to cover the list of things that I did not include in v0.23.0 because it would've otherwise taken me longer to ship this release. Here's the list of items to expect in the next release:

[x] The scale above the figure indicating min and max ranges.
[x] Display range of definitive breakpoint selected.
[x] Make the < and > buttons selectable, as in the mockups.
[x] Make definitive breakpoints editable and add a delete button to each.

Regarding the units on the x-axis, it turned out to be trickier than I thought. In the second module of the software, we don't have access to the predictor files, hence it's not possible to extract the units. We can try to read the units from the comments in the ASCII tables, but they are not structured in a way that's easy to parse. It may be slightly easier to do this with Parquet files, which has dedicated storage for metadata where we can store the units in a structured format.

onyb commented 3 years ago

Sneak peak

No breakpoint selected

Primary breakpoint selected (not definitive)

I'm going to keep this issue open until the next release is published.

onyb commented 3 years ago

Fixed in v0.24.0.

Note: Displaying the number of elements for a given definitive breakpoints range was not implemented since the value must be recomputed whenever the breakpoints are edited. Will tackle this separately.

ecmwf / ecpoint-calibrate