garethjns / Kaggle-EEG

Seizure prediction from EEG data using machine learning. 3rd place solution for Kaggle/Uni Melbourne seizure prediction competition.

Question on interpretation of output results #11

Closed ThomasDang93 closed 7 years ago

ThomasDang93 commented 7 years ago

Hi there, I recently finished running the code after 5 days of training and testing over the data. I consider myself a beginner in machine learning, so I have a few questions on the output of the code.

Firstly, the predict.m file generated a .csv file that contains columns for "File" and "Class". Here is a screenshot of what that output looks like.


I guess my question is, why are there values lower than 0 and greater than 1? I looked at the evaluation overview on the Kaggle competition here, and it seems that the values for the Class variable are supposed to be probabilities (between 0 and 1).

The second question I wanted to ask: in the README you said that "If seeds are now setting correctly, should score ~0.8059 (= 2nd place)".

When I ran train.m, I got an output like this: [screenshot of train.m output showing the AUC scores]

So are those AUC scores the ones that you were referring to in the README? Or is the README referring to some other score that is generated by predict.m? When I ran your code I did not get a score from predict.m, so right now I can only assume that the score I am supposed to look at is the AUC of SVM and RBT, but maybe you can clarify what those mean for me?

garethjns commented 7 years ago

Hi Thomas,

Glad you got the code running. Out of interest, what were the specs of the computer you ran it on? I'm interested because this version of the code has the "manual" parallel processing aspects removed, although some of the MATLAB functions (FFT, model training, etc.) are inherently multithreaded. Training on my machine using already-extracted features takes around 2700 s (it should be much faster if you run the training again, as the features are saved so they don't need extracting again – I'm assuming feature extraction is the stage that took up most of the time).

To answer your questions:

The output.csv is formatted as required by the Kaggle competition. Although the prediction column is called 'Class', you're correct that it should actually be the prediction probability rather than the class label (which seems to be fairly common in Kaggle classification competitions). Generally, this means that if you apply a threshold yourself and submit class labels it'll still work, but it's likely to increase the loss and harm your score.

In this case, the scoring metric is AUC, and the Kaggle scoring script normalises the probabilities first (I think by z-score). Regardless of the normalisation, I don't think the absolute range of the values matters for calculating AUC – it's the ranking of the predictions that matters, rather than their absolute values.
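
If you want to convince yourself of that, here's a quick toy check (nothing to do with the repo, just MATLAB's perfcurve on made-up data) – rescaling the predictions so they fall outside [0, 1] leaves the AUC unchanged:

labels = [0 0 0 1 0 1 1 1]';                          % toy ground-truth labels
scores = [0.1 0.3 0.2 0.9 0.4 0.7 0.6 0.8]';          % toy prediction "probabilities"
[~, ~, ~, aucRaw]    = perfcurve(labels, scores, 1);
[~, ~, ~, aucScaled] = perfcurve(labels, 5*scores - 2, 1);   % values now < 0 and > 1
disp([aucRaw, aucScaled])                              % identical AUCs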

The AUC scores printed to the command window by train.m are the scores for the two individual models on the training data. The values are the average of the AUC over the 6 CV folds – the individual per-fold scores are in RBTg.AUCs{k} and SVMg.AUCs{k}, where k is the fold.
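
For example, assuming RBTg.AUCs and SVMg.AUCs are cell arrays holding one scalar AUC per fold (which is what the {k} indexing above suggests), the printed values should just be the per-fold means:

meanRBTAUC = mean(cell2mat(RBTg.AUCs))    % average over the 6 folds
meanSVMAUC = mean(cell2mat(SVMg.AUCs))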

You can also get the training AUC and plot for each model with the commands:

[X,Y,~, RBTAUC] = perfcurve(featuresTrain.labels, RBTg.predict(featuresTrain.dataSet), 1); plot(X,Y)
[X,Y,~, SVMAUC] = perfcurve(featuresTrain.labels, SVMg.predict(featuresTrain.dataSet), 1); plot(X,Y)

The score in the readme is for the combined predictions of both models on the test data (as in the 'Class' column of the submission file), i.e.

YPred = mean(zscore(RBTPred), zscore(SVMPred))

The combined predictions score significantly better than the individual model predictions, probably because the models both overfit slightly, but to different aspects of the data.

Predict.m won’t give a score itself as the labels for the test set are unknown, so the submission file needs to be submitted on Kaggle to get the final score – this may not score correctly for you, though, if you only have the first test set.

ThomasDang93 commented 7 years ago

My hardware is:

Processor: Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz (4 CPUs), ~2.2GHz
RAM: 8 GB DDR3
Graphics: Intel(R) HD Graphics 5500

Also, I decided to run the command below:

[X,Y,~, RBTAUC] = perfcurve(featuresTrain.labels, RBTg.predict(featuresTrain.dataSet), 1); plot(X,Y)
[X,Y,~, SVMAUC] = perfcurve(featuresTrain.labels, SVMg.predict(featuresTrain.dataSet), 1); plot(X,Y)

Those two lines of code did give me a plot and two AUC scores. The SVMAUC came out as 0.9512 and the RBTAUC as 0.8353. So I am curious, why are these AUC scores higher than the ones generated by train.m? Is it because they don't use cross-validation?

I also tried to run: YPred = mean(zscore2(RBTPred), zscore2(SVMPred)), but the command window kept saying "Undefined function or variable 'RBTPred'".

I looked at zscore2.m and I am confused as to what arguments I should use with zscore2, since the function is written as:

function [z,mu,sigma] = zscore2(x,flag,dim)

I have not fully understood your entire code since I am still learning the basics of MATLAB, so I wouldn't be surprised if there is just something minor that I am missing. I would be really thankful if you could help me generate this combined score of SVM and RBT. And thanks for the help you have already given me so far.

garethjns commented 7 years ago

I'm not entirely sure about the differences in the AUC values. The models both use K-fold cross-validation, and the fit for each fold has its own AUC score. The value printed to the command line is the average of these scores. The score from the code above is the AUC calculated after averaging across the predictions from each fold. These values will be different, but it's not immediately clear why they're so different...

When you run

YPred = mean(zscore2(RBTPred), zscore2(SVMPred))

Get the predictions from each model and save them in the variables RBTPred and SVMPred first, i.e.:

RBTPred = RBTg.predict(featuresTrain.dataSet)
SVMPred = SVMg.predict(featuresTrain.dataSet)
YPred = mean(zscore2(RBTPred), zscore2(SVMPred))

The zscore2 function is the same as MATLAB's zscore function, but it handles NaNs using the nanmean and nanstd functions (rather than mean and std). It doesn't need the last two inputs (flag and dim) in this case, so don't worry about those.
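
If you're curious what it's doing under the hood, for a data matrix x it's just standardisation that ignores NaNs – roughly the following (a sketch of the idea, not the exact code in zscore2.m):

% z-score along the first dimension, ignoring NaNs
mu    = nanmean(x);                 % column means, NaNs excluded
sigma = nanstd(x);                  % column standard deviations, NaNs excluded
sigma(sigma == 0) = 1;              % avoid dividing by zero for constant columns
z = bsxfun(@rdivide, bsxfun(@minus, x, mu), sigma);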

Once you have the combined predictions (in YPred) you can then get the AUC in a similar way as with the individual models:

[X,Y,~, overallAUC] = perfcurve(featuresTrain.labels, YPred, 1); plot(X,Y)

Exactly what this value will be for the training data, I'm not sure!

ThomasDang93 commented 7 years ago

Okay, I will send you an email now. Also, I tried running this series of commands that you showed me:

RBTPred = RBTg.predict(featuresTrain.dataSet)
SVMPred = SVMg.predict(featuresTrain.dataSet)
YPred = mean(zscore2(RBTPred), zscore2(SVMPred))

But I got an error on YPred. I did successfully save SVMPred and RBTPred, but the mean function keeps throwing this error:

Error using sum
Dimension argument must be a positive integer scalar within indexing range.

Error in mean (line 116)
        y = sum(x, dim, flag)/size(x,dim);

So it looks like the sum function does not accept negative values. I noticed that the second argument of sum is dim, which seems to explain why it does not accept negative values, since a dimension cannot be negative. So should I try to combine both RBTPred and SVMPred together? If so, how would I do it in a way that reflects an accurate score?

garethjns commented 7 years ago

Sorry for the confusion, that's actually the wrong command – it's missing the concatenation with [], so mean was treating the second prediction vector as its dimension argument, which is what caused the error. It should be:

YPred = nanmean([zscore2(RBTPred), zscore2(SVMPred)], 2)
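
That puts the two z-scored prediction vectors side by side as columns and averages across them row by row; it's the same thing, just split into steps:

combined = [zscore2(RBTPred), zscore2(SVMPred)];   % N x 2 matrix, one column per model
YPred    = nanmean(combined, 2);                   % N x 1 combined prediction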

When it comes to predicting the test set there's an additional step as well. The sets are subdivided into short windows but the Kaggle submission only needs one prediction per 10 minute file. The predictions for each 10 minute file are averaged in this part of predict.m:

% Predict for each epoch
% Using seizureModel.predict()
preds.Epochs.RBTg = RBTg.predict(featuresTest.dataSet);
preds.Epochs.SVMg = SVMg.predict(featuresTest.dataSet);

% Compress predictions nEpochs -> nFiles (nSegs)
% Take predictions for all epochs, reduces these down to length of fileList
% Total number of epochs
nEps = height(featuresTest.dataSet);
% Number of epochs per subSeg
eps = featuresTest.SSL.Of(1);

% Convert SubSegID to 1:height(fileList)
accArray = reshape(repmat((1:nEps/eps),eps,1), 1, nEps)';

% Use to accumulate values and average
fns = fieldnames(preds.Epochs);
for f = 1:numel(fns)
    fn = fns{f};
    preds.Segs.(fn) = accumarray(accArray, preds.Epochs.(fn))/eps;
end
clear accArray
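
As a toy illustration of what that accumarray step does (made-up numbers, 3 epochs per file): it sums the epoch-level predictions for each file and divides by the number of epochs, giving one averaged prediction per file.

epochPreds = [0.2; 0.4; 0.6; 0.1; 0.3; 0.5];               % 6 epoch-level predictions
epsPerFile = 3;                                            % epochs per file
nEps       = numel(epochPreds);
accArray   = reshape(repmat(1:nEps/epsPerFile, epsPerFile, 1), 1, nEps)';
filePreds  = accumarray(accArray, epochPreds) / epsPerFile % -> [0.4; 0.3]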

Then the final step (including the across-model normalisation and averaging bit):

% Combined sub: SVMg and RBTg
saveSub([note,'SVMgRBTg'], featuresTest.fileLists, ...
    nanmean([zscore2(preds.Segs.RBTg),zscore2(preds.Segs.SVMg)],2), ...
    params)