cjlin1 / libsvm

LIBSVM -- A Library for Support Vector Machines
https://www.csie.ntu.edu.tw/~cjlin/libsvm/
BSD 3-Clause "New" or "Revised" License

Nested cross-validation #163

Open ziqianwang9 opened 4 years ago

ziqianwang9 commented 4 years ago

Dear Lin, thanks for providing this useful toolbox. I'm using it in a paper I'm trying to publish, and I've run into a problem with a reviewer, who suggested that I use nested cross-validation. Here is the script I used for my study:

clear all;
load median20190923.mat

%leave-one-out cross-validation
w = zeros(size(data_all));% weight
h = waitbar(0,'please wait..');

for i = 1:size(data_all,1)
    waitbar(i/size(data_all,1),h,[num2str(i),'/',num2str(size(data_all,1))])
    new_DATA = data_all;
    new_label  = label;
    test_data   = data_all(i,:); new_DATA(i,:) = []; train_data = new_DATA;
    test_label   = label(i,:);new_label(i,:) = [];train_label = new_label;

%  Data Normalization
    [train_data,PS] = mapminmax(train_data',0,1);
    test_data          = mapminmax('apply',test_data',PS);
    train_data = train_data';
    test_data   = test_data';

    % RFE feature selection
    step = 1;
    ftRank = SVMRFE(train_label,train_data, step,'-t 0');
    IX = ftRank(1:ceil(length(ftRank)*0.4));

    [bestacc,bestc] = SVMcgForClass_NoDisplay_linear(train_label,train_data(:,IX),-10,10,5,0.1);
    cmd = ['-t 0 ', ' -c ',num2str(bestc),' -w1 2 -w-1 1'];

    model = svmtrain(train_label,train_data(:,IX),cmd);
    w(i,IX)   = model.SVs'*model.sv_coef; 
    [predicted_label, accuracy, deci] = svmpredict(test_label,test_data(:,IX),model);
    acc(i,1) = accuracy(1);
    deci_value(i,1) = deci;
%     clear  test_data  train_data test_label train_label model IX k
end
w_msk = double(sum(w~=0,1)==size(w,1));
w = mean(w,1).*w_msk;
acc_final = mean(acc);
disp(['accuracy - ',num2str(acc_final)]);

% ROC
[X,Y,T,AUC] = perfcurve(label,deci_value,1);
figure;plot(X,Y);hold on;plot(X,X,'-');
xlabel('False positive rate'); ylabel('True positive rate');

for i=1:length(X)
    Cut_off(i,1) = (1-X(i))*Y(i);
end
[~,maxind] = max(Cut_off);
Specificity = 1-X(maxind);
Sensitivity = Y(maxind);
disp(['Specificity= ', num2str(Specificity)]);
disp(['Sensitivity= ', num2str(Sensitivity)]);

fprintf('Permutation test ......\n');
Nsloop = 5000;
auc_rand = zeros(Nsloop,1);
for i=1:Nsloop
    label_rand = randperm(length(label));
    deci_value_rand = deci_value(label_rand);
    [~,~,~,auc_rand(i)] = perfcurve(label,deci_value_rand,1);
    clear label_rand
end
p_auc = (length(find((auc_rand > AUC)))+1)/(Nsloop+1);
disp(['Pvalue= ', num2str(p_auc)]);

Here, what I used is leave-one-out cross-validation, but the reviewer suggested that I use nested cross-validation (e.g. Varoquaux et al., NeuroImage, 2017) with K-fold. Since I am not familiar with nested cross-validation, is it possible to perform it based on your libsvm? If so, could you please give me some clue about how to achieve this?

Best, Ziqian

cjlin1 commented 4 years ago

To implement CV in matlab what you need to do are

  • randomly permute the data by randperm()

  • use a for loop to get each validation fold:

num_per_fold = ceil(num_data/num_fold);
for i = 1 : num_fold
    range = (i-1)*num_per_fold + 1 : min(num_data, i*num_per_fold);

  • then use this "range" to extract the validation fold; the training fold can be obtained in a similar way

  • then do training/prediction, and aggregate the results to get the CV accuracy

  • for nested CV I think you mean 2-level CV. You can use a 2-level for loop for that

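Filling in that outline, here is a minimal sketch of K-fold CV with LIBSVM's MATLAB interface. This is only a sketch, not a definitive recipe: data_all and label are the variables from the script above, and the linear kernel with C = 1 is a placeholder.

% Minimal K-fold CV sketch (assumes LIBSVM's svmtrain/svmpredict are on the path)
num_fold = 5;
num_data = size(data_all,1);
perm = randperm(num_data);                 % randomly permute the data once
num_per_fold = ceil(num_data/num_fold);
cv_acc = zeros(num_fold,1);
for i = 1:num_fold
    range = (i-1)*num_per_fold + 1 : min(num_data, i*num_per_fold);
    va_idx = perm(range);                  % validation fold
    tr_idx = setdiff(perm, va_idx);        % remaining data form the training fold
    model = svmtrain(label(tr_idx), data_all(tr_idx,:), '-t 0 -c 1');
    [~, a, ~] = svmpredict(label(va_idx), data_all(va_idx,:), model);
    cv_acc(i) = a(1);                      % a(1) is the classification accuracy
end
fprintf('CV accuracy = %.2f%%\n', mean(cv_acc));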

ziqianwang9 commented 4 years ago

Thank you for your reply. As far as I know, nested CV is not simply 2-level CV. This figure illustrates what nested CV is: [figure attachment not shown]

Nested CV has an inner-loop CV nested inside an outer CV. The inner loop is responsible for model selection/hyperparameter tuning (it plays the role of a validation set), while the outer loop is for error estimation (the test set).

My question is: how does '[bestacc,bestc] = SVMcgForClass_NoDisplay_linear(train_label,train_data(:,IX),-10,10,5,0.1)' perform the hyperparameter tuning? Does it use a similar method? If not, can we combine it with SVMcgForClass_NoDisplay_linear? Any response would be helpful.
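For concreteness, a nested-CV skeleton along these lines might look as follows. This is a sketch under assumptions: SVMcgForClass_NoDisplay_linear (the third-party grid-search helper from the script above, not part of LIBSVM) serves as the inner-loop tuner, fold assignment is simplified, and the normalization/RFE steps from the script would also have to be refit inside each outer-training fold to avoid leakage.

% Nested CV sketch: the outer loop estimates error, the inner loop tunes C
num_data = size(data_all,1);
outer_k = 5;
perm = randperm(num_data);
fold_id = mod(0:num_data-1, outer_k) + 1;   % assign each permuted sample to a fold
acc = zeros(outer_k,1);
for o = 1:outer_k
    te_idx = perm(fold_id == o);            % outer test fold, touched only once
    tr_idx = perm(fold_id ~= o);            % outer training fold
    % inner loop: 5-fold grid search for C on the outer-training data only
    [~, bestc] = SVMcgForClass_NoDisplay_linear(label(tr_idx), data_all(tr_idx,:), -10, 10, 5, 0.1);
    model = svmtrain(label(tr_idx), data_all(tr_idx,:), ['-t 0 -c ', num2str(bestc)]);
    [~, a, ~] = svmpredict(label(te_idx), data_all(te_idx,:), model);
    acc(o) = a(1);
end
fprintf('nested CV accuracy = %.2f%%\n', mean(acc));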

Best, Ziqian


ziqianwang9 commented 4 years ago

Dear Lin, I found that nested CV adds a grid search to every iteration of the inner loop. If it is 5-fold, it computes 5 values of best-c and then takes their arithmetic/geometric/power mean. Here is also a description (translated from Chinese): the idea has two loops: (1) the outer loop is ordinary cross-validation; (2) the inner loop is a sub-optimization problem that uses grid search to find the best parameters for the model on the current sub-problem. Grid search amounts to enumerating a finite set of points in parameter space (each point corresponds to one parameter setting); each setting yields one model's performance, and the best-performing model is selected.

However many folds the cross-validation uses, you end up with that many sets of model parameters; if your model is stable, these sets should be similar.

I don't know whether this is the state of the art, but it should be a good way to solve the problem of information 'leaking'. Could we manage to implement it with your libsvm toolbox?
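One possible building block for that inner loop: LIBSVM's MATLAB svmtrain returns the cross-validation accuracy directly when given the '-v' option, so a simple inner grid search over C could be sketched as below. Here train_label/train_data stand for one outer-training fold, and the log2 range -10..10 mirrors the range used in the script above.

% Inner-loop grid search sketch using svmtrain's built-in CV ('-v 5')
best_acc = -inf; bestc = 1;
for log2c = -10:10
    c = 2^log2c;
    acc_c = svmtrain(train_label, train_data, ['-t 0 -v 5 -c ', num2str(c)]);
    if acc_c > best_acc
        best_acc = acc_c; bestc = c;   % keep the best C found so far
    end
end
fprintf('best C = %g (inner 5-fold CV accuracy %.2f%%)\n', bestc, best_acc);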

Best, Ziqian
