cjlin1 / libsvm

LIBSVM -- A Library for Support Vector Machines
https://www.csie.ntu.edu.tw/~cjlin/libsvm/
BSD 3-Clause "New" or "Revised" License

what is the relationship between the number of training samples and the performance? #102

Open carrierlxk opened 7 years ago

carrierlxk commented 7 years ago

I have run into an issue: when I increase the number of training samples (num=112000, dim=25), the performance of my SVM classifier is inferior to the one trained on down-sampled training samples (num=22000, dim=25). Could you explain this phenomenon? @cjlin1

cjlin1 commented 7 years ago

This is possible if the parameters have not been properly selected.


carrierlxk commented 7 years ago

Thanks for your reply. We chose the RBF kernel, so the parameters you mentioned are C and gamma. Do you mean we should change the optimal values of C and gamma when we increase the number of training samples? Specifically, our SVM achieved its best performance on the small training set, but when we train the SVM on a large training set, obtained by repulating the original samples five times, the performance declines a lot. Could you explain the reason? Thanks!

leepei commented 7 years ago

I don't know the word repulating, but let me guess you mean replicating here. That is essentially the same as using five times C on the original small data. Using the same data points multiple times will not improve performance anyway: the best performance you can get is the same as that from the original data, while the replicated points cost you more training time because the solver is unaware of the duplication. You need fresh data points.
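The equivalence above can be checked directly on the SVM primal objective: replicating every point k times multiplies the hinge-loss sum by k, which is the same as keeping the original data and using C*k. A minimal sketch with toy 1-D points (hypothetical values, not the poster's data):

```python
# The linear-SVM primal objective is
#   0.5*||w||^2 + C * sum_i max(0, 1 - y_i * (w*x_i + b))
# Replicating each point k times multiplies the loss sum by k,
# which is identical to keeping the original data and using C*k.

def hinge_objective(w, b, C, data):
    """Primal objective of a linear SVM on (y, x) pairs with 1-D features."""
    reg = 0.5 * w * w
    loss = sum(max(0.0, 1.0 - y * (w * x + b)) for y, x in data)
    return reg + C * loss

data = [(1, 2.0), (-1, -1.0), (1, 0.5)]   # toy 1-D points (hypothetical)
replicated = data * 5                      # each point appears 5 times

w, b, C = 0.5, 0.0, 1.0
obj_replicated = hinge_objective(w, b, C, replicated)
obj_scaled_C = hinge_objective(w, b, 5 * C, data)
print(obj_replicated == obj_scaled_C)  # True for any fixed w, b
```

Because the objectives agree for every (w, b), both problems have the same optimal separator; only the training time differs.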

carrierlxk commented 7 years ago

Thanks for the explanation. My question is this: if I use the full sample (num=112000, dim=25) to train a binary SVM, the cross-validation result is that the accuracy of predicting -1 is 0.57 while the accuracy of predicting 1 is 0.29. When we down-sample (sub-sample num=22000, dim=25) and train the binary SVM, the cross-validation accuracy of predicting -1 is 0.59 while the accuracy of predicting 1 is 0.61. In both cases the ratio of label 1 is about 0.41, so the SVM clearly works much worse on the full sample. We also took the decision values from cross-validation, sorted them in ascending order, and split them into 50 levels; for each level we calculated the ratio of label 1. For the sub-sample, the ratio of label 1 increases monotonically with the decision-value level, but for the full sample the relationship is very strange. Do you know the reason? Thanks!
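The monotonicity check described above can be sketched in a few lines (toy values, not the poster's data): sort the cross-validation decision values, split them into equal-sized levels, and compute the fraction of label +1 per level. For a well-behaved classifier this fraction should increase with the decision value.

```python
def positive_ratio_per_level(decision_values, labels, n_levels):
    """Sort points by decision value, split into n_levels equal-sized
    bins, and return the fraction of label +1 in each bin."""
    pairs = sorted(zip(decision_values, labels))
    size = len(pairs) // n_levels
    ratios = []
    for i in range(n_levels):
        chunk = pairs[i * size:(i + 1) * size]
        ratios.append(sum(1 for _, y in chunk if y == 1) / len(chunk))
    return ratios

# toy example: decision values roughly track the label
dvals = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
labels = [-1, -1, -1, 1, -1, 1, 1, 1]
print(positive_ratio_per_level(dvals, labels, 4))  # [0.0, 0.5, 0.5, 1.0]
```

With 50 levels on real cross-validation output, a non-monotone ratio curve is a useful symptom of poorly chosen C and gamma.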

cjlin1 commented 7 years ago

I think you should do parameter selection to check the best CV accuracy


carrierlxk commented 7 years ago

In fact, we have trained the SVM on three sample sets: the full sample with a very large size, a sub-sample with 1/5 the size of the full sample (obtained by sampling), and a set obtained by replicating the sub-sample 5 times. We did parameter selection for the sub-sample using the grid search described in the LIBSVM manual, but for the other two large sets we reused the sub-sample's parameters, because grid search is time-consuming when the sample size is very large. As mentioned, performance is much worse on the two large sets. So do you mean we should run the grid search again for the two large sets? We also read on a webpage that one disadvantage of SVM is that performance starts to degrade as the number of samples grows (https://dataaspirant.com/2017/01/13/support-vector-machine-algorithm/). Do you agree with this point, and can you explain it? Thanks.
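The grid search from the LIBSVM guide can be scripted in a few lines. This sketch only generates the svm-train commands (the file name train.scaled is a placeholder for your own scaled training file), using the exponent grid suggested in the guide: C = 2^-5, 2^-3, ..., 2^15 and gamma = 2^-15, 2^-13, ..., 2^3, each evaluated with 5-fold cross-validation (-v 5).

```python
import itertools

# Exponent grid recommended in the LIBSVM practical guide
c_exps = range(-5, 16, 2)   # C in {2^-5, 2^-3, ..., 2^15}
g_exps = range(-15, 4, 2)   # gamma in {2^-15, 2^-13, ..., 2^3}

# "train.scaled" is a placeholder file name, not from this thread
commands = [
    f"svm-train -v 5 -c {2.0 ** c} -g {2.0 ** g} train.scaled"
    for c, g in itertools.product(c_exps, g_exps)
]
print(len(commands))  # 11 * 10 = 110 parameter pairs to evaluate
```

The bundled tools/grid.py does the same search (with parallelism and a contour plot); the point here is just that each (C, gamma) pair is one cross-validation run, which is why the search gets expensive on large sets.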

cjlin1 commented 7 years ago

You may try C <- C/(ratio of the large set size to the small set size), but in general you should do parameter selection on the large set as well.
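The scaling heuristic above is simple arithmetic; a sketch with a hypothetical C found by grid search on the subsample (the value 8.0 is illustrative, not from this thread):

```python
C_small = 8.0           # hypothetical C tuned on the 22000-sample subset
fold = 112000 / 22000   # ratio of full set to subsample, about 5.09

# Starting point for the search on the full set; still worth refining
# with a small grid around this value.
C_start = C_small / fold
print(round(C_start, 2))  # 1.57
```

This only gives a starting point: it undoes the implicit C-inflation of having fold times more loss terms, but the best (C, gamma) on the full set can still differ.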
