Regression Error at boundaries, is normalization on output required?

mchinen commented 4 years ago

When training with svm-train -s 4 -t 2 -n .6 -c .4 <myfile>, I find that the predictions are heavily compressed. For example, myfile has labels in the 1 to 5 range, with a significant share in 4 to 5, but the highest predicted value on the training set is below 4.0. There also seem to be fewer predictions in the 1.0 to 2.0 range than the labels would suggest.

I've played with NU_SVR and EP_SVR and the other parameters and haven't found a good solution to this. Here is my train file. Even when normalizing the labels to 0-1 I get the same behavior, where the highest predicted value is .72.

First, I'd like to know if I'm doing something incorrectly. Next, if this is a correct model, why is it so compressed? I would like the predictions to be closer to the boundaries of the training labels. I understand that we would expect some compression towards the mean in regression, but this seems more than I would expect. Should I normalize the predicted output to match the input label distribution?

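Unnormalized: mysvmtrainfile.txt (https://github.com/cjlin1/libsvm/files/3980481/mysvmtrainfile.txt)
Normalized: normsvmtrain.txt (https://github.com/cjlin1/libsvm/files/3980504/normsvmtrain.txt)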

cjlin1 commented 4 years ago

It seems you haven't done proper parameter selection:

./gridregression.py ~/Downloads/mysvmtrainfile.txt
...
[local] -1 -5 -8 0.55566 (best c=16.0, g=1.0, p=0.25, mse=0.294086)
16.0 1.0 0.25 0.294086
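For readers unfamiliar with what a grid search over c, g, and p involves, here is a minimal Python sketch of the idea. It is not the code of gridregression.py: the function name `grid_search`, the grid ranges, and the toy scoring function `toy_mse` are all assumptions made for illustration. In a real run the scoring function would be a cross-validated mean squared error, for example obtained from svm-train with the -v option.

```python
import itertools


def grid_search(evaluate, log2c_range, log2g_range, log2p_range):
    """Try every (c, g, p) on a log2-spaced grid and keep the lowest score.

    `evaluate` is any callable returning an error (lower is better) for
    the given cost c, RBF gamma g, and epsilon p.
    """
    best = None
    for lc, lg, lp in itertools.product(log2c_range, log2g_range, log2p_range):
        c, g, p = 2.0 ** lc, 2.0 ** lg, 2.0 ** lp
        mse = evaluate(c, g, p)
        if best is None or mse < best[0]:
            best = (mse, c, g, p)
    return best


def toy_mse(c, g, p):
    """Hypothetical stand-in for cross-validation error.

    It is constructed so that its minimum sits at the parameters
    reported in the thread (c=16, g=1, p=0.25); it does not train anything.
    """
    return (c - 16.0) ** 2 + (g - 1.0) ** 2 + (p - 0.25) ** 2


# Illustrative ranges only; the real tool's defaults may differ.
best = grid_search(toy_mse, range(-1, 7), range(-8, 1), range(-8, 0))
print(best)
```

The point of the sketch is that the search is exhaustive over powers of two, so parameters carried over from an earlier model (such as -c .4) may sit far from the best grid point for new data.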

libsvm-3.24$ ./svm-train -s 3 -c 16 -g 1 -p 0.25 ~/Downloads/mysvmtrainfile.txt
..
optimization finished, #iter = 1778
nu = 0.509791
obj = -979.425784, rho = -2.770594
nSV = 238, nBSV = 161
libsvm-3.24$ ./svm-predict ~/Downloads/mysvmtrainfile.txt mysvmtrainfile.txt.model o
Mean squared error = 0.208275 (regression)
Squared correlation coefficient = 0.786998 (regression)

A cross-validation r^2 of about 0.78 isn't too bad.
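To make the two reported numbers concrete, the sketch below computes a mean squared error and a squared Pearson correlation coefficient in plain Python. The assumption here is that svm-predict reports the standard squared Pearson correlation; the function name `regression_metrics` and the toy data are made up for illustration. The toy predictions are deliberately compressed toward the mean, which shows that compression raises the MSE but leaves the squared correlation untouched, since correlation is invariant to a linear rescaling of the predictions.

```python
def regression_metrics(targets, predictions):
    """Return (mse, squared_correlation) for two equal-length sequences."""
    n = len(targets)
    sum_y = sum(targets)
    sum_v = sum(predictions)
    sum_yy = sum(y * y for y in targets)
    sum_vv = sum(v * v for v in predictions)
    sum_vy = sum(v * y for v, y in zip(predictions, targets))

    mse = sum((v - y) ** 2 for v, y in zip(predictions, targets)) / n
    numerator = (n * sum_vy - sum_v * sum_y) ** 2
    denominator = (n * sum_vv - sum_v ** 2) * (n * sum_yy - sum_y ** 2)
    return mse, numerator / denominator


# Toy labels spanning 1 to 5, and predictions squeezed to half that spread.
labels = [1.0, 2.0, 3.0, 4.0, 5.0]
compressed = [3.0 + 0.5 * (y - 3.0) for y in labels]  # 2.0 .. 4.0

mse, r2 = regression_metrics(labels, compressed)
print(mse, r2)
```

So a good squared correlation on its own does not rule out compressed predictions; comparing the range or histogram of predictions against the labels is a separate check.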

libsvm-3.24$ wc -l o
376 o
libsvm-3.24$ grep -e "4." o |wc -l
85
libsvm-3.24$ cut -f 1 -d ' ' ~/Downloads/mysvmtrainfile.txt | grep -e "4." |wc -l
100
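The grep comparison above can also be done numerically. In a grep pattern the dot matches any character, so "4." also matches a value such as 3.41; the sketch below counts only values that actually fall in a half-open interval. The helper names `read_first_column` and `count_in_range` and the sample numbers are assumptions for illustration, not part of LIBSVM.

```python
def read_first_column(path):
    """Read the first whitespace-separated field of each non-empty line as a float.

    Works for both a LIBSVM training file (label first) and an
    svm-predict output file (one prediction per line).
    """
    with open(path) as handle:
        return [float(line.split()[0]) for line in handle if line.strip()]


def count_in_range(values, low, high):
    """Count values v with low <= v < high."""
    return sum(1 for v in values if low <= v < high)


# Made-up sample data standing in for the label and prediction columns.
sample_labels = [1.2, 4.1, 4.9, 3.4, 5.0]
sample_predictions = [1.8, 3.9, 4.3, 3.4, 3.95]

labels_in_bin = count_in_range(sample_labels, 4.0, 5.0)
predictions_in_bin = count_in_range(sample_predictions, 4.0, 5.0)
print(labels_in_bin, predictions_in_bin)
```

With real files, the two lists would come from read_first_column applied to the training file and to the prediction output.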



mchinen commented 4 years ago

Thanks so much, that does seem to be the issue. Until I read your PDF I hadn't realized how important the parameter search is, and I had simply reused our previous model's parameters. I modified grid.py to do a search and found better parameters, which were wildly different. I found I also needed to tune the nu parameter.

However, my problem was confounded by a second issue, which I have also resolved:

When I used the svm-predict binary (svm-predict myinput.txt mymodel.txt), I got the predictions I expected.
When I called svm_predict() after svm_load(mymodel.txt), I got different, incorrect predictions because I was zero-indexing the .index variable. Once I fixed that, things worked as expected.
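The indexing mistake is easy to reproduce, so here is a small Python sketch of the conversion from a dense feature list to index:value pairs. It assumes the training file uses the usual LIBSVM convention of feature indices starting at 1; the helper names `to_libsvm_nodes` and `to_libsvm_line` are hypothetical. If the model was trained on 1-based indices and the prediction code numbers features from 0, every feature is matched against its neighbor's weight, which would explain predictions that differ from those of the svm-predict binary.

```python
def to_libsvm_nodes(features):
    """Convert a dense feature list to (index, value) pairs.

    Indices start at 1 to match the training file; zero-valued
    features are skipped, as in the sparse text format.
    """
    return [(i, x) for i, x in enumerate(features, start=1) if x != 0.0]


def to_libsvm_line(label, features):
    """Format one example as a line of LIBSVM-style text."""
    pairs = " ".join(f"{i}:{x:g}" for i, x in to_libsvm_nodes(features))
    return f"{label:g} {pairs}".rstrip()


dense = [0.5, 0.0, 2.0]
print(to_libsvm_nodes(dense))
print(to_libsvm_line(4.5, dense))
```

When filling svm_node arrays in C, the same numbering applies to the index field, and the array is terminated by a node whose index is -1.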

cjlin1 / libsvm

Regression Error at boundaries, is normalization on output required? #158
