cognoma / machine-learning

Machine learning for Project Cognoma

SVC with rbf kernel #35

Closed htcai closed 8 years ago

htcai commented 8 years ago

The rbf kernel SVC takes much more time to train than does the LinearSVC. Therefore, I narrowed the search space of the two hyper-parameters: penalty C and kernel coefficient gamma.
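A narrowed search over C and gamma could look roughly like the sketch below. The synthetic data, the grid values, and the variable names are assumptions for illustration; they are not the values used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real feature matrix and labels (assumption).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

# The rbf kernel is expensive to fit, so only a handful of values
# are tried for each hyper-parameter (hypothetical values).
param_grid = {
    'svc__C': [0.1, 1, 10],
    'svc__gamma': [0.001, 0.01, 0.1],
}

pipeline = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
grid_search = GridSearchCV(pipeline, param_grid, scoring='roc_auc', cv=3)
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_  # chosen C and gamma
cv_auroc = grid_search.best_score_      # mean cross-validated AUROC
```

The cost of the search grows with the product of the two grid sizes times the number of folds, which is why trimming either list shortens training noticeably.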

In addition, I removed the cell of Top predictions amongst negatives, since the predicted probabilities are either 0 or 1, which might not be very informative. If it should be added back, I would be happy to do it.

Moreover, I removed the coefficients section, since coefficients are only available for SVCs with a linear kernel.

As for the results, the rbf kernel is very powerful and achieved an AUROC above 99% on the training data, but it does not generalize as well to the testing data. Still, the testing AUROC (90.2%) is better than that of the LinearSVC (89.2%). The F1 score of this SVC (0.777) is also slightly better than that of the LinearSVC (0.768).

dhimmel commented 8 years ago

Really cool findings.

I removed the cell of Top predictions amongst negatives, since the predicted probabilities are either 0 or 1, which might not be very informative.

I think you could still use decision_function -- I think the question is whether it's appropriate to use decision_function for SVMs to rank predictions.
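As a minimal sketch of that idea, the continuous output of decision_function can be used to order samples, including the negatives. The synthetic data and hyper-parameter values are assumptions, and the sketch does not settle whether this ranking is appropriate for SVMs; it only shows that the scores are not limited to 0 or 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

clf = SVC(kernel='rbf', C=1.0, gamma=0.01).fit(X_train, y_train)

# Continuous score relative to the decision boundary:
# larger values mean the sample looks more like the positive class.
scores = clf.decision_function(X_test)

# AUROC depends only on the ordering of the scores.
test_auroc = roc_auc_score(y_test, scores)

# "Top predictions amongst negatives": true negatives, highest score first.
negative_idx = np.flatnonzero(y_test == 0)
order = np.argsort(scores[negative_idx])[::-1]
top_negatives = negative_idx[order][:10]
```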

As for the results, the rbf kernel is very powerful and achieved an AUROC above 99% on the training data, but it does not generalize as well to the testing data. Still, the testing AUROC (90.2%) is better than that of the LinearSVC (89.2%).

Really severe overfitting but still decent testing performance. I wonder if there's a parameter setting that brings the training and testing performance in line?

Looks good, but will look at more in depth later.

dhimmel commented 8 years ago

Can you rebase your htcai:SVC branch on the current master?

htcai commented 8 years ago

@dhimmel Thanks for your comments! I will rebase my SVC branch and also try to reduce the over-fitting by adjusting parameters such as the number of features.

dhimmel commented 8 years ago

I wouldn't worry too much about the overfitting. If you look at your grid_search AUROC heatmap: your cross-validated performance estimates are spot on (90.3%). While the training AUROC of your best_estimator_ is high, your testing AUROC is as expected from cross-validation. While I think it's ideal to have identical training and testing performance, it just may not be possible with this algorithm.
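The comparison described here can be sketched as follows: the cross-validated score of the best grid point is the estimate to hold against the testing AUROC, while the training AUROC of best_estimator_ is expected to be optimistic. The data and grid values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

grid_search = GridSearchCV(
    SVC(kernel='rbf'),
    {'C': [1, 10], 'gamma': [0.01, 0.1]},  # hypothetical grid
    scoring='roc_auc',
    cv=3,
)
grid_search.fit(X_train, y_train)

# best_estimator_ is refit on all of the training data.
best = grid_search.best_estimator_
aurocs = {
    'training': roc_auc_score(y_train, best.decision_function(X_train)),
    'cross-validated': grid_search.best_score_,
    'testing': roc_auc_score(y_test, best.decision_function(X_test)),
}
for name, value in aurocs.items():
    print('{}: {:.1%}'.format(name, value))
```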

dhimmel commented 8 years ago

No need to close this PR! When pull requests are being worked on, it's customary to keep them open.

Since you deleted the branch, I'm not sure if you can reopen it.

htcai commented 8 years ago

Sorry, @dhimmel. I intended to sync my local master branch with the current master and then rebuild the SVC branch from the latest local master. Therefore, I deleted the old SVC branch both locally and remotely. I was not aware that this would close the pull request. I will try to restore the local branch, re-push it, and see whether I can re-open the pull request.

htcai commented 8 years ago

Following the suggestions from @gwaygenomics and @yl565, I kept 2000 features instead of 500. The training took close to 30 min, but the results look promising. The gap between the training AUROC (98.1%) and the testing AUROC (92.3%) is now 5.8 percentage points, down from roughly 9 before. The F1 score also improved considerably (0.815, up from 0.777).
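Changing the number of retained features can be sketched as a single pipeline parameter. The thread does not say how the features were selected, so the use of SelectKBest with f_classif here is an assumption, as are the synthetic data, the helper name make_svc_pipeline, and the small feature counts.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with many uninformative features (assumption).
X, y = make_classification(
    n_samples=200, n_features=100, n_informative=10, random_state=0)


def make_svc_pipeline(n_features):
    """Hypothetical helper: keep the top n_features, scale, then fit an rbf SVC."""
    return Pipeline([
        ('select', SelectKBest(f_classif, k=n_features)),
        ('standardize', StandardScaler()),
        ('classify', SVC(kernel='rbf')),
    ])


# Two settings, analogous to comparing 500 and 2000 retained features.
small = make_svc_pipeline(5).fit(X, y)
large = make_svc_pipeline(20).fit(X, y)

n_small = int(small.named_steps['select'].get_support().sum())
n_large = int(large.named_steps['select'].get_support().sum())
```

Keeping the selection step inside the pipeline means it is refit on each training fold during cross-validation, so the held-out fold does not influence which features are kept.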

Another way to handle over-fitting is to train the model with more data.

dhimmel commented 8 years ago

Therefore, I deleted the old SVC branch both locally and remotely. I was not aware that this would lead to the closing of the pull request.

Ah, if your master branch is up to date with the upstream (cognoma/machine-learning) master and your SVC branch is checked out, the following probably would have done the trick:

git rebase master

Anyways, looks like you've fixed the issue now. See how the "Files changed" tab is now correct.

htcai commented 8 years ago

Thanks for your instruction! Should I restore yesterday's commits? Is there anything else that I need to do for this SVC with rbf kernel?

dhimmel commented 8 years ago

Should I restore yesterday's commits?

Nope.

Pull request looks good to me. Will merge!

dhimmel commented 8 years ago

One thing before I merge. It looks like your commit is not associated with your GitHub account. GitHub shows this message:

Unrecognized author. If this is you, make sure hcai.uva@gmail.com is associated with your account. Click to add email addresses in your account settings.

See more here. Basically you need to make sure GitHub and your local git settings point to the same email.

dhimmel commented 8 years ago

It's not essential to fix this, but it's nice so you can get credit.

htcai commented 8 years ago

Thank you for your careful observation! I just added the email address above (i.e., my local Git address) to my GitHub account.