cognoma / machine-learning

Machine learning for Project Cognoma

Multi-task classification #80

Closed brankaj closed 7 years ago

brankaj commented 7 years ago

This notebook computes accuracy scores for genes that have targeted therapies. Since some agents target several genes, it is interesting to see whether building an ensemble model can improve results over a model that is built separately for each gene. Several methods support multi-task classification, including random forest. A multi-task version of Lasso also exists. These two methods are implemented in this notebook. The methods are evaluated using accuracy and precision scores.

Unfortunately, I am not sure whether this notebook has enough findings to be a pull request. The reasons are as follows:

  1. Lasso and multi-task Lasso produce no difference in results.
  2. Multi-task random forest improves results for some genes. For example, for the agent Ponatinib, there was a 1% improvement in accuracy score for the genes KDR and ABL.
  3. All methods seem to fail to recognize true positives. This is why the precision scores are 0 and why this part of the code is commented out. I hoped that using different class weights would solve this problem. However, I could not find any weights that led to more samples with label '1' being recognized.
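
The comparison described in this comment can be sketched roughly as follows. This is not the notebook's code: the data are synthetic, the label matrix `Y` only borrows the gene names KDR and ABL as column labels, and the `alpha=0.01` value and the 0.5 threshold applied to the `MultiTaskLasso` regression output are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import MultiTaskLasso
from sklearn.metrics import accuracy_score, precision_score

rng = np.random.default_rng(0)
n_samples, n_features = 400, 15
genes = ["KDR", "ABL"]  # column names only; the labels below are synthetic

# Synthetic features and imbalanced binary labels that share a common signal.
X = rng.normal(size=(n_samples, n_features))
shared = X[:, 0] + X[:, 1]
Y = np.column_stack([
    (shared + 0.5 * rng.normal(size=n_samples) > 1.5).astype(int)
    for _ in genes
])
X_train, X_test = X[:300], X[300:]
Y_train, Y_test = Y[:300], Y[300:]

# Multi-task random forest: a single forest fit on the whole label matrix.
multi_rf = RandomForestClassifier(n_estimators=100, random_state=0)
multi_rf.fit(X_train, Y_train)
Y_pred_multi = multi_rf.predict(X_test)

# Baseline: one forest per gene, fit independently.
Y_pred_single = np.column_stack([
    RandomForestClassifier(n_estimators=100, random_state=0)
    .fit(X_train, Y_train[:, j])
    .predict(X_test)
    for j in range(len(genes))
])

# MultiTaskLasso is a regressor, so its output is thresholded (assumed 0.5).
mtl = MultiTaskLasso(alpha=0.01).fit(X_train, Y_train.astype(float))
Y_pred_mtl = (mtl.predict(X_test) > 0.5).astype(int)

predictions = {
    "multi-task RF": Y_pred_multi,
    "per-gene RF": Y_pred_single,
    "multi-task Lasso": Y_pred_mtl,
}
for j, gene in enumerate(genes):
    for name, Y_pred in predictions.items():
        acc = accuracy_score(Y_test[:, j], Y_pred[:, j])
        # zero_division=0 reports precision 0 when no positives are predicted.
        prec = precision_score(Y_test[:, j], Y_pred[:, j], zero_division=0)
        print(f"{gene:>4} | {name:<16} | accuracy={acc:.3f} precision={prec:.3f}")
```

On imbalanced labels like these, accuracy can look high even when few or no positives are predicted, which is why precision is printed alongside it.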

dhimmel commented 7 years ago

> Several methods support multi-task classification, including random forest. A multi-task version of Lasso also exists. These two methods are implemented in this notebook. The methods are evaluated using accuracy and precision scores.

Cool!

> Unfortunately, I am not sure whether this notebook has enough findings to be a pull request.

Always enough to be a pull request! I'll take a look.

dhimmel commented 7 years ago

Nice to see the MultiTaskLasso classifier in use.

> All methods seem to fail to recognize true positives.

Your intuition is spot on. It looks like every sample is being classified as negative. Hence, accuracy is just the proportion of samples that are negative. I'm guessing your balanced accuracy would be 0.5 and kappa would be 0 (i.e. no predictive ability). I think part of the problem could be that you're not trying different alpha (regularization strength) values. Usually, we use GridSearchCV to try a range of alpha values. Although GridSearchCV will make things slower, I think optimizing alpha is the place to start.
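
As a rough illustration of both points, the sketch below first checks that an all-negative predictor gets a balanced accuracy of 0.5 and a kappa of 0, then searches over alpha for `MultiTaskLasso` with `GridSearchCV`. Everything here is an assumption for demonstration: the data are synthetic, the alpha grid is arbitrary, and `mean_balanced_accuracy` is a hypothetical scorer that thresholds the regression output at 0.5 and averages balanced accuracy across genes.

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso
from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
n_samples, n_features, n_genes = 300, 20, 3

# Synthetic data: each gene's label depends on the first feature plus noise,
# with a threshold chosen so that positives are the minority class.
X = rng.normal(size=(n_samples, n_features))
signal = X[:, [0]] + 0.5 * rng.normal(size=(n_samples, n_genes))
Y = (signal > 1.0).astype(int)

# A classifier that predicts negative for every sample.
y_true = Y[:, 0]
all_negative = np.zeros_like(y_true)
baseline_bal_acc = balanced_accuracy_score(y_true, all_negative)
baseline_kappa = cohen_kappa_score(y_true, all_negative)
print("all-negative balanced accuracy:", baseline_bal_acc)
print("all-negative kappa:", baseline_kappa)


def mean_balanced_accuracy(estimator, X_eval, Y_eval):
    """Hypothetical scorer: threshold at 0.5, average over genes."""
    Y_pred = (estimator.predict(X_eval) > 0.5).astype(int)
    Y_eval = np.asarray(Y_eval).astype(int)
    scores = [
        balanced_accuracy_score(Y_eval[:, j], Y_pred[:, j])
        for j in range(Y_eval.shape[1])
    ]
    return float(np.mean(scores))


# Arbitrary illustrative grid; a real search would likely span more values.
param_grid = {"alpha": [0.001, 0.01, 0.1, 1.0]}
search = GridSearchCV(
    MultiTaskLasso(max_iter=5000),
    param_grid,
    scoring=mean_balanced_accuracy,
    cv=3,
)
search.fit(X, Y.astype(float))
print("best alpha:", search.best_params_["alpha"])
print("best mean balanced accuracy:", search.best_score_)
```

With a large alpha the coefficients shrink to zero and the model predicts the label mean for every sample, which falls below the threshold and reproduces the all-negative behavior. Scoring the search with balanced accuracy rather than plain accuracy keeps that degenerate setting from looking like the best one.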