cognoma / machine-learning

Machine learning for Project Cognoma

Decisions required to reach a minimum viable product #44

Open dhimmel opened 8 years ago

dhimmel commented 8 years ago

We're nearing the point where we'll need to implement a machine learning module to execute user queries. We're looking to create a minimum viable product. We can expand functionality later, but for now let's focus on the simplest and most succinct implementation. There are several decisions to make:

  1. Classifier: which classifiers should we support? If we want to support only a single classifier for now, which one?
  2. Predictions: do we want to return probabilities, scores, or class predictions?
  3. Threshold: do we want to report performance measures that depend on a single classification threshold? Or do we want to report measures that span thresholds?
  4. Testing: Do we want to use a testing partition in addition to cross-validation? If so, do we refit a model on all observations?
  5. Features: should we include covariates in addition to expression features (see #21)?
  6. Feature selection: Do we want to perform any feature selection?
  7. Feature extraction: do we want to perform feature extraction, such as PCA (see #43)?

So let's work out these choices, with a focus on simplicity.

dhimmel commented 8 years ago

Here are my thoughts:

  1. Classifier: sklearn.linear_model.SGDClassifier with a grid search to find the optimal l1_ratio and alpha. See 2.TCGA-MLexample.ipynb for an example, and the sketch after this list.
  2. Predictions: let's return all three, using the object names probability, score, and class under a predictions key. The frontend should handle cases where probability is absent.
  3. Threshold: Both.
  4. Testing: Let's hold out 10% for testing.
  5. Features: deferring this decision pending the maturity of #21.
  6. Feature selection: let's do MAD (median absolute deviation) feature selection down to 8000 genes, based on @yl565's findings in https://github.com/cognoma/machine-learning/issues/22#issuecomment-238113032. This should help speed up fitting the elastic net without much loss of performance.
  7. Feature extraction: deferring this decision pending the maturity of #43.

@gwaygenomics, @yl565, @stephenshank: do you agree?

gwaybio commented 8 years ago

Can you clarify what you mean by number 3?

Or do we want to report measures that span thresholds?

Like AUROC?

dhimmel commented 8 years ago

By "span thresholds" I'm referring to any measure computed from predicted probabilities/scores, such as AUROC or AUPRC. By "single classification threshold", I'm referring to any measure computed from predicted classes, such as precision, recall, accuracy, or F1 score.

gwaybio commented 8 years ago

Got it. Then yes, this all looks good to me.

yl565 commented 8 years ago

+1

htcai commented 8 years ago

Sounds good!