cognoma / machine-learning

Machine learning for Project Cognoma

First attempt at processing covariate information. #46

Closed stephenshank closed 8 years ago

stephenshank commented 8 years ago

All feedback is welcome. Also going to get some discussion going in #21.

dhimmel commented 8 years ago

In my previous review #46 should be #47. Oops!

stephenshank commented 8 years ago

covariates.tsv is < 3 MB, so we should be able to track it. If there is a strong preference for where it should reside, let me know; otherwise I will put it at the top level in the next PR.

Edit: On second thought, I will create a covariates directory with this notebook and put the tsv file there.

dhimmel commented 8 years ago

I would suggest perhaps making this notebook 1 in this repo and writing covariates.tsv to a data directory.

However, I'm starting to think that this work belongs in cancer-data? @gwaygenomics and @stephenshank, what do you think about adding it to the notebook pipeline in cancer-data?

yl565 commented 8 years ago

I'm not sure if it applies here, but I often replace missing values with predictions from the other variables.
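The idea of filling a missing value with a prediction from the other variables can be sketched as a simple regression-based imputation. This is a stdlib-only toy illustration; the column names (`mutation_burden`, `age`) and values are hypothetical, not from the actual covariates file:

```python
# Toy covariates: predict a missing "age" from "mutation_burden"
# (hypothetical columns, for illustration only).
rows = [
    {"mutation_burden": 10.0, "age": 40.0},
    {"mutation_burden": 20.0, "age": 50.0},
    {"mutation_burden": 30.0, "age": 60.0},
    {"mutation_burden": 25.0, "age": None},  # missing value to impute
]

# Fit a one-variable least-squares line on the complete rows.
complete = [r for r in rows if r["age"] is not None]
xs = [r["mutation_burden"] for r in complete]
ys = [r["age"] for r in complete]
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar

# Replace each missing value with the model's prediction.
for r in rows:
    if r["age"] is None:
        r["age"] = intercept + slope * r["mutation_burden"]

print(rows[-1]["age"])  # 55.0 for this toy data
```

In practice you would fit a multivariate model on all complete rows, but the mechanics are the same: train on observed values, predict the missing ones.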

yl565 commented 8 years ago

Is this information supposed to be used for mutation classification? If so, perhaps it is better to handle the NaNs at the classification stage, since each algorithm may have different preferences?

stephenshank commented 8 years ago

@dhimmel I was actually thinking the same thing. I would be happy to move this over to cancer-data as the latest notebook, with covariates.tsv being placed in data so that it is tracked. Your suggestion above reduced file size to ~ 1.2 MB. Does this sound okay?

dhimmel commented 8 years ago

> Is this information supposed to be used for mutation classification? If so, perhaps it is better to handle the NaNs at the classification stage, since each algorithm may have different preferences?

@yl565, are there any sklearn models that handle missing values? What do you mean by "handle the NaNs at the classification stage"?

> file size to ~ 1.2 MB. Does this sound okay?

@stephenshank, sounds great.

> I would be happy to move this over to cancer-data as the latest notebook, with covariates.tsv being placed in data so that it is tracked.

Okay let's move this to cancer-data. You should open a new PR there that references this pull request.

yl565 commented 8 years ago

@dhimmel sklearn.preprocessing.Imputer handles missing values and can be included in a classification pipeline. I mean that people doing mutation classification may prefer to process the missing values themselves rather than work with already pre-processed data (at least I would...). There are different ways to deal with missing data, and they could interact with classification algorithms differently.
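For reference, the column-wise mean imputation that sklearn's Imputer performs with `strategy='mean'` can be sketched in plain Python (a stdlib-only sketch; the matrix here is made up for illustration, and in newer sklearn versions this functionality lives in `sklearn.impute.SimpleImputer`):

```python
import math

def impute_mean(matrix):
    """Replace NaNs in each column with that column's mean over observed
    values, mirroring sklearn.preprocessing.Imputer(strategy='mean')."""
    n_cols = len(matrix[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in matrix if not math.isnan(row[j])]
        means.append(sum(observed) / len(observed))
    return [
        [means[j] if math.isnan(row[j]) else row[j] for j in range(n_cols)]
        for row in matrix
    ]

nan = float("nan")
X = [[1.0, 2.0],
     [nan, 4.0],
     [3.0, nan]]
print(impute_mean(X))  # [[1.0, 2.0], [2.0, 4.0], [3.0, 3.0]]
```

Because the imputer is just another transform step, it can sit inside an sklearn Pipeline ahead of the classifier, which is what lets each user swap in their preferred strategy.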

yl565 commented 8 years ago

A recommended way to deal with missing values could be set as the default, and it would be nice to have the option to get the raw, unpreprocessed data.

dhimmel commented 8 years ago

> There are different ways to deal with missing data, and they could interact with classification algorithms differently.

Interesting... how about we perform imputation in cancer-data resulting in two possible covariate datasets: covariates.tsv and covariates-imputed.tsv (which would use a single imputation method we decide on). Then machine learning users would be free to select either the imputed covariates or perform imputation themselves.
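The two-file scheme above could be sketched roughly as follows with the stdlib csv module. The row data and column names are hypothetical, and mean imputation stands in for whichever single method gets decided on:

```python
import csv

# Hypothetical covariate rows; "NA" marks missing entries.
rows = [
    {"sample_id": "S1", "age": "62", "mutation_burden": "120"},
    {"sample_id": "S2", "age": "NA", "mutation_burden": "80"},
]
numeric_cols = ["age", "mutation_burden"]

def write_tsv(path, records):
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=list(records[0].keys()), delimiter="\t")
        writer.writeheader()
        writer.writerows(records)

# 1) Raw covariates, missing values left as "NA".
write_tsv("covariates.tsv", rows)

# 2) Imputed covariates: fill "NA" with the column mean
#    (one possible method, chosen here only for illustration).
imputed = [dict(r) for r in rows]
for col in numeric_cols:
    observed = [float(r[col]) for r in rows if r[col] != "NA"]
    mean = sum(observed) / len(observed)
    for r in imputed:
        if r[col] == "NA":
            r[col] = str(mean)

write_tsv("covariates-imputed.tsv", imputed)
```

Downstream machine-learning users would then read whichever file matches their preference, or load covariates.tsv and impute however they like.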

yl565 commented 8 years ago

That would be nice!

gwaybio commented 8 years ago

Really nice analysis! I thought the age vs. mutation burden by tissue plot was very interesting... However, I'm not sure it would be safe to impute age this way.

In general, my thoughts on imputation are as follows:

  1. We should definitely be sure to separate imputed covariates from non-imputed
    • I believe that our target audience (cancer biologists) will not view imputed variables kindly
  2. Survival associated covariates (age, vital status, time to recurrence, survival time) are tough to impute and even tougher to convince people we are imputing them correctly
    • Survival analyses are typically performed using these data, and they have well-studied methods for dealing with missing or censored data
  3. Missing data is a common problem for clinical metrics (even more of a problem for electronic health records!) so imputation is more of a research question than a part of our minimum viable product (see #44 )
  4. Imputing gender should be easy enough, let's at least try this one!

That being said, it could still be an interesting exercise and research question to try to impute all of the missing info.